Item Data Cleaning
If you visit other's blog,
you may find that the titles of articles are often such a format:
Ruia is a great framework | Ruia's blog.
Open the inspector of browser,
you will find such an element:
<title>Ruia is a great framework | Ruia's blog</title>
The title element contains two parts: the actual title of this article and the site name of the blog.
Now we just want to get the actual title
Ruia is a great framework.
We can write a statement in
parse method, like:
from ruia import Item, TextField class MyItem(Item): title = TextField(css_select='title') async def parse(self, response): title = MyItem.get_item(response.html).title title = title.split(' | ') with open('data.txt', mode='a') as file: file.writelines([title])
It works well. However, in ruia, we want to separate the two processes:
- Data acquisition, for parsing HTML and create structured data;
- Data processing, for data persistence or some other operations.
We provide a better way for data cleaning.
from ruia import Item, TextField class MyItem(Item): title = TextField(css_select='title') def clean_title(self, value): value = value.split(' | ') return value async def parse(self, response): title = MyItem.get_item(response.html).title with open('data.txt', mode='a') as file: file.writelines([title])
Now we get a better item.
We just get the property
and we can get a pure title we want.
ruia will automatically recognize methods starts with
If there's a field named
then its corresponding data cleaning method is
Just add a prefix
clean_ is okay.
The default clean method of each field is just return the string itself.
Before data cleaning, fields are all pure python strings (sometimes a list or a dict of pure python strings).
If you want
item.index to return a python integer,
clean_index method to
Separate the two processes makes our code more readable.