Item Data Cleaning
If you visit other's blog,
you may find that the titles of articles are often such a format:
Ruia is a great framework | Ruia's blog
.
Open the inspector of browser,
you will find such an element:
<title>Ruia is a great framework | Ruia's blog</title>
The title element contains two parts: the actual title of this article and the site name of the blog.
Now we just want to get the actual title Ruia is a great framework
.
We can write a statement in parse
method, like:
from ruia import Item, TextField
class MyItem(Item):
title = TextField(css_select='title')
async def parse(self, response):
title = MyItem.get_item(await response.text()).title
title = title.split(' | ')[0]
with open('data.txt', mode='a') as file:
file.writelines([title])
It works well. However, in ruia, we want to separate the two processes:
- Data acquisition, for parsing HTML and create structured data;
- Data processing, for data persistence or some other operations.
By separating data acquisition and data processing, our code can be more readable. We provide a better way for data cleaning.
from ruia import Item, TextField
class MyItem(Item):
title = TextField(css_select='title')
def clean_title(self, value):
value = value.split(' | ')[0]
return value
async def parse(self, response):
title = MyItem.get_item(await response.text()).title
with open('data.txt', mode='a') as file:
file.writelines([title])
Now we get a better item.
We just get the property title
of item
, like item.title
,
and we can get a pure title we want.
ruia
will automatically recognize methods starts with clean_
.
If there's a field named the_field
,
then its corresponding data cleaning method is clean_the_field
.
Just add a prefix clean_
is okay.
The default clean method of each field is just return the string itself.
Before data cleaning, fields are all pure python strings (sometimes a list or a dict of pure python strings).
If you want item.index
to return a python integer,
please define clean_index
method to return int(value)
.
Now let's focus on such a HTML code. For some reason, perhaps for css layout, there are some empty items. We want 5 movies, while ruia get 7. Of course you can delete useless items in parse function, however, it violated our principle that we should separate get items and save items.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div class="container">
<div class="movie"><a class="title">Movie 1</a><span class="star">3</span></div>
<div class="movie"><a class="title">Movie 2</a><span class="star">5</span></div>
<div class="movie"><a class="title">Movie 3</a><span class="star">2</span></div>
<div class="movie"><a class="title">Movie 4</a><span class="star">1</span></div>
<div class="movie"><a class="title">Movie 5</a><span class="star">5</span></div>
<div class="movie"><a class="title"></a><span class="star"></span></div>
<div class="movie"><a class="title"></a><span class="star"></span></div>
</div>
</body>
</html>
Ruia use an Exception
to solve this problem.
In clean_*
functions, we can raise a ruia.IgnoreThisItem
to skip useless items.
Here's a snippet as a demo.
from ruia import Item, TextField, IgnoreThisItem
class MyItem(Item):
target_item = TextField(css_select='.movie')
title = TextField(css_select=".title")
star = TextField(css_select=".star")
@staticmethod
async def clean_title(value):
if not value:
raise IgnoreThisItem
return value
async def main():
items = list()
async for item in MyItem.get_items(html=HTML):
items.append(item)
assert len(items) == 5
Now, the length of items is 5, instead of 7.