# Spider Control
`ruia.Spider` controls the whole spider and provides the following features:

- Normalize your code
- Maintain an event loop
- Manage requests and responses
- Control concurrency
- Manage middlewares and plugins
Although creating a spider with only `ruia.Item` works well, we recommend using `ruia.Spider` to build a more powerful spider.
## Normalize your code
`ruia.Spider` requires a class property `start_urls` as the entry point of a spider. Internally, `ruia` iterates over `start_urls` and sends a request for each URL. After receiving the server's response, `ruia` calls `spider.parse(response)`, and this is the main part of your spider.
Here's a simple `parse` example that saves response fields to a text file. We only have to define `start_urls` and implement a `parse` method.
```python
import aiofiles

from ruia import Spider, Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


class HackerNewsSpider(Spider):
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(3)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia built-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')
```
`aiofiles` is a third-party library for working with files asynchronously. It provides the same APIs as Python's standard `open` function.
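For example, reading back the file we just wrote looks almost identical to synchronous code, just with `async with` and `await` (a minimal sketch reusing the `hacker_news.txt` file from above):

```python
import asyncio

import aiofiles


async def read_titles():
    # Same API as the built-in open(), but the I/O calls are awaitable.
    async with aiofiles.open('./hacker_news.txt', 'r') as f:
        print(await f.read())


asyncio.get_event_loop().run_until_complete(read_titles())
```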
Now that we have written a spider, it's time to start crawling.
```python
import aiofiles

from ruia import Spider, Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


class HackerNewsSpider(Spider):
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(3)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia built-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
```
Done. Now your code is more readable and maintainable.
## Send Further Requests
Just crawling the news catalogue probably doesn't satisfy you; next, we will crawl the news itself. Hacker News gathers stories from many websites, so it's not easy to parse every article. For this example, we'll crawl the GitHub Developer Documentation instead.
If you are a `scrapy` user, you may find this essay helpful for migration: Write Spiders like Scrapy.
`Ruia` provides a better way to send further requests with the asynchronous `async`/`await` syntax: it is more readable and more flexible. In any parse method, just `yield` a coroutine, and the coroutine will be processed by `ruia`. Here is a simple piece of pseudo-code:
```python
from ruia import Spider


class MySpider(Spider):
    async def parse(self, response):
        next_response = await self.request(f'{response.url}/next')
        yield self.parse_next_page(next_response, metadata='nothing')

    async def parse_next_page(self, response, metadata):
        print(await response.text())
```
It works well, except when you want to yield many coroutines in a for loop. Look at the following pseudo-code:
```python
from ruia import Spider


class MySpider(Spider):
    async def parse(self, response):
        for i in range(10):
            response = await self.request(f'https://some.site/{i}')
            yield self.parse_next(response)

    async def parse_next(self, response):
        print(await response.text())
```
You will find that the requests in the for loop run synchronously! To solve this problem, `ruia` provides a `multiple_request` method. Here is an example for the GitHub Developer Documentation:
```python
# Target: https://developer.github.com/v3/
from ruia import *


class CatalogueItem(Item):
    target_item = TextField(css_select='.sidebar-menu a')
    title = TextField(css_select='a')
    link = AttrField(css_select='a', attr='href')

    async def clean_link(self, value):
        return f'https://developer.github.com{value}'


class PageItem(Item):
    content = HtmlField(css_select='.content')


class GithubDeveloperSpider(Spider):
    start_urls = ['https://developer.github.com/v3/']
    concurrency = 5

    async def parse(self, response: Response):
        catalogue = []
        async for cat in CatalogueItem.get_items(html=await response.text()):
            if '#' in cat.link:
                continue
            catalogue.append(cat)
        urls = [page.link for page in catalogue][:10]
        async for response in self.multiple_request(urls, is_gather=True):
            title = catalogue[response.index].title
            yield self.parse_page(response, title)

    async def parse_page(self, response, title):
        item = await PageItem.get_item(html=await response.text())
        print(title, len(item.content))


if __name__ == '__main__':
    GithubDeveloperSpider.start()
```
Our crawler starts from `start_urls` and the `parse` method as usual. We collect a list of URLs, then call the `self.multiple_request` method to send further requests. The `multiple_request(urls, **kwargs)` method requires one positional argument, `urls`, a list of strings. Pay attention to the `async for` statement: `multiple_request` returns an asynchronous generator that yields responses. The method also accepts any other keyword arguments that `ruia.Request` accepts.
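For instance, you might forward custom headers to every request in the batch. A minimal sketch (`headers` is a standard `ruia.Request` argument; the `site.com` URLs are hypothetical):

```python
async def parse(self, response):
    urls = [f'https://site.com/{page}' for page in range(10)]
    # Extra keyword arguments are forwarded to each underlying request.
    async for response in self.multiple_request(urls, headers={'User-Agent': 'ruia-spider'}):
        print(response.url)
```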
You may want to use `enumerate` to get the index of each response, like this:
```python
async def parse(self, response):
    urls = [f'https://site.com/{page}' for page in range(10)]
    # Broken: enumerate() cannot handle an asynchronous generator.
    async for index, response in enumerate(self.multiple_request(urls)):
        pass
```
Then you will get an exception telling you that `enumerate` cannot handle an asynchronous generator. Instead, `ruia` sets an `index` property on every response object: `response.index`. It is useful when you want to pass some context to the next parsing method. The order of responses currently matches the order of `urls`, but this is not a stable guarantee; use `response.index` to get a response's position.
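As a minimal sketch of the fix, drop `enumerate` and read the index from the response itself (reusing the hypothetical `site.com` URLs from above):

```python
async def parse(self, response):
    urls = [f'https://site.com/{page}' for page in range(10)]
    async for response in self.multiple_request(urls):
        # response.index is the position of this response's URL in `urls`.
        print(response.index, response.url)
```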
`multiple_request` has another argument, `is_gather`, which indicates whether `ruia` should run the requests together. If `is_gather=True`, the requests run together; if not, they run one by one. `is_gather=True` is usually better, with one exception: suppose a catalogue contains 1000 pages. With `is_gather=True`, we only get the responses after all 1000 requests have finished, which may take too long before parsing can start.
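In that case, running the requests one by one lets parsing begin as soon as the first response arrives. A minimal sketch (the 1000-page catalogue and the `parse_page` callback are hypothetical):

```python
async def parse(self, response):
    urls = [f'https://site.com/{page}' for page in range(1000)]
    # is_gather=False runs the requests one by one and yields each
    # response as it arrives, so parsing starts before all 1000 finish.
    async for response in self.multiple_request(urls, is_gather=False):
        yield self.parse_page(response)

async def parse_page(self, response):
    print(await response.text())
```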
## Concurrency Control
Let's revisit the GitHub Developer spider.
```python
# Target: https://developer.github.com/v3/
from ruia import *


class CatalogueItem(Item):
    target_item = TextField(css_select='.sidebar-menu a')
    title = TextField(css_select='a')
    link = AttrField(css_select='a', attr='href')

    async def clean_link(self, value):
        return f'https://developer.github.com{value}'


class PageItem(Item):
    content = HtmlField(css_select='.content')


class GithubDeveloperSpider(Spider):
    start_urls = ['https://developer.github.com/v3/']
    concurrency = 5

    async def parse(self, response: Response):
        catalogue = []
        async for cat in CatalogueItem.get_items(html=await response.text()):
            catalogue.append(cat)
        for page in catalogue[:20]:
            if '#' in page.link:
                continue
            yield Request(url=page.link, metadata={'title': page.title}, callback=self.parse_page)

    async def parse_page(self, response: Response):
        item = await PageItem.get_item(html=await response.text())
        title = response.metadata['title']
        print(title, len(item.content))


if __name__ == '__main__':
    GithubDeveloperSpider.start()
```
This time, one line has been added:

```python
concurrency = 5
```
Here's a brief introduction to concurrency. Some websites are friendly to crawlers, while others are not. If you visit a website too frequently, the server will ban you. Besides, a good crawler should protect the server rather than crash it; not every server can bear a heavy spider. To protect both sides, we have to control our concurrency. Concurrency here means the number of connections open at any one time; in this case, we set it to 5.
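Conceptually, this limit works like `asyncio`'s semaphore pattern. The sketch below illustrates the general idea only; it is not `ruia`'s actual implementation, and the `fetch` helper is hypothetical:

```python
import asyncio

semaphore = asyncio.Semaphore(5)  # concurrency = 5


async def fetch(url):
    async with semaphore:  # at most 5 coroutines pass this point at once
        await asyncio.sleep(1)  # stand-in for a real HTTP request
        return url


async def main():
    urls = [f'https://site.com/{i}' for i in range(18)]
    results = await asyncio.gather(*[fetch(url) for url in urls])
    print(f'fetched {len(results)} pages')


asyncio.get_event_loop().run_until_complete(main())
```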
Let's take a quick look at the log output:
```text
[2019:01:23 00:01:59]-ruia-INFO spider : Spider started!
[2019:01:23 00:01:59]-ruia-WARNING spider : ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:01:23 00:01:59]-ruia-WARNING spider : ruia tried to use loop.add_signal_handler but it is not implemented on this platform.
[2019:01:23 00:01:59]-Request-INFO request: <GET: https://developer.github.com/v3/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/media/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/oauth_authorizations/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/auth/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/troubleshooting/>
[2019:01:23 00:02:01]-Request-INFO request: <GET: https://developer.github.com/v3/previews/>
Overview 38490
[2019:01:23 00:02:02]-Request-INFO request: <GET: https://developer.github.com/v3/versions/>
OAuth Authorizations API 66565
[2019:01:23 00:02:02]-Request-INFO request: <GET: https://developer.github.com/v3/activity/>
Media Types 8652
[2019:01:23 00:02:02]-Request-INFO request: <GET: https://developer.github.com/v3/activity/events/>
Troubleshooting 2551
[2019:01:23 00:02:02]-Request-INFO request: <GET: https://developer.github.com/v3/activity/events/types/>
API Previews 19537
[2019:01:23 00:02:02]-Request-INFO request: <GET: https://developer.github.com/v3/activity/feeds/>
Other Authentication Methods 6651
[2019:01:23 00:02:03]-Request-INFO request: <GET: https://developer.github.com/v3/activity/notifications/>
Versions 1344
Feeds 14090
[2019:01:23 00:02:03]-Request-INFO request: <GET: https://developer.github.com/v3/activity/starring/>
Activity 2178
[2019:01:23 00:02:04]-Request-INFO request: <GET: https://developer.github.com/v3/activity/watching/>
[2019:01:23 00:02:05]-Request-INFO request: <GET: https://developer.github.com/v3/checks/>
Events 11844
Starring 55228
[2019:01:23 00:02:05]-Request-INFO request: <GET: https://developer.github.com/v3/checks/runs/>
[2019:01:23 00:02:05]-Request-INFO request: <GET: https://developer.github.com/v3/checks/suites/>
Event Types & Payloads 1225037
Notifications 65679
Watching 35775
Checks 7379
Check Runs 116607
[2019:01:23 00:02:06]-ruia-INFO spider : Stopping spider: ruia
Check Suites 115330
[2019:01:23 00:02:06]-ruia-INFO spider : Total requests: 18
[2019:01:23 00:02:06]-ruia-INFO spider : Time usage: 0:00:07.342048
[2019:01:23 00:02:06]-ruia-INFO spider : Spider finished!
```
Focus on the first several lines.
```text
[2019:01:23 00:01:54]-Request-INFO request: <GET: https://developer.github.com/v3/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/media/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/oauth_authorizations/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/auth/>
[2019:01:23 00:02:00]-Request-INFO request: <GET: https://developer.github.com/v3/troubleshooting/>
[2019:01:23 00:02:05]-Request-INFO request: <GET: https://developer.github.com/v3/previews/>
Overview 38490
[2019:01:23 00:02:07]-Request-INFO request: <GET: https://developer.github.com/v3/versions/>
OAuth Authorizations API 66565
```
The first request fetches the catalogue page. Then our spider sends 5 requests at almost the same time, at `[00:02:00]`. Five seconds later, at `[00:02:05]`, our spider receives a response, parses it immediately, and sends another request. Two seconds later, at `[00:02:07]`, it receives another response, sends another request, and again parses the response immediately.
That is to say, at any moment there are at most 5 connections between the spider and the server. That is concurrency control. Notice that our spider sent 5 requests at the same time! Thanks to Python's `asyncio` library, we can write asynchronous crawlers easily and quickly, and coroutines are far more lightweight than threads.
## Use Middleware
`Ruia` provides two main ways to extend itself. First, let's talk about middlewares. A middleware processes a request before it is sent and processes a response after it is received. In a word, it is something between your spider and the server.
Here is a simple middleware named `ruia-ua`; it automatically adds a random User-Agent to your requests. First, install `ruia-ua`:
```shell
pip install ruia-ua
```
Then, add it to your spider.
```python
from ruia import AttrField, TextField, Item, Spider
from ruia_ua import middleware


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/news?p=1', 'https://news.ycombinator.com/news?p=2']
    concurrency = 10

    async def parse(self, res):
        async for item in HackerNewsItem.get_items(html=res.html):
            print(item.title)


if __name__ == '__main__':
    HackerNewsSpider.start(middleware=middleware)
```
`ruia.Spider.start` accepts a `middleware` argument, which can be either a single middleware or a list of middlewares.
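You can also write your own middleware with `ruia`'s `Middleware` class. A minimal sketch (the handler names are made up for illustration, and the `spider_ins` parameter follows recent `ruia` versions, so the exact signatures may differ in older releases):

```python
from ruia import Middleware

middleware = Middleware()


@middleware.request
async def add_header(spider_ins, request):
    # Runs before each request is sent.
    request.headers = request.headers or {}
    request.headers.update({'User-Agent': 'my-ruia-spider'})


@middleware.response
async def log_response(spider_ins, request, response):
    # Runs after each response is received.
    print(f'received a response for {request.url}')
```

Pass it to your spider the same way: `HackerNewsSpider.start(middleware=middleware)`.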
## Use Plugin
If you want finer control over your spider, try some plugins. `ruia-pyppeteer` is a `ruia` plugin for loading JavaScript. First, install `ruia-pyppeteer`:
```shell
pip install ruia_pyppeteer
# New features
pip install git+https://github.com/ruia-plugins/ruia-pyppeteer
```
Note: when you use `load_js`, a recent version of Chromium (~100 MB) will be downloaded. This only happens once. Here is a simple example showing how to load JavaScript:
```python
import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request


async def main():
    request = Request("https://www.jianshu.com/", load_js=True)
    response = await request.fetch()
    print(await response.text())


asyncio.get_event_loop().run_until_complete(main())
```
Here is an example of using it in your spider:
```python
from ruia import AttrField, TextField, Item
from ruia_pyppeteer import PyppeteerSpider as Spider
from ruia_pyppeteer import PyppeteerRequest as Request


class JianshuItem(Item):
    target_item = TextField(css_select='ul.list>li')
    author_name = TextField(css_select='a.name')
    author_url = AttrField(attr='href', css_select='a.name')

    async def clean_author_url(self, author_url):
        return f"https://www.jianshu.com{author_url}"


class JianshuSpider(Spider):
    start_urls = ['https://www.jianshu.com/']
    concurrency = 10
    # Load JS on the first request
    load_js = True

    async def parse(self, response):
        async for item in JianshuItem.get_items(html=await response.text()):
            # Load JS by using PyppeteerRequest
            yield Request(url=item.author_url, load_js=self.load_js, callback=self.parse_item)

    async def parse_item(self, response):
        print(response)


if __name__ == '__main__':
    JianshuSpider.start()
```