Create a Typical Ruia Spider

Let's fetch some news from Hacker News in four steps:

  • Define item
  • Test item
  • Write spider
  • Run

Step 1: Define Item

After analyzing HTML structure, we define the following data item.

Analyzing HTML structure is an important skill for a spider engineer. Ruia assumes you already have it, so this guide won't cover it here.

from ruia import Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

It's easy to understand: we want to extract items from the HTML, and each item contains two fields: title and url.

Wait! What is target_item?

target_item is a built-in Ruia field. It indicates that each HTML element matched by its selector contains one item. In this example, we are crawling a Hacker News listing page, and there are many news entries on one page. target_item tells Ruia which elements to focus on when extracting the other fields.
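To see how target_item scopes extraction, here is a minimal, self-contained sketch. The HTML string is simplified, made-up markup that merely matches the selectors above; real Hacker News pages are more complex.

import asyncio

from ruia import Item, TextField, AttrField

# Hypothetical markup mimicking a Hacker News listing: two rows,
# each carrying one story link.
HTML = '''
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
</table>
'''


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


async def main():
    # Each tr.athing element becomes one item; title and url are
    # extracted relative to that element, not the whole page.
    async for item in HackerNewsItem.get_items(html=HTML):
        print(item.title, item.url)


asyncio.run(main())

Run it and you should get two items, one per tr.athing row.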

Step 2: Test Item

Ruia is a low-coupling web crawling framework: each class can be used separately in your project. You can even write a simple spider with only ruia.Item, ruia.TextField, and ruia.AttrField. This makes it convenient to test HackerNewsItem on its own.

import asyncio

from ruia import Item, TextField, AttrField


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


async def test_item():
    url = 'https://news.ycombinator.com/news?p=1'
    async for item in HackerNewsItem.get_items(url=url):
        print('{}: {}'.format(item.title, item.url))


if __name__ == '__main__':
    # Python 3.7+ required.
    asyncio.run(test_item())

    # For Python 3.6
    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(test_item())

Then watch the output appear in your console.
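You should see one line per story, shaped like this (the titles and URLs below are illustrative placeholders, not real output):

Example Story Title: https://example.com/story
Another Story Title: https://example.com/another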

Step 3: Write Spider

ruia.Spider is used to control requests and responses, including concurrency control. This matters: crawl too aggressively and the server will ban you within a minute. By default, the concurrency is 3.

"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import Item, TextField, AttrField, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


class HackerNewsSpider(Spider):
    concurrency = 2
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(1, 4)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia build-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')

Concurrency is controlled by a single class attribute on your Spider subclass. In this example, we crawl with two coroutines. If coroutines are new to you: as a crawler engineer you have probably used a thread pool, and a pool of coroutines behaves much the same while being more efficient for I/O-bound work like crawling.
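Beyond concurrency, a Ruia spider also accepts a request_config dictionary for per-request behavior. The sketch below uses the RETRIES, DELAY, and TIMEOUT keys from Ruia's documentation; the concrete values are illustrative assumptions, not recommendations.

class PoliteHackerNewsSpider(HackerNewsSpider):
    # One request at a time, with a pause between requests.
    concurrency = 1
    request_config = {
        'RETRIES': 3,   # retry a failed request up to 3 times
        'DELAY': 1,     # wait 1 second between requests
        'TIMEOUT': 10,  # give up on a request after 10 seconds
    }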

parse(self, response) is the entry point of a spider. After the spider starts, it sends requests to the web server; once a response arrives, ruia.Spider calls parse to extract data from the HTML source code.
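Because parse is an async generator, you decide which items reach process_item. As an illustrative variation (not part of the original example), this subclass yields only the stories whose titles mention Python:

class PythonNewsSpider(HackerNewsSpider):
    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            # Forward only the items whose title mentions Python.
            if 'python' in str(item.title).lower():
                yield item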

Step 4: Run

Now everything is ready. Run!

import aiofiles

from ruia import Item, TextField, AttrField, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


class HackerNewsSpider(Spider):
    concurrency = 2
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(1, 4)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia build-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()

Note: do not call Spider.start() inside an await expression. It is a plain synchronous classmethod that creates and manages the event loop for you.
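If your application already runs its own event loop, recent Ruia versions also expose an asynchronous entry point, Spider.async_start. A minimal sketch, assuming that classmethod is available in your installed version:

import asyncio


async def main():
    # Runs the spider inside an already-running event loop.
    await HackerNewsSpider.async_start()


asyncio.run(main())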

That's it: you just created a spider in a single Python file!