Skip to content

How to Write a Plugins

Plugins are used to package some common functions as a third-party model. Ruia allow developers to implement third-party extensions in the following ways:

  • by using Middleware class
  • by overwriting some core modules(just like Spider, Request etc...)

In the previous section, we talked about Middleware. It is used to process before request and after response. Then, we implemeneted a function, that is to add User-Agent in request headers.

Perhaps any crawler need such a function, to add User-Agent randomly, so, let's packaging this function as a third-party extension.

Do it!

Creating a project

The project name is ruia-ua. Ruia is based on Python3.6+, so is ruia-ua.

Supposing that you're now in Python 3.6+.

# Install package management tool: pipenv
pip install pipenv
# Create project directory
mkdir ruia-ua
cd ruia-ua
# Install virtual environment
pipenv install 
# Install ruia
pipenv install ruia
# Install aiofiles
pipenv install aiofiles
# Create project directory in the project directory
mkdir ruia_ua
cd ruia_ua 
# Here's your implementation
touch __init__.py   

Directory structure:

ruia-ua
├── LICENSE                 # Open source license
├── Pipfile                 # pipenv management tools 
├── Pipfile.lock
├── README.md               
├── ruia_ua
│   ├── __init__.py         # Main code of your plugin
│   └── user_agents.txt     # some random user_agents
└── setup.py                

First plugin

user_agents.txt contains all kinds of UA, then we only need to use Middleware of ruia to add a random User-Agent before every request.

Here is one implementation:

import os
import random

import aiofiles

from ruia import Middleware

__version__ = "0.0.1"


async def get_random_user_agent() -> str:
    """
    Get a random user agent string.
    :return: Random user agent string.
    """
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    return random.choice(await _get_data('./user_agents.txt', USER_AGENT))


async def _get_data(filename: str, default: str) -> list:
    """
    Get data from all user_agents
    :param filename: filename
    :param default: default value
    :return: data
    """
    root_folder = os.path.dirname(__file__)
    user_agents_file = os.path.join(root_folder, filename)
    try:
        async with aiofiles.open(user_agents_file, mode='r') as f:
            data = [_.strip() for _ in await f.readlines()]
    except:
        data = [default]
    return data


middleware = Middleware()


@middleware.request
async def add_random_ua(spider_ins, request):
    ua = await get_random_user_agent()
    if request.headers:
        request.headers.update({'User-Agent': ua})
    else:
        request.headers = {
            'User-Agent': ua
        }

Now it's high time to upload ruia-ua to community, then all other ruia users are able to use your third-party extension. Sounds great!

Usage

All crawlers can use ruia-ua to add User-Agent automatically.

pip install ruia-ua

Here is an example:

from ruia import AttrField, TextField, Item, Spider
from ruia_ua import middleware as ua_middleware


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/news?p=1', 'https://news.ycombinator.com/news?p=2']
    concurrency = 10

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            print(item.title)


if __name__ == '__main__':
    HackerNewsSpider.start(middleware=ua_middleware)

The implementations of third-party plugins will make developing crawlers easier! Ruia do want your developing and uploading your own third-party plugins!