Define Data with Fields

Overview

Fields are used to extract value from HTML code.

Ruia supports the following fields:

ElementField: extract LXML element(s) from the selected HTML element
TextField: extract text string of the selected HTML element
AttrField: extract an attribute of the selected HTML element
HtmlField: extract raw HTML code of the selected HTML element
RegexField: use standard library re for better performance

Note

All the parameters of fields are keyword arguments.

ElementField

ElementField first select an HTML element by CSS Selector or XPath Selector, then get the LXML element(s) from the selected element.

Parameters

attr: str, required, the name of the attribute you want to extract
css_select: str, alternative, match HTML element(s) with CSS Selector
xpath_select: str, alternative, match HTML element(s) with XPath Selector
default: str, recommended, the default value if nothing matched in HTML element
many: bool, optional, extract a list if True

Example

import ruia

from lxml import etree

HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''

html = etree.HTML(HTML)

def test_element_field():
    ul = ruia.ElementField(css_select="ul")
    assert len(ul.extract(html_etree=html).xpath('//li')) == 3

TextField

TextField first select an HTML element by CSS Selector or XPath Selector, then get the text value of the selected element.

Parameters

css_select: str, alternative, match HTML element(s) with CSS Selector
xpath_select: str, alternative, match HTML element(s) with XPath Selector
default: str, recommended, the default value if nothing matched in HTML element
many: bool, optional, extract a list if True

Example

import ruia

from lxml import etree

HTML = '''
<body>
<div class="title">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''


html = etree.HTML(HTML)

def test_text_field():
    title_field = ruia.TextField(css_select='.title', default='Untitled')
    assert title_field.extract(html_etree=html) == 'Ruia Documentation'
    tag_field = ruia.TextField(css_select='.tag', default='No tag', many=True)
    assert tag_field.extract(html_etree=html) == ['easy', 'fast', 'powerful']

AttrField

TextField first select an HTML element by CSS Selector or XPath Selector, then get the attribute value of the selected element.

Parameters

attr: str, required, the name of the attribute you want to extract
css_select: str, alternative, match HTML element(s) with CSS Selector
xpath_select: str, alternative, match HTML element(s) with XPath Selector
default: str, recommended, the default value if nothing matched in HTML element
many: bool, optional, extract a list if True

Example

import ruia

from lxml import etree

HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''

html = etree.HTML(HTML)

def test_attr_field():
    title = ruia.AttrField(css_select='.title', attr='href', default='Untitled')
    assert title.extract(html_etree=html) == '/'
    tags = ruia.AttrField(css_select='.tag', attr='href', default='No tag', many=True)
    assert tags.extract(html_etree=html)[0] == './easy.html'

HtmlField

TextField first select an HTML element by CSS Selector or XPath Selector, then get the raw HTML code of the selected element.

If there's some spaces or some text outside any HTML elements between this element and next element, then this part of text will also inside the return value. It's an unstable feature, perhaps in later versions the outside text will be remove by default.

Parameters

css_select: str, alternative, match HTML element(s) with CSS Selector
xpath_select: str, alternative, match HTML element(s) with XPath Selector
default: str, recommended, the default value if nothing matched in HTML element
many: bool, optional, extract a list if True

Example

import ruia

from lxml import etree

HTML = '''
<body>
<div class="title">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''

html = etree.HTML(HTML)

def test_html_field():
    title = ruia.HtmlField(css_select='.title', default='Untitled')
    assert title.extract(html_etree=html) == '<div class="title" href="/">Ruia Documentation</div>\n'
    tags = ruia.HtmlField(css_select='.tag', default='No tag', many=True)
    assert tags.extract(html_etree=html)[1] == '<li class="tag" href="./fast.html">fast</li>\n    '

RegexField

TextField do not parse html structure, it directly use python standard library re. If your spider meets performance limitation, try RegexField. However, ruia is based on asyncio, you will seldom meet performance limitation!

RegexField has a complex behaviour:

if no group: return the whole matched string
if regex has a group: return the group value
if regex has multiple groups: return a list a string
if regex has named groups, no matter one or more: return a dict, whose key and value are both string
if many=True, return a list of above values

Parameters

re_select: str, required, match HTML element(s) with regular expression
default: str, recommended, the default value if nothing matched in HTML element
many: bool, optional, extract a list if True

Example

import ruia

from lxml import etree

HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''

html = etree.HTML(HTML)

def test_regex_field():
    title = ruia.RegexField(re_select='<div class="title" href="(.*?)">(.*?)</div>')
    assert title.extract(html=HTML)[0] == '/'
    assert title.extract(html=HTML)[1] == 'Ruia Documentation'
    tags = ruia.RegexField(re_select='<li class="tag" href="(?P<href>.*?)">(?P<text>.*?)</li>', many=True)
    result = tags.extract(html=HTML)
    assert isinstance(result, list)
    assert len(result) == 3
    assert isinstance(result[0], dict)
    assert result[0]['href'] == './easy.html'

About Parameter many

Parameter many=False indicates if the field will extract one value or multiple values from HTML source code.

For example, one Github Issue has many tags, We can use Item.get_items to get multiple values of tags, but that means an extra class definition. Parameter many aims to solve this problem.

A field is default by many=False, that means, for TextField, AttrField and HtmlField, Field.extract(*, **) will always return a string, and RegexField will return a string or a list or dict, depending on whether there are groups in the regular expression. We can consider it with a 'singular number'.

With many=True, each field will return a 'plural', that is, return a list.