Define Data with Fields
Overview
Fields are used to extract values from HTML code.
Ruia supports the following fields:
- ElementField: extract LXML element(s) from the selected HTML element
- TextField: extract text string of the selected HTML element
- AttrField: extract an attribute of the selected HTML element
- HtmlField: extract raw HTML code of the selected HTML element
- RegexField: extract values with the standard library re module, for better performance
Note
All the parameters of fields are keyword arguments.
ElementField
ElementField first selects an HTML element with a CSS selector or an XPath selector,
then returns the LXML element(s) of the selection.
Parameters
- css_select: str, alternative, match HTML element(s) with CSS Selector
- xpath_select: str, alternative, match HTML element(s) with XPath Selector
- default: str, recommended, the default value if nothing matched in HTML element
- many: bool, optional, extract a list if True
Example
import ruia
from lxml import etree
HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''
html = etree.HTML(HTML)
def test_element_field():
    ul = ruia.ElementField(css_select="ul")
    assert len(ul.extract(html_etree=html).xpath('//li')) == 3
TextField
TextField first selects an HTML element with a CSS selector or an XPath selector,
then gets the text value of the selected element.
Parameters
- css_select: str, alternative, match HTML element(s) with CSS Selector
- xpath_select: str, alternative, match HTML element(s) with XPath Selector
- default: str, recommended, the default value if nothing matched in HTML element
- many: bool, optional, extract a list if True
Example
import ruia
from lxml import etree
HTML = '''
<body>
<div class="title">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''
html = etree.HTML(HTML)
def test_text_field():
    title_field = ruia.TextField(css_select='.title', default='Untitled')
    assert title_field.extract(html_etree=html) == 'Ruia Documentation'
    tag_field = ruia.TextField(css_select='.tag', default='No tag', many=True)
    assert tag_field.extract(html_etree=html) == ['easy', 'fast', 'powerful']
AttrField
AttrField first selects an HTML element with a CSS selector or an XPath selector,
then gets the attribute value of the selected element.
Parameters
- attr: str, required, the name of the attribute you want to extract
- css_select: str, alternative, match HTML element(s) with CSS Selector
- xpath_select: str, alternative, match HTML element(s) with XPath Selector
- default: str, recommended, the default value if nothing matched in HTML element
- many: bool, optional, extract a list if True
Example
import ruia
from lxml import etree
HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''
html = etree.HTML(HTML)
def test_attr_field():
    title = ruia.AttrField(css_select='.title', attr='href', default='Untitled')
    assert title.extract(html_etree=html) == '/'
    tags = ruia.AttrField(css_select='.tag', attr='href', default='No tag', many=True)
    assert tags.extract(html_etree=html)[0] == './easy.html'
HtmlField
HtmlField first selects an HTML element with a CSS selector or an XPath selector,
then gets the raw HTML code of the selected element.
If there is whitespace or text outside any HTML element between this element and the next one, that text is also included in the return value. This is an unstable feature; later versions may remove the outside text by default.
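The trailing text described above corresponds to what LXML and the standard library's ElementTree call an element's "tail". The following is a minimal sketch using only xml.etree.ElementTree as a stand-in for LXML; it is not Ruia's implementation, and html_without_tail is a hypothetical helper introduced here for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Same list markup as the examples in this document, trimmed to two items.
SNIPPET = (
    '<ul>\n'
    '    <li class="tag" href="./easy.html">easy</li>\n'
    '    <li class="tag" href="./fast.html">fast</li>\n'
    '</ul>'
)

root = ET.fromstring(SNIPPET)
first = root.find("li")

# The text between </li> and the next <li> is stored as the element's tail,
# and serializing the element writes the tail after the closing tag.
with_tail = ET.tostring(first, encoding="unicode")


def html_without_tail(element):
    """Hypothetical helper: serialize an element without its tail text."""
    clone = copy.copy(element)  # shallow copy, so the parsed tree is untouched
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


print(repr(with_tail))                 # ends with the newline and indentation
print(repr(html_without_tail(first)))  # ends right after </li>
```

If the extra whitespace is unwanted in your own code, stripping the returned string is usually enough; dropping the tail as in the sketch is only needed when text, not just whitespace, follows the element.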
Parameters
- css_select: str, alternative, match HTML element(s) with CSS Selector
- xpath_select: str, alternative, match HTML element(s) with XPath Selector
- default: str, recommended, the default value if nothing matched in HTML element
- many: bool, optional, extract a list if True
Example
import ruia
from lxml import etree
HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''
html = etree.HTML(HTML)
def test_html_field():
    title = ruia.HtmlField(css_select='.title', default='Untitled')
    assert title.extract(html_etree=html) == '<div class="title" href="/">Ruia Documentation</div>\n'
    tags = ruia.HtmlField(css_select='.tag', default='No tag', many=True)
    assert tags.extract(html_etree=html)[1] == '<li class="tag" href="./fast.html">fast</li>\n    '
RegexField
RegexField does not parse the HTML structure;
it uses the Python standard library re directly.
If your spider hits a performance limit, try RegexField.
However, Ruia is based on asyncio,
so you will seldom hit such a limit.
RegexField's return value depends on the groups in the regular expression:
- if the regex has no group: return the whole matched string
- if the regex has one group: return the value of that group
- if the regex has multiple groups: return a list of strings
- if the regex has named groups, whether one or more: return a dict whose keys and values are both strings
- if many=True: return a list of the above values
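The rules above can be sketched with the standard library alone. This is an illustration of the listed behaviour, not Ruia's source code: regex_extract is a hypothetical helper, and returning default when nothing matches is an assumption made for this sketch.

```python
import re


def regex_extract(pattern, html, many=False, default=None):
    """Hypothetical helper mirroring the return rules listed above."""
    compiled = re.compile(pattern)

    def shape(match):
        named = match.groupdict()
        if named:                  # named groups -> dict of strings
            return named
        groups = match.groups()
        if not groups:             # no group -> the whole matched string
            return match.group(0)
        if len(groups) == 1:       # one group -> the value of that group
            return groups[0]
        return list(groups)        # several groups -> list of strings

    if many:                       # many=True -> list of the shapes above
        return [shape(m) for m in compiled.finditer(html)]
    match = compiled.search(html)
    return default if match is None else shape(match)


SAMPLE = (
    '<li class="tag" href="./easy.html">easy</li>'
    '<li class="tag" href="./fast.html">fast</li>'
)

whole = regex_extract(r'<li.*?</li>', SAMPLE)
one = regex_extract(r'href="(.*?)"', SAMPLE)
several = regex_extract(r'href="(.*?)">(.*?)</li>', SAMPLE)
named = regex_extract(r'href="(?P<href>.*?)">(?P<text>.*?)</li>', SAMPLE)
plural = regex_extract(r'href="(.*?)"', SAMPLE, many=True)

print(one)      # ./easy.html
print(several)  # ['./easy.html', 'easy']
print(plural)   # ['./easy.html', './fast.html']
```

Because the shape of the result changes with the pattern, it helps to decide up front whether a pattern uses no groups, positional groups or named groups, and to keep that choice stable.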
Parameters
- re_select: str, required, the regular expression to match against the HTML source
- default: str, recommended, the default value if nothing matched in the HTML source
- many: bool, optional, extract a list if True
Example
import ruia
from lxml import etree
HTML = '''
<body>
<div class="title" href="/">Ruia Documentation</div>
<ul>
    <li class="tag" href="./easy.html">easy</li>
    <li class="tag" href="./fast.html">fast</li>
    <li class="tag" href="./powerful.html">powerful</li>
</ul>
</body>
'''
html = etree.HTML(HTML)
def test_regex_field():
    title = ruia.RegexField(re_select='<div class="title" href="(.*?)">(.*?)</div>')
    assert title.extract(html=HTML)[0] == '/'
    assert title.extract(html=HTML)[1] == 'Ruia Documentation'
    tags = ruia.RegexField(re_select='<li class="tag" href="(?P<href>.*?)">(?P<text>.*?)</li>', many=True)
    result = tags.extract(html=HTML)
    assert isinstance(result, list)
    assert len(result) == 3
    assert isinstance(result[0], dict)
    assert result[0]['href'] == './easy.html'
About Parameter many
The many parameter, False by default, indicates whether the field extracts one value or multiple values from the HTML source code.
For example, one GitHub issue has many tags.
We can use Item.get_items to get multiple tag values,
but that means an extra class definition.
The many parameter aims to solve this problem.
A field uses many=False by default.
That means, for TextField, AttrField and HtmlField,
Field.extract() will always return a string,
and RegexField will return a string, a list or a dict,
depending on whether there are groups in the regular expression.
We can think of this as the 'singular' form.
With many=True, each field returns the 'plural' form,
that is, a list.
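The singular and plural forms can be illustrated without Ruia. This is a rough sketch built on the standard library's xml.etree.ElementTree; extract_text is a hypothetical helper, and both its use of default and the empty list it returns for many=True with no match are assumptions of this sketch, not statements about Ruia's behaviour.

```python
import xml.etree.ElementTree as ET

HTML = """
<body>
<div class="title">Ruia Documentation</div>
<ul>
    <li class="tag">easy</li>
    <li class="tag">fast</li>
    <li class="tag">powerful</li>
</ul>
</body>
"""


def extract_text(root, path, default=None, many=False):
    """Hypothetical helper: one string by default, a list when many=True."""
    elements = root.findall(path)
    if many:
        return [element.text for element in elements]  # plural: always a list
    if elements:
        return elements[0].text                        # singular: first match
    return default                                     # nothing matched


root = ET.fromstring(HTML.strip())

title = extract_text(root, ".//div[@class='title']", default="Untitled")
tags = extract_text(root, ".//li[@class='tag']", many=True)
missing = extract_text(root, ".//h1", default="Untitled")

print(title)    # Ruia Documentation
print(tags)     # ['easy', 'fast', 'powerful']
print(missing)  # Untitled
```

The selector is the same in both calls; only many decides whether the caller gets back a single value or a list.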