trafilatura: Extract the main text content of web pages

Robust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. Given an HTML document, this library parses it, retrieves the main body text and converts it to XML or plain text, while preserving part of the text formatting and page structure.

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> trafilatura.process_record(response.text)
>>> # outputs main content in plain text format ...

$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...

Work in progress, first package release ahead.

Code: https://github.com/adbar/trafilatura
Issue tracker: https://github.com/adbar/trafilatura/issues
License: GNU GPL v3; see LICENSE file

Features

Robust text extraction and boilerplate removal based on a combination of rules, XPath expressions and HTML tree examination. This task is also known as DOM-based content extraction, main content identification or HTML text cleaning. The purpose is to find the relevant and original text sections of a web page and to remove the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.).

Because it relies on lxml, trafilatura is comparatively fast. It is also robust, as the additional generic algorithm jusText is used as a backup solution.
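
To make the idea of rule-based, DOM-level extraction more concrete, here is a minimal sketch of the general technique. It is not trafilatura's implementation: the path lists are invented for illustration, the function name ``extract_main_text`` is hypothetical, and it uses the standard library's ElementTree (which only accepts well-formed markup and a subset of XPath) instead of lxml, so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Illustrative rules only; these are NOT the expressions trafilatura ships with.
BOILERPLATE_PATHS = [".//nav", ".//footer", ".//aside", ".//*[@class='ads']"]
CONTENT_PATHS = [".//article", ".//main", ".//body"]


def extract_main_text(markup):
    """Prune boilerplate elements, then return the paragraphs of the first content container."""
    root = ET.fromstring(markup)  # requires well-formed markup
    # ElementTree has no parent pointers, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}
    for path in BOILERPLATE_PATHS:
        for element in root.findall(path):
            parents[element].remove(element)
    for path in CONTENT_PATHS:
        container = root.find(path)
        if container is not None:
            # Flatten inline markup and normalise whitespace in each paragraph
            paragraphs = [" ".join("".join(p.itertext()).split())
                          for p in container.iter("p")]
            return "\n".join(p for p in paragraphs if p)
    return ""


sample = (
    "<html><body><nav><p>Home | About</p></nav>"
    "<article><p>Main <b>text</b> here.</p>"
    "<div class='ads'><p>Buy now</p></div>"
    "<p>Second paragraph.</p></article>"
    "<footer><p>Copyright</p></footer></body></html>"
)
print(extract_main_text(sample))
```

Real web pages are rarely well-formed, which is one reason a lenient parser such as lxml, combined with a generic fallback algorithm, is the more practical choice.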

The result of processing can be in plain text or XML format. In the latter case, basic formatting elements such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists) are preserved and can be used for further processing.

Currently experimental features:

XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI)
Language detection on the extracted content
Separate extraction of main text and comments

Installation

trafilatura is a Python 3 package that is available on PyPI and can be installed using pip:

pip install trafilatura

(Or use ``pip3 install trafilatura`` on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

The latest development version can also be installed directly from the repository with pip (see build status):

pip install git+https://github.com/adbar/trafilatura.git

With Python

Basic use

The simplest way to use trafilatura is as follows:

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> result = trafilatura.process_record(response.text)
>>> print(result) # newlines preserved, TXT output
>>> result = trafilatura.process_record(response.text, xml_output=True)
>>> print(result) # some formatting preserved in basic XML structure

The only required argument is the HTML content (here ``response.text``); all other arguments are optional. It is also possible to use a previously parsed tree (i.e. an lxml.html object) as input, which is then handled seamlessly.

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.process_record(mytree)
'Here is the main text...'

Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match.

>>> result = trafilatura.process_record(response.text, response.url, target_language='de')
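
Since a language mismatch yields no output, code that processes many pages may want to skip such records. The following is a small sketch under assumptions: the helper ``extract_all`` and its parameters are hypothetical, and it assumes that "no output" surfaces as ``None`` or as an empty string, which the description above does not specify.

```python
from typing import Callable, Dict, Optional


def extract_all(pages: Dict[str, str],
                extractor: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Apply extractor to each HTML string and keep only non-empty results."""
    results = {}
    for url, html in pages.items():
        text = extractor(html)
        if text:  # skips None as well as empty strings
            results[url] = text
    return results
```

One way to plug the library in would be to pass ``functools.partial(trafilatura.process_record, target_language='de')`` as the extractor, so that pages in other languages are dropped from the result.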

On the command-line

A basic command-line interface is included. URLs can be passed directly (-u/--URL):

$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://de.creativecommons.org/index.php/was-ist-cc/"
$ # outputs main text with basic XML structure ...

An HTML document (or response body) can also be piped to trafilatura:

$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura

For usage instructions, see ``trafilatura -h``.
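
The command-line tool can also be driven from a Python script, for instance when the package is installed in a separate environment. This is only a sketch: the helper names ``build_command`` and ``run_trafilatura`` are hypothetical, and it assumes that the ``trafilatura`` executable is on the PATH and writes its result to standard output. Only the ``--xml`` and ``--URL`` options shown above are used.

```python
import subprocess
from typing import List


def build_command(url: str, xml: bool = False) -> List[str]:
    """Assemble the argument list for the trafilatura command-line tool."""
    command = ["trafilatura"]
    if xml:
        command.append("--xml")
    command.extend(["--URL", url])
    return command


def run_trafilatura(url: str, xml: bool = False) -> str:
    """Run the CLI and return whatever it writes to standard output."""
    completed = subprocess.run(build_command(url, xml),
                               capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with URLs that contain characters such as ``&`` or ``?``.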

Additional information

Context

This module is part of a set of methods for deriving metadata from web documents in order to build text corpora for computational linguistics and NLP analysis. For more information:

Barbaresi, Adrien. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

Name

Trafilatura: Italian word for wire drawing.

Kudos to...

Contact

Pull requests are welcome.

See my contact page for additional details.
