inscriptis

A Python-based HTML to text converter with minimal support for CSS.

Requirements

Python 3.4+ (preferred) or Python 2.7+
lxml

Usage

Command line

The command line client converts HTML files or HTML retrieved from Web pages to the corresponding text representation.

Installation

sudo python3 setup.py install

Command line parameters

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] input

Converts HTML from file or url to a clean text version

positional arguments:
  input                 Html input either from a file or an url

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --image-captions  Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).

Examples

# convert the given page to text and output the result to the screen
python3 inscriptis.py http://www.htwchur.ch

# convert the file to text and save the output to output.txt
python3 inscriptis.py htwchur.html -o htwchur.txt

Library

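import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
# fetch the page and decode the raw bytes (assumes the page is UTF-8 encoded)
html = urllib.request.urlopen(url).read().decode("utf-8")

# convert the HTML to its text representation
text = get_text(html)

print(text)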
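Pages are not always served as UTF-8, so it can help to decode the downloaded bytes explicitly before converting them. The following is a minimal sketch using only the standard library; `decode_html` and `fetch_html` are hypothetical helper names and are not part of inscriptis.

```python
import urllib.request
from email.message import Message


def decode_html(raw: bytes, content_type: str = "", default: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset named in a Content-Type header.

    Hypothetical helper (not part of inscriptis). Falls back to `default`
    when no charset is given or the named codec is unknown or fails.
    """
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    charset = msg.get_content_charset() or default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(default, errors="replace")


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as a decoded string."""
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        return decode_html(response.read(), content_type)
```

The string returned by `fetch_html(url)` can then be passed to `get_text` as in the example above.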

Unit tests

Test cases for the HTML to text conversion are located in the tests/html directory. Each test case consists of two files:

test-name.html and
test-name.txt

the latter containing the reference text output for the given HTML file.
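The pairing of input and reference files can be sketched as follows. This is an illustration only, not the project's actual test runner: the function names are hypothetical, the converter is passed in as a parameter (for example `get_text`), and ignoring leading and trailing whitespace in the comparison is an assumption.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def find_test_cases(test_dir: str) -> List[Tuple[Path, Path]]:
    """Return (html_file, txt_file) pairs for every test-name.html
    that has a matching test-name.txt in the same directory."""
    cases = []
    for html_file in sorted(Path(test_dir).glob("*.html")):
        txt_file = html_file.with_suffix(".txt")
        if txt_file.exists():
            cases.append((html_file, txt_file))
    return cases


def failing_cases(test_dir: str, convert: Callable[[str], str]) -> List[str]:
    """Run `convert` on each HTML file and return the names of the
    test cases whose output differs from the reference text."""
    failures = []
    for html_file, txt_file in find_test_cases(test_dir):
        html = html_file.read_text(encoding="utf-8")
        reference = txt_file.read_text(encoding="utf-8")
        # Assumption: surrounding whitespace is not significant.
        if convert(html).strip() != reference.strip():
            failures.append(html_file.stem)
    return failures
```

With inscriptis installed, `failing_cases("tests/html", get_text)` would list the test cases whose output deviates from the reference files.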

Text conversion output comparison and speed benchmarking

inscriptis offers a small benchmarking script that compares different HTML to text conversion approaches. The script runs each approach on a list of URLs, url_list.txt, and saves the text output into a timestamped folder in benchmarking/benchmarking_results for manual comparison. Additionally, the processing speed of every approach per URL is measured and saved in a text file called speed_comparisons.txt in the respective timestamped folder.
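The way results end up in a timestamped folder can be sketched like this. It is a rough illustration under stated assumptions, not the code of run_benchmarking.py: the function name, the timestamp format, the output file naming and the layout of speed_comparisons.txt are all hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional, Tuple


def save_results(base_dir: str,
                 results: Dict[str, Dict[str, Tuple[str, float]]],
                 now: Optional[datetime] = None) -> Path:
    """Write converter outputs and timings into a timestamped folder.

    `results` maps URL -> approach name -> (text output, seconds).
    All names and file formats here are assumptions for illustration.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    out_dir = Path(base_dir) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)

    speed_lines = []
    for index, (url, per_approach) in enumerate(results.items()):
        speed_lines.append(url)
        for approach, (text, seconds) in per_approach.items():
            # One text file per URL and approach, for manual comparison.
            out_file = out_dir / f"{index}_{approach}.txt"
            out_file.write_text(text, encoding="utf-8")
            speed_lines.append(f"  {approach}: {seconds:.4f}s")

    speed_file = out_dir / "speed_comparisons.txt"
    speed_file.write_text("\n".join(speed_lines) + "\n", encoding="utf-8")
    return out_dir
```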

To run the benchmarking script, execute run_benchmarking.py from within the benchmarking folder. In def pipeline(), select which HTML -> Text algorithms are executed by modifying

run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
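How such boolean flags might gate the converters and their timing can be sketched as below. This is not the actual body of pipeline(): `run_enabled` and `strip_tags` are hypothetical names, and the stand-in converter only removes tags so the example runs without any of the benchmarked tools installed.

```python
import re
import time
from typing import Callable, Dict, Tuple


def strip_tags(html: str) -> str:
    """Stand-in converter for illustration: simply removes tags."""
    return re.sub(r"<[^>]+>", "", html)


def run_enabled(html: str,
                approaches: Dict[str, Tuple[bool, Callable[[str], str]]]
                ) -> Dict[str, Tuple[str, float]]:
    """Run every enabled approach on `html` and time it.

    `approaches` maps a name to (enabled flag, converter function).
    Returns name -> (text output, elapsed seconds).
    """
    results = {}
    for name, (enabled, convert) in approaches.items():
        if not enabled:
            continue  # flag set to False: skip this approach
        start = time.perf_counter()
        text = convert(html)
        results[name] = (text, time.perf_counter() - start)
    return results


# Example: one approach enabled, one disabled.
demo = run_enabled("<p>Hi</p>", {
    "strip_tags": (True, strip_tags),
    "uppercase": (False, str.upper),
})
```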

The URLs to be parsed are listed in url_list.txt, one per line with no additional formatting. URLs need to be complete (including http:// or https://), e.g.

http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
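A small check of the URL list before a benchmarking run can catch incomplete entries early. The sketch below is an assumption about how such a check could look and is not taken from the benchmarking script; `parse_url_list` is a hypothetical name.

```python
from typing import List
from urllib.parse import urlparse


def parse_url_list(content: str) -> List[str]:
    """Parse the content of a URL list (one URL per line).

    Blank lines are skipped. A ValueError is raised for any entry
    that lacks the http:// or https:// scheme or a host name.
    """
    urls = []
    for line in content.splitlines():
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"incomplete URL: {url!r}")
        urls.append(url)
    return urls
```

For example, the file content could be read with `open("url_list.txt", encoding="utf-8").read()` and passed to `parse_url_list`.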
