A Python-based HTML-to-text converter with minimal support for CSS.
- Python 3.4+ (preferred) or Python 2.7+
- lxml
The command line client converts HTML files or HTML retrieved from Web pages to the corresponding text representation.
Installation
sudo python3 setup.py install
Command line parameters
usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] input
Converts HTML from file or url to a clean text version
positional arguments:
  input                 Html input either from a file or an url
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Content encoding for files (default:utf-8)
  -i, --image-captions  Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
Examples
# convert the given page to text and output the result to the screen
python3 inscriptis.py http://www.htwchur.ch
# convert the file to text and save the output to output.txt
python3 inscriptis.py htwchur.html -o htwchur.txt
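import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
# fetch the page and decode the raw bytes (assumes the page is UTF-8 encoded)
html = urllib.request.urlopen(url).read().decode("utf-8")

# convert the HTML to plain text
text = get_text(html)
print(text)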
Test cases for the HTML-to-text conversion are located in the tests/html directory. Each test case consists of two files, test-name.html and test-name.txt, the latter containing the reference text output for the given HTML file.
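The pairing of HTML files with their reference outputs can be sketched as follows. This is only an illustration, not the project's own test runner: the function names `find_test_pairs` and `run_test_cases` are hypothetical, the converter is passed in as a parameter (for example `inscriptis.get_text`), and ignoring trailing whitespace in the comparison is an assumption.

```python
from pathlib import Path


def find_test_pairs(test_dir):
    """Return (html_path, txt_path) pairs for every complete test case."""
    pairs = []
    for html_path in sorted(Path(test_dir).glob("*.html")):
        txt_path = html_path.with_suffix(".txt")
        if txt_path.exists():  # skip HTML files without a reference output
            pairs.append((html_path, txt_path))
    return pairs


def run_test_cases(test_dir, convert):
    """Return the names of test cases whose output differs from the reference."""
    failures = []
    for html_path, txt_path in find_test_pairs(test_dir):
        html = html_path.read_text(encoding="utf-8")
        expected = txt_path.read_text(encoding="utf-8")
        # assumption: trailing whitespace is not significant
        if convert(html).rstrip() != expected.rstrip():
            failures.append(html_path.stem)
    return failures
```

Calling `run_test_cases("tests/html", get_text)` would then return an empty list when every conversion matches its reference file.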
inscriptis offers a small benchmarking script that compares different HTML-to-text conversion approaches.
The script runs the different approaches on a list of URLs, url_list.txt, and saves the text output into a timestamped folder in benchmarking/benchmarking_results for manual comparison.
Additionally, the processing speed of every approach per URL is measured and saved in a text file called speed_comparisons.txt in the respective timestamped folder.
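The measuring and saving steps described here could look roughly like the sketch below. It is not the code of run_benchmarking.py: the function names, the timestamp format, the one-file-per-converter naming, and the layout of the speed file are all assumptions made for illustration; only the file name speed_comparisons.txt and the benchmarking_results folder come from the description. The sketch assumes Python 3.6+.

```python
import time
from datetime import datetime
from pathlib import Path


def time_converters(converters, html):
    """Run each converter on html and return {name: (text, seconds)}."""
    results = {}
    for name, convert in converters.items():
        start = time.perf_counter()
        text = convert(html)
        results[name] = (text, time.perf_counter() - start)
    return results


def save_results(results, url, base_dir="benchmarking_results"):
    """Write one text file per converter and a speed summary into a timestamped folder."""
    # assumption: the folder is named after the current date and time
    out_dir = Path(base_dir) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "speed_comparisons.txt", "a", encoding="utf-8") as speed_file:
        speed_file.write(url + "\n")
        for name, (text, seconds) in results.items():
            (out_dir / (name + ".txt")).write_text(text, encoding="utf-8")
            speed_file.write("  {}: {:.4f} s\n".format(name, seconds))
    return out_dir
```

Passing the converters in as a dictionary of callables keeps the timing loop independent of which libraries are installed.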
To run the benchmarking script, execute run_benchmarking.py from within the benchmarking folder.
In def pipeline(), select which HTML-to-text algorithms are executed by modifying the following flags:
run_lynx = True
run_justext = True
run_html2text = True
run_beautifulsoup = True
run_inscriptis = True
In url_list.txt, specify the URLs to be parsed by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://), e.g.
http://www.informationscience.ch
https://en.wikipedia.org/wiki/Information_science
...
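A reader for a file in this format might look like the sketch below. The function name `load_url_list` is hypothetical, and both skipping blank lines and raising an error for incomplete URLs are assumptions; the benchmarking script itself may handle such lines differently.

```python
def load_url_list(path):
    """Read one URL per line, skipping blank lines and rejecting incomplete URLs."""
    urls = []
    with open(path, encoding="utf-8") as url_file:
        for line_number, line in enumerate(url_file, start=1):
            url = line.strip()
            if not url:
                continue  # assumption: blank lines are ignored
            if not url.startswith(("http://", "https://")):
                raise ValueError(
                    "line {}: URL must start with http:// or https://: {!r}".format(
                        line_number, url
                    )
                )
            urls.append(url)
    return urls
```

Failing early on an incomplete URL is one possible design choice; it surfaces a malformed entry before any page is downloaded.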