This repository contains code and documentation for supporting the web corpus construction session at the UCREL NLP summer school 2016.
Run ./spider.py -h
to see the help text:
    usage: spider.py [-h] [-seeds LIST] [-db DBDIR] [-loglevel LOGLEVEL]

    A web crawler demo for the UCREL Summer School 2016

    optional arguments:
      -h, --help          show this help message and exit
      -seeds LIST
      -db DBDIR
      -loglevel LOGLEVEL
On the first run, you'll need to provide a one-URL-per-line seed list using the -seeds argument:
./spider.py -seeds seed_urls/twitter.txt -db output -loglevel DEBUG
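A seed file is just a plain-text list of start URLs, one per line. A hypothetical example (the filename and URLs here are placeholders, not files from the repository):

```shell
# Create a hypothetical seed file; spider.py expects one URL per line.
mkdir -p seed_urls
cat > seed_urls/example.txt <<'EOF'
https://example.com/
https://example.org/news
EOF
```

You would then pass seed_urls/example.txt to -seeds in place of the Twitter list above.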
After that, you can resume a crawl by simply running it on the same database:
./spider.py -db output -loglevel DEBUG
Available log levels, in increasing order of severity, are:
- DEBUG
- INFO
- WARNING
- ERROR
- FATAL
The tools here require a number of Python libraries to run. We maintain the list in the conventional requirements.txt; to install them, run:
pip3 install --user -r requirements.txt
In addition to the requirements above, we have also inlined the url_normalize code in the interest of clarity.
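URL normalization collapses superficially different spellings of the same address into one canonical form, so the crawler does not fetch the same page twice. A toy standard-library sketch of the idea (the inlined url_normalize code handles many more cases than this):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Toy URL normalizer: lowercases the scheme and host, drops default
    ports and fragments, and ensures a non-empty path.

    Illustrative only -- the repository inlines the real url_normalize
    code, which is far more thorough.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and not (
        (scheme == "http" and parts.port == 80)
        or (scheme == "https" and parts.port == 443)
    ):
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

print(normalize("HTTP://Example.COM:80/index.html"))
# -> http://example.com/index.html
```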