Web Corpus Construction

This repository contains code and documentation supporting the web corpus construction session at the UCREL NLP Summer School 2016.

Slides that support the code

Use

Run ./spider.py -h to see the help text:

usage: spider.py [-h] [-seeds LIST] [-db DBDIR] [-loglevel LOGLEVEL]

A web crawler demo for the UCREL Summer School 2016

optional arguments:
  -h, --help          show this help message and exit
  -seeds LIST
  -db DBDIR
  -loglevel LOGLEVEL
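
For orientation, the sketch below shows a minimal argparse setup that would produce a help text of this shape. It is an illustration under assumptions, not the actual structure of spider.py, and the help strings are added here only for readability.

#!/usr/bin/env python3
# Sketch only: an argparse configuration matching the help text above.
# The real spider.py may be organised quite differently.
import argparse

parser = argparse.ArgumentParser(
    description="A web crawler demo for the UCREL Summer School 2016")
parser.add_argument("-seeds", metavar="LIST",
                    help="file containing one seed URL per line")
parser.add_argument("-db", metavar="DBDIR",
                    help="directory holding the crawl database")
parser.add_argument("-loglevel", metavar="LOGLEVEL",
                    help="logging verbosity, e.g. DEBUG or INFO")
args = parser.parse_args()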

On the first run, you'll need to provide a list of seed URLs (one URL per line) using the -seeds argument:

./spider.py -seeds seed_urls/twitter.txt -db output -loglevel DEBUG
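
The seed file is plain text with one URL per line. A hypothetical example (not the actual contents of seed_urls/twitter.txt):

https://twitter.com/UCREL_NLP
https://twitter.com/explore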

After that, you can resume the crawl by simply running the spider on the same database:

./spider.py -db output -loglevel DEBUG
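
Resumability comes from the crawler keeping its state (the URLs still to visit and those already fetched) in the -db directory rather than only in memory. The sketch below illustrates the idea with Python's shelve module; it is an assumption-laden illustration, not the storage layout spider.py actually uses.

# Sketch: persist crawl state under the -db directory so a later run
# can pick up where the previous one stopped. Illustrative only.
import os
import shelve

def open_state(dbdir, seeds=None):
    os.makedirs(dbdir, exist_ok=True)
    state = shelve.open(os.path.join(dbdir, "crawl_state"))
    if "frontier" not in state:
        # First run: initialise the frontier from the seed list.
        state["frontier"] = list(seeds or [])
        state["visited"] = []
    return state

state = open_state("output", seeds=["https://example.com/"])
print(len(state["frontier"]), "queued,", len(state["visited"]), "visited")
state.close()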

Available log levels (the standard Python logging levels; see the configuration sketch after this list) are:

  • DEBUG
  • WARNING
  • INFO
  • ERROR
  • FATAL
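
FATAL is an alias for CRITICAL in Python's logging module. A level name passed on the command line can be turned into a logging configuration along these lines (a sketch, not necessarily how spider.py does it):

# Sketch: map a level name such as the -loglevel value onto the
# standard logging module. Falls back to INFO for unknown names.
import logging

def configure_logging(level_name):
    level = getattr(logging, level_name.upper(), logging.INFO)
    logging.basicConfig(level=level,
                        format="%(asctime)s %(levelname)s %(message)s")

configure_logging("DEBUG")
logging.debug("crawler starting")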

Dependencies

The tools here require a number of Python libraries to run. These are listed in the conventional requirements.txt file. To install them, run:

pip3 install --user -r requirements.txt

In addition to the requirements listed above, we have also inlined the code of url_normalize in the interests of clarity.
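
In this setting, URL normalisation means mapping superficially different spellings of the same address (mixed-case hosts, default ports, fragments) onto one canonical form so a page is not crawled twice. A simplified sketch of the idea, not the inlined url_normalize code itself:

# Simplified sketch of URL normalisation: lower-case the scheme and
# host, drop default ports and fragments. Not the inlined url_normalize.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalise(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = "{}:{}".format(host, port)
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))

print(normalise("HTTP://Example.COM:80/index.html#top"))
# -> http://example.com/index.html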
