This repository contains code and documentation for supporting the web corpus construction session at the UCREL NLP summer school 2016.
Run ./spider.py -h
to see the help text:
    usage: spider.py [-h] [-seeds LIST] [-db DBDIR] [-loglevel LOGLEVEL]

    A web crawler demo for the UCREL Summer School 2016

    optional arguments:
      -h, --help          show this help message and exit
      -seeds LIST
      -db DBDIR
      -loglevel LOGLEVEL
On the first run, you'll need to provide a one-URL-per-line seed list using the -seeds argument:
./spider.py -seeds seed_urls/twitter.txt -db output -loglevel DEBUG
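A seed file is just a plain-text list of start URLs, one per line. A hypothetical example (the filename and URLs here are placeholders, not files from the repository):

```shell
# Create a hypothetical seed file; spider.py expects one URL per line.
mkdir -p seed_urls
cat > seed_urls/example.txt <<'EOF'
https://example.com/
https://example.org/news
EOF
```

You would then pass seed_urls/example.txt to -seeds in place of the Twitter list above.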
After that, you can resume a crawl by simply running it on the same database:
./spider.py -db output -loglevel DEBUG
Available log levels, in increasing order of severity, are:
- DEBUG
- INFO
- WARNING
- ERROR
- FATAL
The tools here require a number of Python libraries to run. We maintain the list in the conventional requirements.txt; to install them, run:
pip3 install --user -r requirements.txt
In addition to the requirements above, we have also inlined the url_normalize code in the interest of clarity.
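URL normalization collapses superficially different spellings of the same address into one canonical form, so the crawler does not fetch the same page twice. A toy standard-library sketch of the idea (the inlined url_normalize code handles many more cases than this):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Toy URL normalizer: lowercases the scheme and host, drops default
    ports and fragments, and ensures a non-empty path.

    Illustrative only -- the repository inlines the real url_normalize
    code, which is far more thorough.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and not (
        (scheme == "http" and parts.port == 80)
        or (scheme == "https" and parts.port == 443)
    ):
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))

print(normalize("HTTP://Example.COM:80/index.html"))
# -> http://example.com/index.html
```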