Focused Crawler

Originally designed as part of Virginia Tech's Crisis & Tragedy Recovery Network (CTRnet), this project crawls the internet and collects webpages related to a given topic, often for archival purposes.

FocusedCrawler.py

Driver class for this project
Responsible for creating the configuration and classifier objects and calling the crawler

crawler.py

Crawler class responsible for collecting and exploring new URLs to find relevant pages
Takes a priority queue and a scoring class that provides a calculate_score(text) method
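
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler: `KeywordScorer`, `crawl`, the `fetch` callable, the `threshold` and `page_limit` parameters, and the in-memory `FAKE_WEB` are all hypothetical names introduced here, and giving each outgoing link its parent page's score as priority is an assumption.

```python
import heapq

# Hypothetical in-memory "web" so the sketch runs without network access:
# URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("earthquake relief news",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("earthquake damage report", ["http://d.example/"]),
    "http://c.example/": ("cooking recipes and gardening tips", []),
    "http://d.example/": ("earthquake survivors recovery", ["http://a.example/"]),
}


class KeywordScorer:
    """Toy scoring class: fraction of words that are topic keywords."""

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def calculate_score(self, text):
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(1 for w in words if w in self.keywords) / len(words)


def crawl(seeds, scorer, fetch, threshold=0.2, page_limit=10):
    """Visit the most promising URLs first and return the relevant ones."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < page_limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:  # unreachable or unknown page
            continue
        text, links = page
        score = scorer.calculate_score(text)
        if score >= threshold:
            relevant.append((url, score))
            # Assumption: only relevant pages are expanded, and each link
            # inherits the score of the page it was found on.
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return relevant
```

Calling `crawl(["http://a.example/"], KeywordScorer({"earthquake", "relief", "recovery"}), FAKE_WEB.get)` explores the toy graph and skips the off-topic page. Any object with a `calculate_score(text)` method could stand in for `KeywordScorer`.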

classifier.py

Parent class of classifiers (non-VSM) including NaiveBayesClassifier and SVMClassifier
Contains code for tokenization and vectorization of document text using sklearn
Child classes only have to assign self.model
NBClassifier.py
- Subclass of Classifier, representing a Naïve Bayes classifier
SVMClassifier.py
- Subclass of Classifier, representing an SVM classifier
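
A rough sketch of this parent/child layout, assuming scikit-learn is installed. The method names `train` and `calculate_score`, the choice of `TfidfVectorizer`, and the specific models (`MultinomialNB`, `LinearSVC`) are assumptions made for illustration; the project's own classes may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


class Classifier:
    """Parent class: owns tokenization/vectorization; children set self.model."""

    def __init__(self):
        # Assumption: tf-idf features with English stop words removed.
        self.vectorizer = TfidfVectorizer(stop_words="english")
        self.model = None  # assigned by each subclass

    def train(self, texts, labels):
        features = self.vectorizer.fit_transform(texts)
        self.model.fit(features, labels)

    def calculate_score(self, text):
        # Returns the predicted label (1 = relevant, 0 = not relevant).
        features = self.vectorizer.transform([text])
        return int(self.model.predict(features)[0])


class NaiveBayesClassifier(Classifier):
    def __init__(self):
        super().__init__()
        self.model = MultinomialNB()


class SVMClassifier(Classifier):
    def __init__(self):
        super().__init__()
        self.model = LinearSVC()
```

Because the shared logic lives in the parent, adding another model under this layout would only require a new subclass that assigns a different estimator to `self.model`.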

scorer.py

Parent class of scorers, which are non-classifier models, typically VSM
tfidfscorer.py
- Subclass of Scorer, representing a tf-idf vector space model
lsiscorer.py
- Subclass of Scorer, representing an LSI vector space model
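
To illustrate what a tf-idf vector space scorer does, here is a small standard-library sketch. It is not the code in tfidfscorer.py: the class name `TfidfScorer`, the `tokenize` helper, the smoothed idf formula, and the use of a summed topic vector are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfScorer:
    """Scores text by cosine similarity to a tf-idf topic vector."""

    def __init__(self, documents):
        tokenized = [tokenize(doc) for doc in documents]
        n_docs = len(tokenized)
        doc_freq = Counter()
        for tokens in tokenized:
            doc_freq.update(set(tokens))
        # Smoothed idf so terms found in every document keep a nonzero weight.
        self.idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for term, df in doc_freq.items()}
        # Topic vector: sum of the tf-idf vectors of the topic documents.
        self.topic = Counter()
        for tokens in tokenized:
            for term, weight in self._vectorize(tokens).items():
                self.topic[term] += weight

    def _vectorize(self, tokens):
        # Terms outside the training vocabulary are ignored.
        counts = Counter(tokens)
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    def calculate_score(self, text):
        vec = self._vectorize(tokenize(text))
        dot = sum(w * self.topic.get(t, 0.0) for t, w in vec.items())
        norm_text = math.sqrt(sum(w * w for w in vec.values()))
        norm_topic = math.sqrt(sum(w * w for w in self.topic.values()))
        if norm_text == 0 or norm_topic == 0:
            return 0.0
        return dot / (norm_text * norm_topic)
```

A page that shares no vocabulary with the topic documents scores 0.0, while pages using the topic's terms score closer to 1.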

config.ini

Configuration file for focused crawler in INI format

config.py

Class responsible for reading configuration file, using ConfigParser
Adds all configuration options to its internal dictionary (e.g. config["seedFile"])
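
A minimal sketch of this pattern, written for Python 3, where the module is named `configparser`. The section names, the `pageLimit` option, and the `Config` class shape are hypothetical; only the `seedFile` key comes from the description above.

```python
import configparser

# Hypothetical INI content; the real config.ini may use other sections/options.
SAMPLE_INI = """
[Files]
seedFile = seeds.txt

[Crawler]
pageLimit = 100
"""


class Config:
    """Flattens every option from every section into one dictionary."""

    def __init__(self, ini_text):
        parser = configparser.ConfigParser()
        # By default option names are lowercased; keep "seedFile" as written.
        parser.optionxform = str
        parser.read_string(ini_text)
        self.options = {}
        for section in parser.sections():
            for key, value in parser.items(section):
                self.options[key] = value

    def __getitem__(self, key):
        return self.options[key]


config = Config(SAMPLE_INI)
seed_file = config["seedFile"]
```

Values are read as strings, so numeric options such as the hypothetical `pageLimit` would need an explicit `int(...)` conversion. Flattening sections also means two sections cannot reuse the same option name.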

utils.py

Contains various utility functions relating to reading files and sanitizing/tokenizing text

seeds.txt

Contains URLs of relevant pages from which the focused crawler starts
seeds.txt is the default name, but it can be changed in config.ini

priorityQueue.py

Simple implementation of a priority queue using a heap
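
For reference, a heap-backed priority queue can be sketched with the standard `heapq` module as below. The `push`/`pop` method names and the highest-priority-first ordering are assumptions, not necessarily what priorityQueue.py exposes.

```python
import heapq
import itertools


class PriorityQueue:
    """Max-priority queue built on heapq, which is a min-heap."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal priorities pop in insertion order.
        self._counter = itertools.count()

    def push(self, priority, item):
        # Negate the priority so the largest value is popped first.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def pop(self):
        neg_priority, _, item = heapq.heappop(self._heap)
        return -neg_priority, item

    def __len__(self):
        return len(self._heap)
```

Both `push` and `pop` run in O(log n), which suits a crawl frontier that grows as new links are discovered.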

webpage.py

Uses BeautifulSoup and nltk to extract webpage text

FocusedCrawlerReport.docx

For the full technical report, please visit: https://docs.google.com/file/d/0B436PtOU57sJZkc5anMyNDZPaHM/edit?usp=sharing

wilcollins/FocusedCrawler
