Python Crawler for collecting domain specific web corpora


python crawl_trial.py launches the crawl according to the parameters declared in crawl_parameters.yml, or in any YAML file passed as an argument to crawl_trial.py. The crawler uses the seed sites found in the files of a given directory (path), together with a query used to validate new webpages (query) found during the crawling process. inlinks_min defines the minimum number of citations a page must accumulate before being considered as a candidate to enter the corpus. The depth parameter defines the number of corpus-extension steps performed from the initial corpus made of the seed webpages.
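
For illustration only, here is a minimal sketch of how such a parameter file could be declared and read. The key names (path, query, inlinks_min, depth) are taken from the description above, but the actual file read by crawl_trial.py may contain additional settings:

```python
# Minimal sketch (assumed layout) of crawl_parameters.yml declaring the
# parameters described above, e.g.
#
#   path: seeds/            # directory holding the seed-site files
#   query: "my topic"       # query used to validate newly found pages
#   inlinks_min: 3          # minimum citations before a page may enter the corpus
#   depth: 2                # number of corpus-extension steps from the seeds
#
import yaml

with open("crawl_parameters.yml") as f:
    params = yaml.safe_load(f)

print(params)
```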

Required modules: urllib2, BeautifulSoup, urlparse, sqlite3, pyparsing, urllib, random, multiprocessing, lxml, socket, decruft, feedparser, pattern, warnings, chardet, yaml


TODO list:

  * better scraping of webpages; for the moment decruft does a good job but could be improved (http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/)
  * automatically extract dates from extracted texts (good Python solutions exist for English, still lacking for French)
  * feed the db with cleaner and richer information (such as domain name, number of views, etc.)
  * take the charset into account when crawling webpages (see the sketch after this list)
  * crawl updating process
  * automatically grab Google links to initiate a crawl
  * TBD grid-compliant code...
  * clean the code (debug mode, documentation, etc.)
  * write a comprehensive post_processing.py script to keep compatibility with other developments
  * monitoring and reporting (page retrieval problems, successes, distributions, etc.)
  * modular architecture: include a better information extraction process
  * avoid downloading the same content n times (md5 comparison; see the sketch after this list)
  * retry downloading pages that could not be opened
  * targeted and careful crawl of each domain (only follow hypertext links with the ~query in the URL or in the link text)
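
As a hedge against the encoding problems mentioned in the charset TODO item, a charset-aware fetch could look roughly like the following. This is only a sketch using urllib2 and chardet from the module list above, not the crawler's actual download routine:

```python
# Illustrative sketch only: detect the page charset with chardet before decoding.
import urllib2
import chardet

def fetch_text(url):
    raw = urllib2.urlopen(url, timeout=10).read()  # raw bytes of the page
    guess = chardet.detect(raw)                    # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    encoding = guess.get("encoding") or "utf-8"    # fall back to utf-8 when undetected
    return raw.decode(encoding, "replace")         # decode, replacing undecodable bytes
```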
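
The duplicate-avoidance item could be approached with a simple md5 fingerprint of page bodies, along these lines; the names here are illustrative and not part of the existing code:

```python
# Illustrative sketch only: remember md5 digests of downloaded bodies and skip repeats.
import hashlib

seen_digests = set()

def is_duplicate(content):
    digest = hashlib.md5(content).hexdigest()  # fingerprint of the page body
    if digest in seen_digests:
        return True                            # already downloaded this exact content
    seen_digests.add(digest)
    return False
```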
