This project provides querying and direct browsing of portions of the Common Crawl. It uses the Common Crawl URL Index, specifically the commoncrawlindex Python library, and extends the pywb (python wayback) project to provide web archive browsing capabilities for the Common Crawl.
This extension allows direct browsing of Common Crawl web data that has been indexed. At this time, the URL index appears to be only partial, and a lot of non-text content may be missing.
Install with pip:
pip install -r requirements.txt
This will install pywb and other dependencies.
There are a few quick run scripts:
./run.sh
-- run with wsgiref

./run-uwsgi.sh
-- run with uwsgi (must have uwsgi installed, eg: pip install uwsgi)

./run-gunicorn.sh
-- run with gunicorn (must have gunicorn installed, eg: pip install gunicorn)
To run tests against the live index (must have py.test installed, eg: pip install pytest):

./run-tests.sh
This browser follows standard wayback machine url conventions. For example, to see a list of captures for ask.metafilter.com, point your browser to:
http://localhost:8080/commoncrawl/*/http://ask.metafilter.com
You can also view captures for all urls starting with a given prefix by using the wildcard query:
http://localhost:8080/commoncrawl/*/http://ask.metafilter.com*
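The query URLs above follow a fixed pattern, so they are easy to build programmatically. Below is a minimal sketch of a helper that constructs them; the host, port, and collection name are taken from the example URLs in this README and should be adjusted for your deployment.

```python
# Sketch: build wayback-style "*" query URLs like the examples above.
# The default base URL (localhost:8080, "commoncrawl" collection) is an
# assumption taken from the sample URLs in this README.

def capture_list_url(url, prefix=False, base="http://localhost:8080/commoncrawl"):
    """Return the query URL listing captures for `url`.

    With prefix=True, a trailing wildcard is appended so the query
    matches all URLs starting with `url` (the prefix query form).
    """
    suffix = "*" if prefix else ""
    return "%s/*/%s%s" % (base, url, suffix)

print(capture_list_url("http://ask.metafilter.com"))
# -> http://localhost:8080/commoncrawl/*/http://ask.metafilter.com
print(capture_list_url("http://ask.metafilter.com", prefix=True))
# -> http://localhost:8080/commoncrawl/*/http://ask.metafilter.com*
```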
There is also a lower-level api for fetching the index in plain-text format:
http://localhost:8080/commoncrawl-index?url=http://ask.metafilter.com&matchType=host
(This query converts the Common Crawl index into a plain-text, CDX-like index. Additional options will be added at a later time.)
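The plain-text endpoint can also be queried from code. The sketch below builds the query URL with the standard library and fetches the raw index text; the endpoint and the url/matchType parameters come from the example above, while any other parameter values are assumptions. The fetch only succeeds if the server is running locally.

```python
# Sketch: query the plain-text CDX-like index endpoint.
# Endpoint and parameter names ("url", "matchType") are taken from the
# example query in this README; matchType values other than "host" are
# an assumption and may not be supported.

try:
    from urllib.parse import urlencode      # Python 3
    from urllib.request import urlopen
except ImportError:                         # Python 2 fallback
    from urllib import urlencode
    from urllib2 import urlopen

def index_query_url(url, match_type="host",
                    endpoint="http://localhost:8080/commoncrawl-index"):
    # urlencode percent-encodes the target URL for safe transport.
    return endpoint + "?" + urlencode([("url", url), ("matchType", match_type)])

def fetch_index(url, **kw):
    # Returns the raw CDX-like response body, one capture per line.
    # Requires the server from the run scripts above to be running.
    return urlopen(index_query_url(url, **kw)).read()

print(index_query_url("http://ask.metafilter.com"))
```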
See the cci-config.yaml file for configuration info specific to this deployment.
See the pywb GitHub project page for more details and documentation of the pywb wayback implementation.