This project provides querying and direct browsing of portions of the Common Crawl. It uses the Common Crawl URL Index, specifically the commoncrawlindex Python library, and extends the pywb (python wayback) project to provide web archive browsing capabilities for the Common Crawl.
This extension allows direct browsing of Common Crawl web data that has been indexed. At this time, the URL index appears to be only partial, and a lot of non-text content may be missing.
Install with pip:
pip install -r requirements.txt
This will install pywb and other dependencies.
There are a few quick run scripts:
./run.sh
-- run with wsgiref

./run-uwsgi.sh
-- run with uwsgi (must have uwsgi installed, eg: pip install uwsgi)

./run-gunicorn.sh
-- run with gunicorn (must have gunicorn installed, eg: pip install gunicorn)
To run tests against the live index (must have py.test installed, eg: pip install pytest):

./run-tests.sh
This browser follows standard wayback machine url conventions. For example, to see a list of captures for ask.metafilter.com, point your browser to:
http://localhost:8080/commoncrawl/*/http://ask.metafilter.com
You can also view captures for all urls starting with a given prefix by using the wildcard query:
http://localhost:8080/commoncrawl/*/http://ask.metafilter.com*
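The query URLs above follow a fixed pattern, so they are easy to build programmatically. Below is a minimal sketch of a helper that constructs them; the host, port, and collection name are taken from the example URLs in this README and should be adjusted for your deployment.

```python
# Sketch: build wayback-style "*" query URLs like the examples above.
# The default base URL (localhost:8080, "commoncrawl" collection) is an
# assumption taken from the sample URLs in this README.

def capture_list_url(url, prefix=False, base="http://localhost:8080/commoncrawl"):
    """Return the query URL listing captures for `url`.

    With prefix=True, a trailing wildcard is appended so the query
    matches all URLs starting with `url` (the prefix query form).
    """
    suffix = "*" if prefix else ""
    return "%s/*/%s%s" % (base, url, suffix)

print(capture_list_url("http://ask.metafilter.com"))
# -> http://localhost:8080/commoncrawl/*/http://ask.metafilter.com
print(capture_list_url("http://ask.metafilter.com", prefix=True))
# -> http://localhost:8080/commoncrawl/*/http://ask.metafilter.com*
```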
There is also a lower-level api for fetching the index in plain-text format:
http://localhost:8080/commoncrawl-index?url=http://ask.metafilter.com&matchType=host
(This query converts the Common Crawl index into a plain-text, CDX-like index. Additional options will be added at a later time.)
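The plain-text endpoint can also be queried from code. The sketch below builds the query URL with the standard library and fetches the raw index text; the endpoint and the url/matchType parameters come from the example above, while any other parameter values are assumptions. The fetch only succeeds if the server is running locally.

```python
# Sketch: query the plain-text CDX-like index endpoint.
# Endpoint and parameter names ("url", "matchType") are taken from the
# example query in this README; matchType values other than "host" are
# an assumption and may not be supported.

try:
    from urllib.parse import urlencode      # Python 3
    from urllib.request import urlopen
except ImportError:                         # Python 2 fallback
    from urllib import urlencode
    from urllib2 import urlopen

def index_query_url(url, match_type="host",
                    endpoint="http://localhost:8080/commoncrawl-index"):
    # urlencode percent-encodes the target URL for safe transport.
    return endpoint + "?" + urlencode([("url", url), ("matchType", match_type)])

def fetch_index(url, **kw):
    # Returns the raw CDX-like response body, one capture per line.
    # Requires the server from the run scripts above to be running.
    return urlopen(index_query_url(url, **kw)).read()

print(index_query_url("http://ask.metafilter.com"))
```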
See the cci-config.yaml file for configuration info specific to this deployment.
See the pywb GitHub project page for more details and documentation of the pywb wayback implementation.