WEB ARCHIVES

Web Archives Collection System

Setup

On Ubuntu (versions newer than 16.04):

# Get source
git clone https://github.com/tigercosmos/web-archives.git
cd web-archives
git submodule init
git submodule update

# Dependencies
sudo apt-get install libxml2-dev libxslt-dev proxychains

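# Do not change this path; it is hard-coded.
# Note: --no-site-packages has been the default since virtualenv 1.7 and was removed in virtualenv 20; omit the flag if it is rejected.
virtualenv warcm_env/virt1 --no-site-packages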
source warcm_env/virt1/bin/activate

pip install -r WarcMiddleware/pip_requirements.txt
pip install git+https://github.com/ikreymer/pywb.git

Usage

For all sites

The full list of sites is in assets/alexa-*.csv.

# fetch every listed site and save it as WARC
./scripts/getArchiveAll.py
# extract the saved WARC archives to files
./scripts/extractArchiveAll.py
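
To illustrate what a batch run over the site list involves, here is a minimal Python 3 sketch that turns an Alexa-style CSV into one getArchive.sh command per site. It is not the code of getArchiveAll.py. It assumes each CSV row has the form rank,domain (the layout of the public Alexa top-sites export), that the domain itself can serve as [Name], and that sites are reachable over plain http://. The function build_commands and its limit parameter are hypothetical names introduced for this example.

```python
import csv
import io


def build_commands(csv_text, limit=None):
    """Build one getArchive.sh command per row of an Alexa-style CSV.

    Assumptions (not confirmed by the repository):
    rows look like "rank,domain", the domain doubles as the archive name,
    and each site is fetched over plain HTTP.
    """
    commands = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Stop once the optional cap on the number of sites is reached.
        if limit is not None and len(commands) >= limit:
            break
        # Skip blank or malformed rows instead of failing the whole batch.
        if len(row) < 2 or not row[1].strip():
            continue
        domain = row[1].strip()
        commands.append(["./scripts/getArchive.sh", domain, "http://" + domain])
    return commands


if __name__ == "__main__":
    sample = "1,google.com\n2,youtube.com\n"
    for command in build_commands(sample):
        # Print instead of executing, so the sketch has no side effects.
        print(" ".join(command))
```

Each returned command is an argument list, so it could be passed to subprocess.run unchanged if you want to execute it rather than print it.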

For one site

# fetch the site and save it as a WARC file
./scripts/getArchive.sh [Name] [URL]
# extract the WARC file to regular files
./scripts/extracArchive.sh [Name]
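
For readers unfamiliar with the WARC format, the following Python 3 sketch shows roughly what reading an archive involves: each record is a version line, a block of Name: value headers, a blank line, and then Content-Length bytes of payload. This is an illustration using only the standard library, not the extraction logic of the scripts above. It assumes an uncompressed WARC stream; iter_warc_records is a hypothetical helper name.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are kept
    exactly as written in the file.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError("not a WARC record: %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given by the Content-Length header.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    for headers, payload in iter_warc_records(io.BytesIO(sample)):
        print(headers["WARC-Target-URI"], len(payload))
```

For a gzip-compressed archive, opening the file with gzip.open(path, "rb") from the standard library should provide a suitable stream, since that module reads concatenated gzip members transparently.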

Third-Party Tool Documentation

Top Website Data

Alexa Top 1,000,000 as of 2018/3/20

License

ISC