This branch contains the scripts used to fetch HTML files from the top Alexa sites.
- The top-1-million Alexa sites CSV is downloaded and unzipped, and the URLs are extracted from it.
- The URLs are then fed to a Python script that downloads the HTML files and their HTTP headers using a thread pool (to minimize waiting).
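The two steps above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the archive URL, the `top-1m.csv` member name, and the `rank,domain` row layout are assumptions based on the historical Alexa format, and it is written in Python 3 for brevity even though the repo targets Python 2.7.

```python
import csv
import io
import zipfile

# Historical location of the Alexa top-1M list (an assumption, not taken from getData.sh).
ALEXA_ZIP_URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"

def extract_urls(zip_bytes, limit=1000):
    """Pull domains out of the downloaded top-1m.csv.zip and turn them into URLs."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open("top-1m.csv") as f:
            reader = csv.reader(io.TextIOWrapper(f, "utf-8"))
            # Each row is "rank,domain"; prepend a scheme so the downloader can fetch it.
            return ["http://" + row[1] for _, row in zip(range(limit), reader)]
```

The returned URL list is what would then be handed to the thread-pooled downloader.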
If you're on Linux or OS X, simply run ./getData.sh and you should be good to go.
If you're on Windows, Cygwin may be your best bet.
If you want to fetch resources other than Alexa's top HTML files, you can do so with something like
cat resource_urls.txt | ./downloadr.py
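The stdin-fed, thread-pooled download step might look roughly like the sketch below. This is a hedged approximation of what downloadr.py does, not its actual contents, and it uses Python 3's `concurrent.futures` even though the repo targets Python 2.7.

```python
import concurrent.futures
import urllib.request

def fetch(url, timeout=10):
    """Download one URL, returning (url, headers, body), or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, str(resp.headers), resp.read()
    except Exception:
        return None

def fetch_all(urls, workers=20):
    # A thread pool keeps many requests in flight, so one slow server
    # doesn't serialize the whole crawl.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return [r for r in pool.map(fetch, urls) if r is not None]
```

Feeding it from stdin, as in the pipe above, would amount to `fetch_all(line.strip() for line in sys.stdin)`, with each result written out as a data file and a header file.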
The scripts require:
- Python (tested with 2.7)
- curl
- zcat
- python-magic
The resulting directory structure is:
- A root directory of the pattern "webdevdata.org-YYYY-MM-DD-HHMMSS"
- Sub-directories are named after 16-bit hashes of the URLs below them, used to ensure there are not too many files in a single directory.
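The bucketing scheme can be illustrated like this. The actual hash function the scripts use isn't specified here; taking the first two bytes of an MD5 digest is an assumption that gives the same 16-bit spread (up to 65,536 sub-directories).

```python
import hashlib

def bucket_for(url):
    """Map a URL to a 4-hex-digit (16-bit) directory name.

    MD5 is an assumed stand-in for whatever hash the scripts actually use;
    any stable hash truncated to 16 bits spreads files the same way.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:4]
```

The same URL always lands in the same bucket, so a file can be located again without scanning every directory.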
The data files have an ".html.txt" extension and the header files an ".html.hdr.txt" extension.
The data set includes approximately 53,000 HTML files. Some HTML element and attribute usage statistics derived from the data are available.
A Java-based script is available to gather statistics on HTML tags and attributes using CSS-like queries.
See the "Queries on WebDevData" wiki page.