
baptistelebail/webdevdata.org

 
 


WebDevData.org

This branch contains the scripts used to fetch the HTML files from top Alexa sites.

Methodology

  • The CSV of the top 1 million Alexa sites is downloaded and unzipped, and the URLs are extracted from it.
  • The URLs are then fed to a Python script that downloads the HTML files and their HTTP headers using a thread pool, so that slow responses do not block the other downloads.
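The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: it assumes the CSV has `rank,domain` rows, that prefixing `http://` is enough to form a URL, and the names `extract_urls`, `fetch`, and `fetch_all` are hypothetical.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def extract_urls(csv_text):
    """Turn 'rank,domain' rows into http:// URLs (assumed CSV layout)."""
    rows = csv.reader(io.StringIO(csv_text))
    return ["http://" + row[1].strip()
            for row in rows
            if len(row) >= 2 and row[1].strip()]


def fetch(url, timeout=10):
    """Download one page and return (headers_text, body_bytes)."""
    with urlopen(url, timeout=timeout) as response:
        return str(response.headers), response.read()


def fetch_all(urls, fetch_one=fetch, workers=32):
    """Fetch URLs concurrently.

    Returns a list of (url, result, error) tuples in input order;
    exactly one of result and error is None.
    """
    def safe(url):
        try:
            return url, fetch_one(url), None
        except Exception as exc:  # failures are expected at this scale
            return url, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe, urls))
```

Threads suit this workload because each download spends most of its time waiting on the network, so many requests can be in flight at once. Passing `fetch_one` as a parameter keeps the pool logic testable without network access.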

Usage

If you're on Linux or OS X, simply run ./getData.sh and you should be good to go. If you're on Windows, Cygwin may be your best bet.

If you want to fetch resources other than the top Alexa HTML pages, pipe a list of URLs into the download script, for example: cat resource_urls.txt | ./downloadr.py
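As a rough idea of what reading a piped URL list can look like, here is a small sketch. It assumes one URL per line and that blank lines should be ignored; `read_urls` is a hypothetical helper, not necessarily how downloadr.py parses its input.

```python
def read_urls(stream):
    """Yield one URL per non-empty line of a text stream."""
    for line in stream:
        url = line.strip()
        if url:  # skip blank lines
            yield url
```

A script would typically call `read_urls(sys.stdin)` so that it works with `cat resource_urls.txt | ...` as well as with shell redirection.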

Dependencies

Results

The resulting directory structure is:

  • A root directory of the pattern "webdevdata.org-YYYY-MM-DD-HHMMSS"
  • Sub-directories are named after 16-bit hashes of the URLs stored below them. This spreads the files out so that no single directory holds too many of them.

The resulting files have an ".html.txt" extension for the data files and ".html.hdr.txt" extension for the header files.
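The layout described above could be produced along these lines. This is a sketch under stated assumptions: the actual hash function is not documented here, so the low 16 bits of CRC32 stand in for it, and `root_dir_name`, `bucket_for`, `output_paths`, and the `name` argument are hypothetical.

```python
import zlib
from datetime import datetime
from pathlib import Path


def root_dir_name(now):
    """Root directory name in the webdevdata.org-YYYY-MM-DD-HHMMSS pattern."""
    return now.strftime("webdevdata.org-%Y-%m-%d-%H%M%S")


def bucket_for(url):
    """16-bit bucket as four hex digits (CRC32 low bits; an assumption)."""
    return format(zlib.crc32(url.encode("utf-8")) & 0xFFFF, "04x")


def output_paths(root, url, name):
    """Return (data_path, header_path) for one downloaded URL."""
    bucket = Path(root) / bucket_for(url)
    return bucket / (name + ".html.txt"), bucket / (name + ".html.hdr.txt")
```

A 16-bit hash allows at most 65,536 sub-directories, so even a million downloaded pages would average only about 15 data files per directory.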

The results include approximately 53,000 HTML files. Some HTML element and attribute usage statistics derived from the data are available.

Queries

A Java-based script is available for gathering statistics on HTML tags and attributes using CSS-like queries.

See the Queries on WebDevData wiki.

About

Website for reports, etc.


Languages

  • Java 83.3%
  • Python 14.7%
  • Shell 2.0%