GitHub - udayravishankar/prep-wu-data: coprolitic scripts for some datasets

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1,285 Commits
apeyeye/app/endpoints		apeyeye/app/endpoints
computer		computer
culture		culture
db/utils/auxtables		db/utils/auxtables
demographics		demographics
engineering/mechanical/material_properties_mjvanvoorhis		engineering/mechanical/material_properties_mjvanvoorhis
geo		geo
health		health
huge		huge
icss		icss
join/codes_country		join/codes_country
language		language
math		math
money		money
politics/campaign_us_2008		politics/campaign_us_2008
scaffolds		scaffolds
science		science
social		social
sport		sport
time		time
weather/ncdc		weather/ncdc
web/analytics		web/analytics
.gitignore		.gitignore
README		README
README-commands		README-commands
README-license		README-license
README-overview		README-overview
snarf-mlbgameday-xml.sh		snarf-mlbgameday-xml.sh

Repository files navigation

h1. Intro

Infinite monkeywrench is a frameworks to simplify the tasks of acquiring,
extracting, transforming and loading data.

* It's built, designed and tested for manipulating datasets as small as 1k and
  as large as hundreds of gigabytes.

* Minimize **programmer time** even at the expense of increasing run time.
  These tasks only need to be run once.  (And they deal scalably with
  incremental updates: see 'Lazy' below.)

* Runtime scales with data.  One MB of data for testing, will run in about
  1000'th the time to process one GB of data for real.
  
* Simple parallelization. Tell imw it's #3 out of 5 (or 500) workers and it will
  process only that fraction of the input.

* Lazy evaluation, like 'make': imw lets you define dependency chains (and comes
  already knowing a few).  It won't scrape new data if there's nothing new to
  scrape, and if you need to generate file "frobnozz" before you can process
  file "marklar", imw generates frobnozz if and only if it doesn't already exist.

* Realistic.  IMW is built to handle real data as she is spoke:

** Beautiful, schematized, formatted data from the infochimps.org collection.
** Most popular file formats: XML, YAML, CSV, JSON
** Parsing flat files becomes an easy two-liner piece of code.
** Messy data in some backwater format with no schema still sucks but a lot less
   than it used to.
** Scrape a web page tree and nimbly extract the table data from each page.

* Although obviously it's more work to acquire a sloppy dataset than a
  well-defined one, imw degrades well -- you write code for exactly and only the
  tasks that make your dataset bizarre.

* IMW is toolset agnostic.  If you have pre-existing routines to parse or
  acquire some format, imw is happy to call those tools at any step. You can
  have imw manage the acquisition and loading, but replace the entire munging
  step with a simple @'sh "perl dostuff.pl"'@ or even @'make'@.

h2. Setup

Since I'm not smart enough to get this bootstrapped the right way, add this to
your .profile (either that, or ensmarten me about the right way):
  ### Wield Infinite Monkeywrench (+1 to data munging, charisma, THAC0):
  export IMW_ROOT=$HOME/ics/imw
  export PATH="$IMW_ROOT"/bin:"$PATH"
  export RUBYLIB="$IMW_ROOT"/lib:"$RUBYLIB"

These directories will be created under $IMW_ROOT (more about each later):
* @pool/(cat)/(subcat)/(pool)/@		-- Data pool processing code
* @ripd/com.reverse.url/dirs/files.ext@	-- scraped data from elsewhere
* @rawd/(cat)/(subcat)/(pool)/@ 	-- working copy of raw data
* @dump/(cat)/(subcat)/(pool)/@ 	-- intermediate data
* @fixd/(cat)/(subcat)/(pool)/@ 	-- completely processed data
* @pkgd/(cat)/(subcat)/(pool)/@ 	-- compressed & bundled distributable

If any directory is called for but found missing, IMW will create it (except for
@dump/@, which will be linked to /tmp/imw).  However, if the directory is there,
IMW will leave it the heck alone. So feel free to replace each directory with a
symbolic link (putting @rawd/@, @dump/@ and @fixd/@ directories on a large, fast
drive, @ripd/@ and @pkgd/@ on large, slow drives for instance.).

h2. Scaffolding

@ imw generate pool=foo/bar/myhappypool @



h2. Actually processing your data

The Infinite Monkeywrench is built atop ActiveResource, the data model that
powers Ruby on Rails.  This gives us

* a well-known, well-tested, database-agnostic data model
* ability to export from that datamodel to sqlite3, csv, yaml, xml and JSON with
  vernacular structure
* active_resource for both an outgoing API and an ingoing data socket: if
  someone creates an active_resource facade to an external API we get the
  data

It also demonstrates our commitment to the "minimize programmer time, not run
time" philosophy, but it's worked for us so far.