This repo contains Shilad's source code for the Summer 2014 Wikipedia Source Geography project with Heather Ford, Dave Musicant, Brent Hecht, and Mark Graham.
Recreating this project:
- Create an account on Wikimedia Foundation's Tools Labs.
- On a labs instance, run get_geo_articles.py to get the list of geographic articles in every language.
- On a labs instance, run get_labs_urls.py to get the list of external URLs in every geographic article.
- Place the external URL file in dat/wmf_source_urls.tsv.bz2
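Before handing the URL file to WikiBrain, it can help to peek at it. The following is a minimal sketch, assuming only that the file is bzip2-compressed, tab-separated UTF-8 text. The column layout is not documented here, so the helper (peek_tsv_bz2, a hypothetical name that is not part of this repo) simply returns the first few rows as lists of fields.

```python
import bz2
import csv
from itertools import islice


def peek_tsv_bz2(path, n=5):
    """Return the first n rows of a bzip2-compressed TSV as lists of strings.

    Hypothetical helper for sanity checks; it makes no assumption about
    what the columns mean.
    """
    # bz2.open in text mode decompresses and decodes on the fly, so the
    # (large) file never has to be unpacked on disk.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters inside URLs as ordinary data.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n))
```

For example, peek_tsv_bz2("dat/wmf_source_urls.tsv.bz2", 3) would return the first three rows of the file placed in the step above.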
- Install WikiBrain for language EN on some computer.
- Run WmfExtractEnhancer.java on your WikiBrain machine to add more data to the url file.
- At the end of this step, your final file should be placed in dat/source_urls.tsv
- Extract all domains by running AllDomains.java on your WikiBrain installation.
- Create an Amazon Web Services (AWS) RDS Postgres installation on a VPC.
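AllDomains.java is what actually performs the domain extraction in this project, and its exact normalization rules are not described here. As a rough illustration of what reducing a cited URL to a domain involves, here is a minimal Python sketch. The function name, the lowercasing, and the stripping of a leading "www." are all assumptions made for the example, not the project's behavior.

```python
from urllib.parse import urlsplit


def url_to_domain(url):
    """Return the lowercased host of url, or None if no host can be found.

    Illustrative only: AllDomains.java may normalize differently.
    """
    # A scheme-less string such as "example.org/page" would otherwise be
    # parsed as a path, so prefix "//" to mark the start of the host part.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    try:
        host = urlsplit(url).hostname  # excludes port and credentials
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. bad IPv6 brackets).
        return None
    if not host:
        return None
    host = host.lower()
    # Assumption: treat "www.example.org" and "example.org" as one domain.
    if host.startswith("www."):
        host = host[4:]
    return host
```

Collecting {url_to_domain(u) for u in urls} and discarding None would then give a deduplicated domain list of the kind that the whois steps consume.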
- Fire up an EC2 machine and load the domains into a database by running create_whois_db.py with the domain list extracted by AllDomains.java.
- Run manage_whois.sh. You will have to change the configuration parameters in the script.
- Extract all cited URLs by running AllUrls.java on your WikiBrain installation.
- Fire up an EC2 machine and load the URLs into a database by running create_url_db.py with the result from the previous step.
- Run manage_scraping.sh. You will have to change the configuration parameters in the script.
- Back up your Amazon RDS instance using pg_dump.
- Restore the Postgres database on your computer.
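The backup and restore steps above can be sketched as two command lines. The snippet below is a minimal sketch, not the commands originally used: it assumes a custom-format dump (a plain SQL dump would be restored with psql instead), and the host, user, database, and file names in the usage note are placeholders. The flags themselves are standard pg_dump and pg_restore options.

```python
import subprocess


def dump_command(host, user, dbname, out_path):
    """Build a pg_dump command that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format", "custom",  # compressed archive readable by pg_restore
        "--file", out_path,
        dbname,
    ]


def restore_command(dbname, dump_path):
    """Build a pg_restore command that loads an archive into a local database."""
    # --no-owner avoids errors when the RDS role names do not exist locally.
    return ["pg_restore", "--no-owner", "--dbname", dbname, dump_path]


def run(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

A run might look like run(dump_command("my-rds-host", "dbuser", "dbname", "rds.dump")) on a machine that can reach the RDS instance, followed by run(restore_command("dbname", "rds.dump")) on your computer after creating the empty target database. All four names are hypothetical.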
- Copy the S3 directories with the scrape to your computer (careful! this is about 0.5 TB).
- Give Dave Musicant the postgres dump and ask him to extract admin countries for the whois results.
- Build the set of "interesting" (e.g. non-multimedia) URLs: python build_interesting_urls.py
- Build counts for each URL: python build_url_counts.py
- Place Dave's whois results in dat/whois_results3.tsv
- Build the URL-to-country file based on whois: python build_url_to_whois.py
- Build the URL-to-country file based on Wikidata: python build_wikidata_locations.py
- Build the SQL database by running: python urlinfo.py
- Build the prior country distribution: python build_country_priors.py
- Run the inferrer: