
memex_ad_features

The purpose of this repository is to calculate a number of economic metrics on MEMEX escorts data. It's a small data pipeline intended to take a number of extractions from scraped advertisements and spit out usable JSON data.

Design

The different data targets are managed by the Makefile. Input data for a given target is passed through a number of Python and R scripts, and a generated CSV is spat out.

Input

Currently, the Makefile takes as input TSV-file dumps of Lattice data stored in S3 buckets, along with the TSV files dumped by the different scripts. Eventually, the Makefile will replace these S3 dumps with Lattice files stored in HDFS. Potentially, it will also take other extractions from the different teams.

Output

Currently, the Makefile spits out a number of different CSV files. Eventually it will dump these files into HDFS for general consumption.

Target Dependencies

The dependency graph below is intended to provide an overview of the Makefile and the chain of targets that it can create. If it doesn't render correctly, know that all is well! Just click on the "missing" icon to get to the file itself.

Markdown Target Dependencies

Target Plans

We will be paring down this version of the Makefile to focus on a limited number of targets. Specifically:

Ad-level metrics

Note that most ad-level metrics are relatively unimportant: user-facing systems already have basic ad information and extracted fields. Our contribution here is therefore either to impute prices for ads or to contribute measures which depend in part on aggregated statistics, such as...

Non-imputed features (we can look these up when they exist)

  • Ad price at ad level (price_per_hour in make_ad_prices.py) (@gstub)
    • For ads with multiple prices, hold out 10% of the one-hour prices, impute the values of the held-out set, and test the quality of the match (Jeff, validates parent)
  • Price relative to geographic area (supertask)
    • Price relative to the city average (lowest level of geographic agg. provided by Lattice) (@gstub)
    • Price relative to the MSA average (price_per_hour - ad_p50_msa)/ad_std_msa (@gstub)
    • Price relative to the state average (@gstub)
    • Find out what list of cities Lattice is using (@pmlandwehr)
      • Find the mapping from that list of cities to the MSAs. (@pmlandwehr) (DO NOT get hung up on this.)
  • Price quantile relative to geographic area (i.e. is this ad at the 25th percentile? The 30th percentile?) (@gstub)
  • Price for phone number X relative to the median price for phone number X. (@gstub)
  • Flag difference from the MSA average (i.e. if 30% of ads in an MSA are flagged "Juvenile", the MSA average is 0.3, so a non-"Juvenile"-flagged ad will have a value of -0.3 and a "Juvenile"-flagged ad a value of 0.7)
  • Total number of extracted prices per ad (possibly not included in data at this point, but good to have)
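The relative-price and flag-difference measures above can be sketched with pandas. This is an illustrative sketch only: the column names (price_per_hour, msa, flag_juvenile) and the toy data are assumptions, not the actual Lattice schema.

```python
import pandas as pd

# Toy stand-in for the Lattice ad extractions; column names are illustrative.
ads = pd.DataFrame({
    "price_per_hour": [200.0, 250.0, 300.0, 150.0, 400.0],
    "msa": ["A", "A", "A", "B", "B"],
    "flag_juvenile": [0, 1, 0, 0, 1],
})

grp = ads.groupby("msa")["price_per_hour"]
ads["ad_p50_msa"] = grp.transform("median")
ads["ad_std_msa"] = grp.transform("std")

# Price relative to the MSA average: (price_per_hour - ad_p50_msa) / ad_std_msa
ads["rel_price_msa"] = (ads["price_per_hour"] - ads["ad_p50_msa"]) / ads["ad_std_msa"]

# Price quantile within the MSA (is this ad at the 25th percentile? The 30th?)
ads["price_quantile_msa"] = grp.rank(pct=True)

# Flag difference from the MSA average: a flagged ad in an MSA where 30% of
# ads are flagged scores 0.7; an unflagged ad there scores -0.3.
ads["flag_diff_msa"] = (
    ads["flag_juvenile"] - ads.groupby("msa")["flag_juvenile"].transform("mean")
)
```

The same transform-based pattern extends to the phone-relative price (group by phone instead of msa) and to state- or city-level aggregates.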

Imputations (features that can't be looked up)

  • Price (if missing)
    • Jeff's first task: see what to do about text features from CMU
    • Jeff's second task: see about improving the match rates between features and ads
  • Age (if missing)
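A minimal baseline for the price imputation (the real approach is Jeff's task, per the list above) is a fallback chain: fill a missing price with the phone-level median, then with the MSA-level median. Column names here are illustrative assumptions:

```python
import pandas as pd

# Illustrative ad table with some missing prices; not the real schema.
ads = pd.DataFrame({
    "phone": ["555-0001", "555-0001", "555-0002", "555-0003"],
    "msa": ["A", "A", "A", "A"],
    "price_per_hour": [200.0, None, 300.0, None],
})

# Per-row medians for the ad's own phone number and MSA.
phone_median = ads.groupby("phone")["price_per_hour"].transform("median")
msa_median = ads.groupby("msa")["price_per_hour"].transform("median")

# Prefer the observed price, then the phone median, then the MSA median.
ads["price_imputed"] = (
    ads["price_per_hour"].fillna(phone_median).fillna(msa_median)
)
```

The same two-step fallback applies to age imputation by swapping in the age column.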

Geographic-area-level metrics

Eventually, there will be three files here:

  • State
  • City
  • MSA / small geographic region

Values should be calculated quarterly (perhaps monthly? Or a rolling average?)

  • Plots of price changes per time period, for each region with more than 300 priced ads in that period, where time period =

    • month
    • quarter
    • year
  • Those laid out in msa_characteristics.csv (generated by make_msa_characteristics.py)

    • ad_count_msa - number of ads
    • Lattice-extracted Price data per geographic area
      • ad_mean_msa - Mean
      • ad_std_msa - Standard deviation of price within MSA
      • ad_p<X>_msa, where <X> = range(0, 100, 5) - value at the Xth price percentile
    • Lattice flag data per geographic area
    • Lattice-extracted Age data per geographic area
      • ad_mean_age - Mean
      • ad_p<X>_age_msa, where <X> = range(0, 100, 5) - value at the Xth age percentile
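The MSA-level table above can be produced with a groupby plus per-group quantiles. This is only a sketch of the idea; make_msa_characteristics.py is the authoritative implementation, and the input columns here are assumed names:

```python
import pandas as pd

# Toy ad-level input; msa and price_per_hour are assumed column names.
ads = pd.DataFrame({
    "msa": ["A"] * 4 + ["B"] * 4,
    "price_per_hour": [100.0, 200.0, 300.0, 400.0, 50.0, 60.0, 70.0, 80.0],
})

grp = ads.groupby("msa")["price_per_hour"]
msa = pd.DataFrame({
    "ad_count_msa": grp.size(),   # number of ads per MSA
    "ad_mean_msa": grp.mean(),    # mean price within MSA
    "ad_std_msa": grp.std(),      # standard deviation of price within MSA
})

# ad_p<X>_msa for <X> in range(0, 100, 5): value at the Xth price percentile.
for x in range(0, 100, 5):
    msa[f"ad_p{x}_msa"] = grp.quantile(x / 100.0)
```

The age columns (ad_mean_age, ad_p<X>_age_msa) follow the same pattern on the extracted age field.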

Phone Entity-level metrics

  • Phones
  • MIT Author identifiers
  • Rebecca's stylometry clusters (when finished)
  • IST Cluster IDs (others TBA, as they come up)

No metrics at this level are in the original makefile, but we have been moving in that direction. (See, for instance, make_phone_characteristics.py and make_phone_level.py.) These metrics come from Steve Bach's computations.

  • Talk to Senthil about the future pipeline for these features. (@jeffborowitz)
  • All values in data/bach/phones.csv. (Not in repo)
    • n_ads - Number of ads posted by this phone number in sample
    • n_distinct_locations - Number of unique cities
    • location_tree_length - a measure of how far apart the phone number appears around the US
    • n_incall - Number of ads posted by this phone number that are incalls.
    • n_outcall - Number of ads posted by this phone number that are outcalls.
    • Other metrics… (needs to be broken down)
  • Price metrics
    • Share of ads under this phone that have prices
    • Median price per phone number
    • Average price per phone number
    • Standard deviation per phone number
    • Number of unique prices
  • Other
    • Double check multiple cities and states
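The phone-level metrics could be assembled with a single named aggregation along these lines. The output column names mirror the list above where possible (n_ads, n_distinct_locations), but the input schema and the price-metric names are illustrative assumptions; the real values live in data/bach/phones.csv:

```python
import pandas as pd

# Toy ad-level input; phone, city, and price_per_hour are assumed names.
ads = pd.DataFrame({
    "phone": ["555-0001"] * 3 + ["555-0002"] * 2,
    "city": ["A", "B", "A", "C", "C"],
    "price_per_hour": [100.0, 200.0, None, 300.0, 300.0],
})

phones = ads.groupby("phone").agg(
    n_ads=("price_per_hour", "size"),                        # ads per phone
    n_distinct_locations=("city", "nunique"),                # unique cities
    share_with_price=("price_per_hour",
                      lambda s: s.notna().mean()),           # share of ads priced
    median_price=("price_per_hour", "median"),
    mean_price=("price_per_hour", "mean"),
    std_price=("price_per_hour", "std"),
    n_unique_prices=("price_per_hour", "nunique"),
)
```

Swapping the phone column for an MIT author identifier or a stylometry cluster ID gives the same metrics at those entity levels.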

Future Possibilities

  • MSA-level values, cross-tabbed by gender using ACS data

About

Generate econometric features from MEMEX data!
