The purpose of this repository is to calculate a number of economic metrics on MEMEX escorts
data. It's a small data pipeline intended to take a number of extractions from scraped advertisements and spit out usable JSON data.
The different data targets are managed by the Makefile
. Input data for a given target is passed through a number of Python and R scripts, and a generated CSV is spat out.
Currently, the Makefile takes as input TSV-file dumps of Lattice data stored in S3 buckets, and the tsv files dumped by the different files. Eventually the makefile will replace these S3 dumps Lattice files stored in HDFS. Potentially, it will also take other extractions from the different teams as well.
Currently, the Makefile spits out a number of different CSV files. Eventually it will dump these files into HDFS for general consumption.
The below dependency graph is intended to provide an overview of the makefile and chain of targets that it can create. If it doesn't render correctly, know that all is well! Just click on the "missing" icon to get to the file itself.
We will be pairing down this version of the makefile to focus on a limited number of targets. Specifically:
Note that most ad-level metrics are relatively unimportant. User facing systems already have basic ad information and extracted information. So the theme of what we're going to contribute here is either we're going to impute price for ads or we're going to contribute measures which depend in part on aggregated statistics such as...
- Ad price at ad level (
price_per_hour
inmake_ad_prices.py
) (@gstub)- For ads with multiple prices, hold out 10% of the one-hour prices, impute the value of the held out set, test the size of the match (Jeff, validates parent)
- Price relative to geographic area (supertask)
- Price relative to the city average (lowest level of geographic agg. provided by Lattice) (@gstub)
-
Price relative to the MSA average (price_per_hour
-ad_p50_msa
)/ad_std_msa
(@gstub) - Price relative to the state average (@gstub)
- Find out what list of cities Lattice is using (@pmlandwehr)
- Find the mapping from that list of cities to the MSAs. (@pmlandwehr) (DO NOT get hung up on this.)
- Price quantile relative to geographic area (i.e. is this ad at the 25th percentile? 30th perentile?) (@gstub)
- Price for phone number X relative to the median price for phone number X. (@gstub)
-
Flag difference from MSA average. (i.e. if 30% of ads in an MSA are flagged "Juvenile" the MSA average is .3, so a non-"Juvenile" flagged ad will have a value of -0.3 and a "Juvenile" flagged at 0.7) - Total number of extracted prices per ad (possibly not included in data at this point, but good to have)
- Price (if missing)
- Jeff's first task! To see what to do about text features from CMU
- Jeff's second task! To see about improving the match rates between features and ads.
- Age (if missing)
Eventually, there will be three files here:
- State
- City
- MSA / small geographic region
Values should be calculated quarterly (perhaps monthly? Or a rolling average?)
-
plots of price changes per time period per region w/ more than 300 ads with prices per time period, where time period =
- month
- quarter
- year
-
Those laid out in
msa_characteristics.csv
(generated bymake_msa_characteristics.py
)-
ad_count_msa
- number of ads - Lattice-extracted Price data per geographic area
-
ad_mean_msa
- Mean -
ad_std_msa
- Standard deviation of price within MSA -
ad_p<X>_msa
, where<X> = range(0, 100, 5)
- value at the Xth price percentile
-
- Lattice flag data per geographic area
- Lattice-extracted Age data per geographic area
-
ad_mean_age
- Mean -
ad_p<X>_age_msa
where<X> = range(0, 100, 5)
- value at the the Xth age percentile
-
-
- Phones
- MIT Author identifiers
- Rebecca's stylometry clusters (when finished)
- IST Cluster IDs) (Others TBA, as they come up.
No metrics at this level are in the original makefile, but we have been moving in that direction. (See, for instance, make_phone_characteristics.py
and make_phone_level.py
.) These metrics come from Steve Bach's computations.
- Talk to Senthil about the future pipeline for these features. (@jeffborowitz)
- All values in
data/bach/phones.csv
. (Not in repo)-
n_ads
- Number of ads posted by this phone number in sample -
n_distinct_locations
- Number of unique cities -
location_tree_length
- a measure of how far apart the phone number appears around the US -
n_incall
- Number of ads posted by this phone number that are incalls. -
n_outcall
- Number of ads posted by this phone number that are outcalls. -
Other metrics… (needs to be broken down)
-
- Price metrics
- Share of ads under this phone that have prices
- Median price per phone number
- Average price per phone number
- Standard deviation per phone number
- Number of unique prices Other
- Double check multiple cities and states
- MSA-level values, cross-tabbed by gender using ACS data