Preprocess repository

This repository contains code to preprocess HERE data.

Required Programs

You will need python 3.5 and numpy installed in order to preprocess data.

You will need pandas and matplotlib to run the optional visualization functions in TripAnalysis.py

Files in Repository

Preprocessing

preprocess.py
filter.py
utils.py
organize_data.py
organize_raw.sh
run_month.sh
run_pipeline.sh

Visualization and Analysis

TripAnalysis.py

Outputs

Running preprocess.py will generate the following csv files:

processed.csv for CONSUMER and FLEET
trip_meta.csv for CONSUMER and FLEET
probe_meta.csv for CONSUMER and FLEET
raw_trips.csv for CONSUMER and FLEET
bounding_box.csv
filtered.csv for CONSUMER and FLEET

The processed file contains one row for each probe data point that met the various preprocessing parameters, with all of the information in the raw data file except for system_date. It has one additional column trip_id, an integer which does not have any intrinsic value. Trip_id can be used to identify which data points belonging to a specific probe_id are in the same trip.

trip_meta contains one row for each trip included in processed, with the probe_id, provider, start time and end time of the trip, trip size (number of data points), trip duration, trip median speed, trip median speed excluding zeros, sampling rate for the trip (number of data points per minute), and trip_id. trip_meta can be thought of as a file containing all filtered trips. Every trip in trip_meta passed the trip filtering, which removed trips covering too little distance or too little time according to customizable min_trip_distance and min_trip_duration parameters.

probe_meta contains one row for each probe included in processed, with the probe_id, provider, min delta (difference in time between consecutive data points), max delta, median delta, and sampling rate for the probe.

bounding_box contains one row for each probe in the raw data file with latitude and longitude within a defined I-210 Region. We have the upper left corner of the box at 34.188788, -118.182502 and lower right at 34.037687, -117.855723. This includes a portion of the I-10 Freeway as well as the I-210. The row retains all columns from the raw data file except for the system date.

filtered contains one row for each probe in the raw data file for points that are within the bounding box AND pass several basic data-quality checks. The row retains all columns from the raw data file except for the system date.

raw_trips contains one row for each probe that was included in a trip. Each row contains all data included in a trip. Raw trips only need to satisfy the conditions required in defining a trip.

Key Terms:

time delta: This refers to the difference in time (seconds) between two consecutive data points from the same probe_id
trip: A sequence of consecutive data points from the same probe_id, where each pair of points is separated by no more than maxGAP number of seconds AND the trip includes at least minSIZE data points.

Running files

Setup

Clone this repo (at minimum, download preprocess.py, utils.py, organize_data.py, organize_raw.sh)
Create data_raw and analysis_output directories
Download individual raw data files or zip folders of files from Box. Move raw data into the data_raw directory. Raw data files must be in the format probe_data_I210.date.waynep1.csv (or .gz) This should be the case if downloaded directly from Box. Files do not need to be extracted or organized into folders manually.

Examples

1. Organize

The first time you preprocess data, you can run organize_raw.sh in the data_raw directory. This will extract all gz files from zipped folders in the directory, move the files to the appropriate month folder (creating them if they don't already exist), expand the gz files to csv, and delete the zip folder.

bash organize_raw.sh

2. Filter or Preprocess

After data has been extracted from zipped folders and is in csv format, you have more control over start and end dates. The format is always python <filename>.py <data source directory> <start date> <optional end date>

Filter data from 20171017 to 20171022

python filter.py data_raw 20171017 20171022

Preprocess data from 20171017, starting from raw data

python preprocess.py data_raw 20171017

Preprocess data from 20171017, starting from filtered file in analysis_output

python preprocess.py analysis_output 20171017

Preprocess data from January. 27 to Feb. 8 2017, starting from filtered data

python preprocess.py analysis_output 20170127 20170208

3. Alternatives

If you are only planning on filtering or preprocessing one month of data, you can simply use run_month.sh. This script will organize all raw data and then preprocess or filter the month in question starting from the source directory, depending on the argument you pass in.

bash run_month.sh <preprocess/filter> <source> 2017_07

run_pipeline.sh is similar to run_month, but runs either filter or preprocess on every file in the source directory

bash run_pipeline.sh <preprocess/filter> <source>

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
SampleAnalysis		SampleAnalysis
SampleRateAnalysis		SampleRateAnalysis
Writing		Writing
README.md		README.md
TripAnalysis.py		TripAnalysis.py
heading_histograms.py		heading_histograms.py
preprocess.py		preprocess.py
preprocessVisualization.py		preprocessVisualization.py
sampleRateCalc.py		sampleRateCalc.py
size_analysis.csv		size_analysis.csv
testPreprocessPARAMETER.py		testPreprocessPARAMETER.py

lijiayi9712/PATH_project

Folders and files

Latest commit

History

Repository files navigation

Preprocess repository

Required Programs

Files in Repository

Preprocessing

Visualization and Analysis

Outputs

Key Terms:

Running files

Setup

Examples

1. Organize

2. Filter or Preprocess

3. Alternatives

About

Resources

Stars

Watchers

Forks

Languages