NYC_taxi_dataset

The aim of this project is to process NYC Taxi Trip Record Data.

https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

• Load raw data files to HDFS.

• Transform them and write the results as Parquet files.
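
The two steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_put_command`, `load_to_hdfs` and `csv_to_parquet` are hypothetical, and it assumes the raw files are CSV with a header row and that the `hdfs` command-line client is on the PATH.

```python
import subprocess


def build_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a local file to HDFS."""
    # -f overwrites the target file if it already exists.
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def load_to_hdfs(local_path, hdfs_dir):
    """Copy one raw file to HDFS; raises CalledProcessError on failure."""
    subprocess.run(build_put_command(local_path, hdfs_dir), check=True)


def csv_to_parquet(spark, raw_path, parquet_path):
    """Read raw CSV files with an existing SparkSession and write Parquet."""
    # Assumption: the raw files are CSV with a header row.
    df = spark.read.csv(raw_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)
```

Passing the SparkSession in as an argument keeps the helpers importable on a machine without Spark; only calling `csv_to_parquet` needs a running session.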

How to prepare for running

Download the main script load_file.py, together with utils.py and config.json.
Set your own values in config.json (HDFS paths, local directories, etc.).
Run the script as described under "How to run".
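
As an illustration of how environment-specific settings might be read from config.json, here is a small sketch. The layout it assumes (top-level keys named after environments such as "prod", each holding that environment's paths) and the function name `load_config` are hypothetical; check the real config.json for the actual structure.

```python
import json


def load_config(config_path, env):
    """Return the settings for one environment from a JSON config file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: top-level keys are environment names such as "prod".
    if env not in config:
        raise KeyError(f"environment {env!r} not found in {config_path}")
    return config[env]
```

Failing early on an unknown environment name avoids submitting a Spark job that would only break later on a missing path.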

How to run

Set variables:

PAR_ENV='prod';

PAR_CONFIG_PATH='{path to config json}/config.json';

Use this spark-submit command to run the script:

{path to spark dir, e.g.}/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --conf "spark.pyspark.python={path to python, e.g.}/spark240python3/bin/python" \
   --conf "spark.pyspark.driver.python={path to python driver, e.g.}/spark240python3/bin/python" \
   {path to main load_file.py file}/load_file.py "$PAR_ENV" "$PAR_CONFIG_PATH"
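
The shell expands the two variables before the job is submitted, so the script receives them as two positional arguments: the environment name, then the config path. A minimal sketch of how a script could read them is shown here; the function name `parse_args` and the argument names are hypothetical, and load_file.py may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments passed after the script path."""
    parser = argparse.ArgumentParser(description="Process NYC taxi trip data.")
    parser.add_argument("env", help="environment name, e.g. prod")
    parser.add_argument("config_path", help="path to config.json")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["prod", "/tmp/config.json"])
    print(args.env, args.config_path)
```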
