MLB Analytics App

This app does analytics on Major League Baseball open datasets available on Kaggle, using PostgreSQL, Apache Druid, Flask and Python.

Uses CSV data from https://www.kaggle.com/pschale/mlb-pitch-data-20152018

🧰 This project consists of three components

  • PostgreSQL - primary transactional database (stores the strike events, games and player tables)
  • Druid - columnar data warehouse for analytics; stores the denormalized table
  • Flask - serves user queries

🚀 Getting Started

Requirements

Docker and Docker Compose (for the Druid cluster), a local PostgreSQL instance, and a Python environment managed with Conda or pip.

Setting up the App

Run the following in your favorite terminal. This assumes you use Conda for Python environment and package management; if not, install the equivalent packages with pip (a pip one-liner follows the Conda commands).

conda install -c conda-forge pydruid
conda install -c anaconda flask
conda install -c anaconda psycopg2
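
If you are not using Conda, a common pip equivalent (using the psycopg2-binary wheel, which avoids building psycopg2 from source) would be:

pip install pydruid flask psycopg2-binary
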
  • Ingest the CSV data into the PostgreSQL database
1. Start an instance of PostgreSQL on localhost
2. Using a psql client (on the terminal or via pgAdmin), set up the tables with the DDL queries in the
   ./database/ddl/tables/*.sql files. Confirm the tables were created before proceeding.
3. Update ./config.py with the full path to where the CSV files are stored on your machine.
4. Run python indexer_main.py to load the data into Postgres. Confirm the data loaded by querying the
   tables created in Postgres (a sketch of this kind of load follows these steps).
5. Create a denormalized table by running the query in ./database/ddl/tables/denormalized.sql,
   then export the result as a CSV from the pgAdmin client.
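
For reference, here is a minimal sketch of the kind of CSV-to-PostgreSQL load that indexer_main.py performs. The connection parameters, table name and CSV path below are placeholders, not the project's actual values; in the real app the paths come from ./config.py and the tables from the DDL files.

# Illustrative sketch only - the real load lives in indexer_main.py
import psycopg2

CSV_PATH = "/full/path/to/pitches.csv"  # placeholder; the real path is set in ./config.py
TABLE = "pitches"                       # placeholder; actual tables come from ./database/ddl/tables/*.sql

conn = psycopg2.connect(host="localhost", dbname="mlb", user="postgres", password="postgres")
with conn, conn.cursor() as cur, open(CSV_PATH) as f:
    # COPY ... FROM STDIN streams the CSV straight into the table; HEADER skips the header row
    cur.copy_expert(f"COPY {TABLE} FROM STDIN WITH (FORMAT csv, HEADER true)", f)
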
  • Start a Druid Cluster + Ingest data into Druid
1. Run docker-compose up with the provided docker-compose.yml - this should give you a fully operational Druid cluster
2. Open the Druid UI at localhost:8888 in your browser
3. Use the Druid console to ingest the denormalized CSV generated in Step 5 of the previous section - follow the steps
   on this page: https://druid.apache.org/docs/latest/tutorials/tutorial-batch.html#loading-data-with-a-spec-via-console
   Use ./druid/ingestion_specs/denormalized_spec.json as your spec file. Once the spec is submitted, it
   will trigger an ingestion task, which can be monitored in the Tasks console at localhost:8888 in the browser.
4. Confirm that the CSV was ingested by going to the query pane and querying the data (a programmatic
   check is sketched after these steps).
5. You are set - the data is in Druid.
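
If you prefer a programmatic check, pydruid exposes a DB-API connection to Druid's SQL endpoint. A minimal sketch, assuming the router started by docker-compose listens on localhost:8888 and your ingestion spec created a datasource named "denormalized" (adjust to your actual datasource name):

from pydruid.db import connect

# Connect to the SQL endpoint exposed by the Druid router (localhost:8888 in this setup)
conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
curs = conn.cursor()
curs.execute("SELECT COUNT(*) FROM denormalized")  # placeholder datasource name
print(curs.fetchone())
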
  • Start the Flask server to send queries from the browser
Run the following three commands from the terminal:

export FLASK_APP=flaskr
export FLASK_ENV=development
flask run

Then navigate to the Flask app running at http://127.0.0.1:5000/
  • You should now have the Flask app running locally! 🎉 Try the following endpoints from the browser; each queries the Druid cluster and responds with JSON results (a sketch of this route pattern follows the list):
1. /games_per_venue
2. /strike_type_count
3. /strike_type_count/<strike_event> e.g. /strike_type_count/Home%20Run
4. /player_score_count
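
The actual routes live in the flaskr package. Purely to illustrate the pattern these endpoints follow (Flask route -> Druid SQL via pydruid -> JSON response), a hypothetical minimal route might look like the sketch below; the datasource and column names are placeholders, not the project's real schema.

from flask import Flask, jsonify
from pydruid.db import connect

app = Flask(__name__)

@app.route("/games_per_venue")
def games_per_venue():
    # Placeholder datasource/column names; the real app queries the ingested denormalized datasource
    conn = connect(host="localhost", port=8888, path="/druid/v2/sql/", scheme="http")
    curs = conn.cursor()
    curs.execute("SELECT venue_name, COUNT(DISTINCT g_id) AS games FROM denormalized GROUP BY venue_name")
    cols = [col[0] for col in curs.description]
    return jsonify([dict(zip(cols, row)) for row in curs])
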

TODO

  1. Render the data in the browser with D3.js (https://d3js.org/)
  2. Hook up an Apache Superset (https://superset.apache.org/) dashboard to Druid to visualize the data in Superset.
