This app performs analytics on open Major League Baseball datasets from Kaggle, using Apache Druid, Flask, and Python
Uses CSV data from https://www.kaggle.com/pschale/mlb-pitch-data-20152018
- PostgreSQL - primary transactional database (stores the strike events, games, and player tables)
- Druid - columnar data warehouse for analytics; stores the denormalized table
- Flask - serves user queries
Run the following in your favorite terminal, assuming you use Conda for Python env/package management; if not, run the pip equivalents
conda install -c conda-forge pydruid
conda install -c anaconda flask
conda install -c anaconda psycopg2
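If you manage packages with pip instead of Conda, the equivalent install would be roughly the following (psycopg2-binary is the prebuilt-wheel variant of psycopg2, which avoids needing a local build toolchain):

```shell
pip install pydruid flask psycopg2-binary
```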
- Ingest the CSV data into the PostgreSQL database
1. Start an instance of PostgreSQL on localhost
2. Using a psql client (either on the terminal or the pgAdmin client), set up the tables using the DDL queries given in
the ./database/ddl/tables/*.sql files. Confirm the tables were created before proceeding.
3. Update ./config.py with the full path to where the CSV files are stored on your machine
4. Run python indexer_main.py to load the data into Postgres. Confirm the data loaded by querying the tables created in Postgres
5. Create a denormalized table by running the query in ./database/ddl/tables/denormalized.sql,
then export the result as a CSV file from the pgAdmin client
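The bulk load and the CSV export in the steps above both come down to Postgres COPY statements. A minimal sketch of that idea using psycopg2 (the DSN, directory, and table names below are hypothetical placeholders, not this repo's actual schema; indexer_main.py is the real loader):

```python
def copy_in_sql(table):
    """COPY statement for bulk-loading a header-bearing CSV into a table."""
    return f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"

def copy_out_sql(query):
    """COPY statement for exporting a query's result set as CSV."""
    return f"COPY ({query}) TO STDOUT WITH (FORMAT csv, HEADER true)"

def load_csvs(dsn, csv_dir, tables):
    """Bulk-load one CSV per table; requires a running Postgres instance."""
    import psycopg2  # imported here so the SQL helpers above work without the driver
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in tables:
            with open(f"{csv_dir}/{table}.csv") as f:
                cur.copy_expert(copy_in_sql(table), f)

def export_csv(dsn, query, out_path):
    """Export a query result as CSV - an alternative to exporting from pgAdmin."""
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        with open(out_path, "w") as f:
            cur.copy_expert(copy_out_sql(query), f)

# Example usage (names are hypothetical; needs a running Postgres):
# load_csvs("dbname=mlb user=postgres", "/path/to/csv", ["games", "pitches"])
# export_csv("dbname=mlb user=postgres", "SELECT * FROM denormalized", "denormalized.csv")
```

COPY streams the whole file in one round trip, which is considerably faster than row-by-row INSERTs for datasets of this size.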
- Start a Druid Cluster + Ingest data into Druid
1. Run docker-compose up against the provided docker-compose.yml - this should give you a fully operational Druid cluster
2. Open Druid UI on localhost:8888 in your browser
3. Use the Druid console to ingest the denormalized CSV generated in Step 5 of the previous section - follow the steps
on this page - https://druid.apache.org/docs/latest/tutorials/tutorial-batch.html#loading-data-with-a-spec-via-console
Use ./druid/ingestion_specs/denormalized_spec.json as your spec file. Once this file is submitted, it
will trigger an ingestion task, which can be monitored in the Tasks console at localhost:8888 in the browser
4. Confirm that the CSV was ingested by going to the query pane and querying the data.
5. You are set - data is in Druid.
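You can also confirm the ingestion from Python instead of the query pane. A minimal stdlib-only sketch against the router's SQL endpoint (the "denormalized" datasource name is an assumption - use whatever name your spec assigned):

```python
import json
from urllib.request import Request, urlopen

def sql_payload(query):
    """JSON body expected by Druid's SQL endpoint (POST /druid/v2/sql/)."""
    return json.dumps({"query": query}).encode()

def query_druid(query, router="http://localhost:8888"):
    """Run a SQL query against the Druid router and return the parsed rows."""
    req = Request(router + "/druid/v2/sql/", data=sql_payload(query),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

# Example (requires the docker-compose cluster to be up):
# query_druid("SELECT COUNT(*) AS row_count FROM denormalized")
```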
- Start the Flask server to serve queries from the browser
Run the following three commands in your terminal
export FLASK_APP=flaskr
export FLASK_ENV=development
flask run
Then navigate to the Flask app running at http://127.0.0.1:5000/
- You should now be running the Flask app locally! 🎉 Try the following endpoints from the browser; each queries the Druid cluster and responds with JSON results
1. /games_per_venue
2. /strike_type_count
3. /strike_type_count/<strike_event> e.g. /strike_type_count/Home%20Run
4. /player_score_count
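Under the hood, each endpoint is just a Flask route that forwards a SQL query to Druid and returns the rows as JSON. A sketch of one such route - the datasource and column names here are illustrative assumptions, not the app's actual schema (see the flaskr package for the real routes):

```python
import json
from urllib.request import Request, urlopen

from flask import Flask, jsonify

app = Flask(__name__)

def run_druid_sql(query, router="http://localhost:8888"):
    """POST a SQL query to the Druid router and return the parsed JSON rows."""
    req = Request(router + "/druid/v2/sql/", data=json.dumps({"query": query}).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

@app.route("/games_per_venue")
def games_per_venue():
    # Datasource and column names are hypothetical - match your ingestion spec.
    return jsonify(run_druid_sql(
        "SELECT venue_name, COUNT(DISTINCT g_id) AS games "
        "FROM denormalized GROUP BY venue_name"))
```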
TODO
- Render the data on the browser through D3.js (https://d3js.org/)
- Hook up an Apache Superset (https://superset.apache.org/) dashboard to Druid to visualize the data in Superset.