This is a complete overhaul of alberlyu's nbadb project. At this point, most of the original code has been rewritten, and the original project serves more as an idea than something to build on.
The code is being rewritten to be:
...By fetching data by date, and then by game_id, instead of by resource type
...By using a pool of worker threads and by aggregating many INSERT statements into bulk inserts. For loading small chunks of data, the new version of the code is roughly 10x faster. Performance testing is still in progress.
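As a rough illustration of what aggregating many INSERTs into one statement can look like, here is a minimal sketch. It is not the project's actual code: the function name build_bulk_insert and the table and column names in the usage note are hypothetical. It only builds the SQL text and the flat parameter list, using the %s placeholder style that psycopg2 expects, so it runs without a database.

```python
def build_bulk_insert(table, columns, rows):
    """Build one multi-row INSERT instead of one INSERT per row.

    Returns (sql, params). The table and column names are interpolated
    directly into the SQL, so they must come from trusted code, never
    from scraped data. Row values are passed separately as parameters.
    """
    rows = [tuple(row) for row in rows]
    if not rows:
        raise ValueError("need at least one row")
    if any(len(row) != len(columns) for row in rows):
        raise ValueError("every row must have one value per column")

    # One "(%s, %s, ...)" group per row, joined by commas.
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([one_row] * len(rows))
    )
    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params
```

With a psycopg2 cursor, the result would be used as cursor.execute(sql, params). For example, build_bulk_insert("staging_example", ["game_id", "pts"], rows) with two rows produces a single statement with two value groups, which saves one round trip to the server per extra row.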
The code was barely functional when I started. My goal is to be able to scrape every morsel of information about the NBA, with or without schemas or structure.
Currently there are a few flavors of the staging script. All take the same arguments, in one of these three ways:
- python staging.py: loads data for yesterday's games
- python staging.py 2014-12-25: loads data for only December 25, 2014
- python staging.py 2014-12-25 2015-12-1: loads data from December 25, 2014 to December 1, 2015
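The three invocation forms above could be parsed along these lines. This is a sketch, not the code in the staging scripts: parse_date_range and iter_dates are hypothetical helper names, and the optional today argument exists only to make the behavior easy to check.

```python
import datetime

DATE_FORMAT = "%Y-%m-%d"  # also accepts a single-digit day such as 2015-12-1


def parse_date_range(args, today=None):
    """Turn 0, 1, or 2 command-line date strings into (start, end) dates."""
    today = today or datetime.date.today()
    if len(args) == 0:
        # No arguments: yesterday only.
        start = end = today - datetime.timedelta(days=1)
    elif len(args) == 1:
        # One argument: that single date.
        start = end = datetime.datetime.strptime(args[0], DATE_FORMAT).date()
    elif len(args) == 2:
        # Two arguments: an inclusive range.
        start = datetime.datetime.strptime(args[0], DATE_FORMAT).date()
        end = datetime.datetime.strptime(args[1], DATE_FORMAT).date()
    else:
        raise ValueError("expected at most two dates")
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end


def iter_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)
```

A script could call parse_date_range(sys.argv[1:]) and then loop over iter_dates(start, end) to process one day at a time.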
The flavors differ in their efficiency:
- staging.py is most similar to the original code and is kept around to check that the newer versions work properly. This is the slowest.
- staging_multi.py uses a pool of 10 workers and one thread per game to speed up downloads.
- staging_multi2.py builds on staging_multi.py by eliminating duplicate queries and aggregating inserts into single insert statements.
- staging_multi3.py uses a pool of 20 workers and one thread per date, in addition to the other enhancements. This is the fastest by far: it loads data from just before the '09 season to Dec. 1, 2015 in about an hour.
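To make the "pool of workers, one task per date" idea concrete, here is a minimal sketch using the standard library's multiprocessing.pool.ThreadPool. It is an assumption about the general shape, not a copy of staging_multi3.py: load_dates is a hypothetical name, and fetch_date stands in for whatever function downloads and stores one day of games.

```python
from multiprocessing.pool import ThreadPool


def load_dates(dates, fetch_date, workers=20):
    """Run fetch_date(date) for every date on a pool of worker threads.

    Results come back in the same order as the input dates. Threads suit
    this job because the work is mostly waiting on network responses.
    """
    pool = ThreadPool(workers)
    try:
        # map() blocks until every date has been processed.
        return pool.map(fetch_date, dates)
    finally:
        # Stop accepting work and wait for the worker threads to exit.
        pool.close()
        pool.join()
```

If fetch_date writes to the database, each worker would likely need its own connection or some other form of coordination, since sharing one connection across threads serializes the work.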
The original installation instructions found below are pretty good, so please follow those after reading my notes here.
I'm using Windows 8, and I've updated the required version of psycopg2 to 2.6.1. On my machine, pip install -r requirements.txt runs without any errors. Hopefully it does for you too.
I would add that I am using the PowerShell wrapper for virtualenv with good success, and highly recommend this tutorial so you can use it too.
A Python project to extract, transform, and load NBA data into a PostgreSQL database.
This project was built with Python 2.7.5 and PostgreSQL 9.4.0. That does not mean it won't work with Python 3 or other PostgreSQL versions; I just haven't tested those yet. It should work on both Windows and Unix operating systems. I think it's best to create your nbadb within a virtual environment for easy replication.
In your nbadb folder, start a virtualenv instance (see the virtualenv docs for more information) and install the required modules:
$ virtualenv ENV
$ source ENV/bin/activate # For Unix machines
$ \path\to\ENV\Scripts\activate # For Windows machines
$ pip install -r requirements.txt
If you are on a Windows machine and installing psycopg2 fails with the message 'error: Unable to find vcvarsall.bat', you will need to install psycopg2 directly, as this is a known issue with installing psycopg2 on Windows. To do so, run the following:
$ easy_install http://stickpeople.com/projects/python/win-psycopg/2.5.3/psycopg2-2.5.3.win32-py2.6-pg9.3.4-release.exe
For more details, see the following link.
Update your config.ini file with your PostgreSQL credentials. You will need to create a new database called 'nbadb', which you can do by using PostgreSQL's createdb wrapper on the command line or by executing a CREATE DATABASE statement in a psql session.
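For readers unfamiliar with INI files, the sketch below shows one way credentials can be read with Python 3's configparser module (on Python 2.7 the module is named ConfigParser). The section name and keys shown here are assumptions for illustration only; check the config.ini shipped with the project for the real layout. read_db_settings is a hypothetical helper.

```python
import configparser

# Hypothetical layout; the project's real config.ini may use other names.
EXAMPLE_CONFIG = """
[postgresql]
host = localhost
port = 5432
dbname = nbadb
user = postgres
password = secret
"""


def read_db_settings(text, section="postgresql"):
    """Parse INI text and return the chosen section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # All values come back as strings, including the port.
    return dict(parser[section])
```

With keys named as above, the resulting dict could be passed straight to psycopg2.connect(**settings), since dbname, user, password, host, and port are all keyword arguments it accepts.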
nbadb starts by loading raw data from the source as-is and dumping it into staging tables. The source includes data from scoreboards, box scores, play-by-play logs, and shot chart detail. There is also data on players, including player profile information, player shot logs, and player rebound logs.
To load data from the entire 2013-14 season into staging tables:
$ python load_staging.py 2013-10-29 2014-04-16 # start to end of 2013-14 regular season
$ python update_players.py 2013-14 # loads data for 2013-14 season only
To load data from the beginning of the 2014-15 season until yesterday's games:
$ python load_staging.py # if no argument, start date set at '2014-10-28'
$ python update_players.py # if no argument, loads data for 2014-15 season only
To drop all staging tables (including the player tables), simply run the drop_staging.py script:
$ python drop_staging.py
TBD
- Data courtesy of NBA.com