This repository contains scripts to acquire, clean and process the spending information released by the UK central government.
The scripts have several stages that need to be run in order:
build_index - will find all related metadata (tagged: spend-transactions) on data.gov.uk
retrieve - will then try to fetch all the files
extract - will attempt to parse CSV/XLS/... and load it into a DB
combine - column names are mapped and values are stored in one central table
cleanup
validate
report - creates the report HTML
First clone this repo:
git clone https://github.com/okfn/dpkg-uk25k.git
You need to install the dependencies (best in a python virtual environment):
virtualenv pyenv-dpkg-uk25k
pyenv-dpkg-uk25k/bin/pip install -r dpkg-uk25k/requirements.txt
The default configuration is in default.ini. If you want to change the configuration, copy it to config.ini and edit it there.
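As an illustration of the default.ini / config.ini convention, here is a minimal sketch using Python's standard configparser module. It is not taken from the scripts in this repo: the function name load_config is hypothetical, and layering config.ini on top of default.ini (rather than using it as a full replacement) is an assumption.

```python
import configparser


def load_config(paths=("default.ini", "config.ini")):
    """Read the given INI files in order; later files override earlier ones.

    Hypothetical helper for illustration only, not part of this repo.
    """
    # interpolation=None keeps '%' characters (e.g. in URLs) literal.
    parser = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips files that do not exist,
    # so a missing config.ini simply leaves the defaults in place.
    parser.read(list(paths))
    return parser


if __name__ == "__main__":
    config = load_config()
    for section in config.sections():
        for key, value in config.items(section):
            print("[%s] %s = %s" % (section, key, value))
```

Run from the directory that holds the two files, this prints the effective settings, which can help confirm that an edit to config.ini has been picked up.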
Before you can run the scripts you need to prepare a database:
sudo -u postgres createdb uk25k
Now create a postgres user for your unix user name:
sudo -u postgres createuser -D -R -S $USER
And allow access to the database by editing /etc/postgresql/9.1/main/pg_hba.conf and adding this line:
local uk25k all trust
Now restart postgres:
sudo service postgresql restart
Run the scripts like this:
. pyenv-dpkg-uk25k/bin/activate
cd dpkg-uk25k
python build_index.py
python retrieve.py
python extract.py
python combine.py
python cleanup.py
python validate.py
python report.py reports
Or do the whole lot together:
python build_index.py && python retrieve.py && python extract.py && python combine.py && python cleanup.py && python validate.py && python report.py reports
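The same chain can be sketched as a small Python driver that stops at the first failing stage, like the && form above. This is only an illustration: the names STAGES, stage_commands and run_pipeline are hypothetical and not part of this repo, and the sketch assumes it is run from inside dpkg-uk25k with the virtualenv's interpreter, which it reuses via sys.executable.

```python
import subprocess
import sys

# Stage order as listed in this README.
STAGES = ["build_index", "retrieve", "extract", "combine",
          "cleanup", "validate", "report"]


def stage_commands(publisher=None, report_dir="reports"):
    """Build one argv list per stage, in the order they must run."""
    commands = []
    for stage in STAGES:
        argv = [sys.executable, stage + ".py"]
        # An optional publisher name is passed to build_index only.
        if stage == "build_index" and publisher:
            argv.append(publisher)
        # report.py takes the output directory as its argument.
        if stage == "report":
            argv.append(report_dir)
        commands.append(argv)
    return commands


def run_pipeline(publisher=None, report_dir="reports"):
    """Run every stage in order; abort on the first non-zero exit code."""
    for argv in stage_commands(publisher, report_dir):
        print("Running: " + " ".join(argv))
        # check_call raises CalledProcessError if the stage fails.
        subprocess.check_call(argv)


if __name__ == "__main__":
    run_pipeline(sys.argv[1] if len(sys.argv) > 1 else None)
```

Saved under a hypothetical name such as run_all.py, it would be invoked as python run_all.py, or python run_all.py wales-office to pass a publisher name through to build_index.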
Before running the scripts again, be sure to clear out old data, either from the issues table alone or from all tables. To clear all tables, drop and recreate the database:
sudo -u postgres dropdb uk25k
sudo -u postgres createdb uk25k
To limit the analysis to one publisher, specify the name as a parameter to build_index:
python build_index.py wales-office
?
- PDFs
- Zip files containing a bunch of CSVs (potentially for a number of publishers)
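For the zip case, one possible starting point is to enumerate the CSV members of an archive with Python's standard zipfile module. This is a sketch under assumptions, not something the existing scripts do: iter_csv_members is a hypothetical helper, and working out which publisher each CSV belongs to is left open.

```python
import zipfile


def iter_csv_members(zip_source):
    """Yield (member_name, raw_bytes) for every .csv file in a zip archive.

    zip_source may be a path or a seekable file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            # Skip directory entries.
            if info.filename.endswith("/"):
                continue
            # Keep only CSV files, matching the extension case-insensitively.
            if not info.filename.lower().endswith(".csv"):
                continue
            yield info.filename, archive.read(info)
```

Each yielded pair could then be handed to the same CSV parsing used for files fetched directly.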