Skip to content

cuecue123/ckanext-shift

 
 

Repository files navigation

image

image

Downloads

Latest Version

Supported Python versions

Development Status

License

ckanext-shift

Loads CSV (and similar) data into DataStore. Designed as a replacement for DataPusher.

Key differences from DataPusher

Simpler queueing tech

DataPusher - job queue is done by ckan-service-provider which is bespoke, complicated and stores jobs in its own database (sqlite by default).

ckanext-shift - job queue is done by RQ, which is simpler and is backed by Redis and allows access to the CKAN model. You can also debug jobs easily using pdb. Job results are currently still stored in its own database, but the intention is to move this relatively small amount of data into CKAN's database, to reduce the complication of install.

(The other obvious candidate is Celery, but we don't need its heavyweight architecture and its jobs are not debuggable with pdb.)

Speed of loading

DataPusher - parses CSV rows, converts to detected column types, converts the data to a JSON string, calls datastore_create for each batch of rows, which reformats the data into an INSERT statement string, which is passed to PostgreSQL.

ckanext-shift - pipes the CSV file directly into PostgreSQL using COPY.

We tested the load of a very large CSV (1000000 rows, 475MB): * DataPusher: 35 minutes * ckanext-shift: 1 minute

Robustness

DataPusher - one cause of failure was when casting cells. The type of a column was detected on the values of the first few rows, so if a column is mainly numeric or dates, but a string (like "Null") comes later on, then this will cause the load to finish at that point.

ckanext-shift - loads all the cells as text, before allowing the admin to convert columns to the types they want (Data Dictionary feature). In future it could do automatic detection and conversion.

Separate web server

DataPusher - has the complication that the queue jobs are done by a separate (Flask) web app, aside from CKAN. This was the design because the job requires intensive processing to convert every line of the data into JSON. However it means more complicated code as info needs to be passed between the services in http requests, more for the user to set-up and manage - another app config, another apache config, separate log files.

ckanext-shift - the job runs in a worker process, in the same app as CKAN, so can access the CKAN config, db and logging directly and avoids many HTTP calls. This simplification makes sense because the shift job doesn't need to do much processing - mainly it is streaming the CSV file from disk into PostgreSQL.

Not yet complete

  • Only supports CSVs, not XLS etc
  • No support for private datasets
  • Once loaded in Datastore, search is not yet working.

Requirements

Works with CKAN 2.7.0 and later.

Works with CKAN 2.3.x - 2.6.x if you install ckanext-rq.

Installation

To install ckanext-shift:

  1. Activate your CKAN virtual environment, for example:

    . /usr/lib/ckan/default/bin/activate
  2. Install the ckanext-shift Python package into your virtual environment:
  1. Install dependencies:

    pip install -r requirements.txt
    pip install -U requests[security]
  2. If you are using CKAN version before 2.8 you need to define the populate_full_text_trigger in your database:

    sudo -u postgres psql datastore_default -f full_text_function.sql

    If successful it will print:

    CREATE FUNCTION
    ALTER FUNCTION

    NB this assumes you used the defaults for the database name and username. If in doubt, check your config's ckan.datastore.write_url. If you don't have database name datastore_default and username ckan_default then adjust the psql option and full_text_function.sql before running this.

  3. Add shift to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/production.ini).

    You should also remove datapusher if it is in the list, to avoid them both trying to load resources into the DataStore.

  4. If it is a production server, you'll want to store jobs info in a more robust database than the default sqlite file:

    sudo -u postgres createdb -O ckan_default shift_jobs -E utf-8

    And add this list to the config:

    ckanext.shift.jobs_db.uri postgresql://ckan_default:pass@localhost/shift_jobs

    (This step can be skipped when just developing or testing.)

  5. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu:

    sudo service apache2 reload
  6. Run the worker. First test it on the command-line:

    paster --plugin=ckan jobs -c /etc/ckan/default/ckan.ini worker

    or if you have CKAN version 2.6.x or less and are using ckanext-rq:

    paster --plugin=ckanext-rq jobs -c /etc/ckan/default/ckan.ini worker

    Test it will load a CSV ok by submitting a CSV in the web interface or in another shell:

    paster --plugin=ckanext-shift shift submit <dataset-name> -c /etc/ckan/default/ckan.ini

    Clearly, running the worker on the command-line is only for testing - for production services see:

    http://docs.ckan.org/en/ckan-2.7.0/maintaining/background-tasks.html#using-supervisor

    If you have CKAN version 2.6.x or less then you'll need to download supervisor-ckan-worker.conf and adjust the command to reference ckanext-rq.

Config Settings

Configuration:

# The connection string for the jobs database used by ckanext-shift. The
# default of an sqlite file is fine for development. For production use a
# Postgresql database.
ckanext.shift.jobs_db.uri = sqlite:////tmp/shift_jobs.db

# The formats that are accepted. If the value of the resource.format is
# anything else then it won't be 'shifted' to DataStore (and will therefore
# only be available to users in the form of the original download/link).
# Case insensitive.
# (optional, defaults are listed in plugin.py - FORMATS).
ckanext.shift.formats = csv application/csv
# In future we hope to add: xls application/vnd.ms-excel

# The maximum size of files to load into DataStore. In bytes. Default is 1MB
# (i.e. 10485760 bytes)
ckanext.shift.max_content_length = 100000000

Development Installation

To install ckanext-shift for development, activate your CKAN virtualenv and do:

git clone https://github.com/davidread/ckanext-shift.git
cd ckanext-shift
python setup.py develop
pip install -r dev-requirements.txt

Upgrading from DataPusher

To upgrade from DataPusher to ckanext-shift:

  1. In your config, on the ckan.plugins line replace datapusher with shift.

TBC

Running the Tests

To run the tests, do:

nosetests --nologcapture --with-pylons=test.ini

If you get error function populate_full_text_trigger() does not exist then you need a CKAN with ckan/ckan#3786. (Even if you create the function on the test database, it gets cleared by: https://github.com/ckan/ckan/pull/3786/files#diff-33d20faeb53559a9b8940bcb418cb5b4R75 )

About

Loads CSV (and similar) data into DataStore. Designed as a replacement for DataPusher.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 92.5%
  • HTML 5.5%
  • Shell 1.5%
  • PLpgSQL 0.5%