Store CKAN assets (org / group images) to cloud storage
This CKAN extension moves storage of uploaded asset files (specifically, organization and group logos and the site logo) to a cloud storage backend. It is designed to be flexible: it supports several storage backends, and more may be added in the future.
NOTE: This does not handle resource storage at all. For offloading resource storage, we recommend using ckanext-blob-storage.
- This extension works with CKAN 2.8.x and CKAN 2.9.x. It may work, but has not been tested, with other CKAN versions.
- You need to have access to a supported Cloud Storage account to store assets in
Supported Storage Backends include:
- Google Cloud Storage
- Azure Blob Storage
- AWS S3 (NOT YET)
- Local Storage (mainly for testing and fallback purposes)
To install ckanext-asset-storage:
- Activate your CKAN virtual environment, for example:
. /usr/lib/ckan/default/bin/activate
- Install the ckanext-asset-storage Python package into your virtual environment:
pip install ckanext-asset-storage
- Add asset_storage to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/production.ini).
- Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:
sudo service apache2 reload
The following CKAN configuration options should be set:
The storage backend type. The following types are supported out of the box:
- local - Local filesystem storage
- google_cloud - Google Cloud Storage
- azure_blobs - Azure Blob Storage
- s3 - AWS S3 storage
You can also write custom storage backends, and specify the fully qualified package.module:Class name of your storage class here. For example, specifying ckanext.my_storage.storage:MyStorageClass will try to use MyStorageClass in the ckanext.my_storage.storage module as a storage backend.
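To illustrate the package.module:Class notation, here is a minimal sketch of how such a string can be resolved to a class in Python. The helper name load_storage_class is hypothetical and is not part of this extension; how the extension itself loads the class is not described here.

```python
import importlib


def load_storage_class(spec):
    """Resolve a 'package.module:Class' string to the class object it names.

    Hypothetical helper for illustration only; not this extension's API.
    """
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError("Expected 'package.module:Class', got {!r}".format(spec))
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class, because MyStorageClass above
# is only an example name and does not exist on your system.
ordered_dict_cls = load_storage_class("collections:OrderedDict")
```

The part before the colon is imported as a module, and the part after it is looked up as an attribute of that module, so the module must be importable from the CKAN virtual environment.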
Options to pass to the storage backend. This should be a Python dict with key-value pairs. The specific option keys depend on the storage backend selected in backend_type and are detailed below.
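As an illustration of what "a Python dict with key-value pairs" looks like when it is read from a text config file, the sketch below parses such a value with ast.literal_eval. This is one safe way to turn a dict literal into a dict; it is an assumption, not a description of how the extension parses the setting. The function name and the example path are hypothetical.

```python
import ast


def parse_backend_options(raw_value):
    """Parse a Python dict literal, as it might be written in an INI file."""
    # literal_eval only accepts literals, so arbitrary code is never executed
    options = ast.literal_eval(raw_value)
    if not isinstance(options, dict):
        raise ValueError(
            "backend options must be a dict, got {}".format(type(options).__name__)
        )
    return options


# Example value for the local backend; the path is a placeholder
raw = '{"storage_path": "/var/lib/ckan/assets"}'
options = parse_backend_options(raw)
```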
Stores assets in the local file system. This is not much different from CKAN's built-in behavior, and thus is mostly useful for testing purposes.
The following configuration options are available:
- storage_path - (required, string) the local directory to store files in
To use Google Cloud Storage, you must have an existing Google Cloud project and bucket. You need to obtain a Google Account Key file (a JSON file downloadable from the Google Console) for a user or a service account that has the "Object Admin" role on the bucket (at the very least they should be able to read, write and delete objects).
The following configuration options are available:
- project_name - (required, string) Google Cloud project name
- bucket_name - (required, string) Google Cloud Storage bucket name
- account_key_file - (required, string) Path to the Google Cloud credentials JSON file
- path_prefix - (optional, string) A prefix to prepend to all stored assets in the bucket
- public_read - (boolean, default True) Whether to allow public read access to uploaded assets. Setting to False means the asset can only be accessed after a request to this code to generate a signed URL. This will have some performance impact. You should keep this at the default unless you consider group / organization images sensitive / private information.
- signed_url_lifetime - (int, default 3600) When public access is not allowed, this sets the max lifetime in seconds of signed URLs. Typically you should not change this.
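The option list above can be summarized as a small validation sketch: three required keys, plus optional keys with the documented defaults. The function name is hypothetical, the example values are placeholders, and treating a missing path_prefix as None is an assumption; the extension's own validation may differ.

```python
REQUIRED_KEYS = ("project_name", "bucket_name", "account_key_file")
# Defaults for public_read and signed_url_lifetime are the documented ones;
# the None default for path_prefix is an assumption meaning "no prefix".
DEFAULTS = {"path_prefix": None, "public_read": True, "signed_url_lifetime": 3600}


def normalize_google_cloud_options(options):
    """Check required keys and fill in defaults for the optional ones."""
    missing = [key for key in REQUIRED_KEYS if not options.get(key)]
    if missing:
        raise ValueError("Missing required options: {}".format(", ".join(missing)))
    unknown = sorted(set(options) - set(REQUIRED_KEYS) - set(DEFAULTS))
    if unknown:
        raise ValueError("Unknown options: {}".format(", ".join(unknown)))
    merged = dict(DEFAULTS)
    merged.update(options)
    return merged


# Placeholder values; replace them with your own project, bucket and key file
gcs_options = normalize_google_cloud_options({
    "project_name": "my-project",
    "bucket_name": "my-ckan-assets",
    "account_key_file": "/etc/ckan/gcs-key.json",
})
```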
To use Azure Blob Storage, you must have an existing Azure account and Blob Storage container.
By default, Azure Blob Storage containers are set to disallow public access to blobs, and this is
not configurable at the blob level - only on the container level. For this reason, this storage
backend provides public URLs if and only if the container is configured to allow public access
to blobs.
If public access is enabled, asset URLs will be direct-to-cloud public URLs. Otherwise, asset
URLs will point to CKAN, which will generate a time-limited SAS signed URL, and redirect the
client to that URL.
The following configuration options are available:
- container_name - Azure Blob Storage container name
- connection_string - The Azure Blob Storage connection string to use
- path_prefix - A prefix to prepend to all stored assets in the container
- signed_url_lifetime - When public access is not allowed, this sets the max lifetime of signed URLs.
When s3 support is available, we will add some documentation here ;-)
When this extension is enabled on an existing CKAN installation, existing organization and group images will most likely need to be migrated to avoid re-uploading all images. The following steps are suggested as a migration procedure:
- It is highly recommended to not delete any of your old storage files or configuration until it has been ensured that migration has completed successfully. If you plan to make DB changes (see below), back up your database in advance.
- Externally referenced ("URL") assets do not need to be migrated. You only need to migrate images which have been uploaded to CKAN.
- If you have used a cloud storage extension which saved the absolute canonical URLs of assets in the CKAN DB, you may continue to use these assets without any migration as long as the old storage is not deleted. New assets will be saved in the ckanext-asset-storage storage.
- As a general rule, you should be able to seamlessly migrate to ckanext-asset-storage by copying files from the old storage path / container to the new storage container while retaining directory structure and file names. The new location should match your configured container / path_prefix, if any.
- If that is not possible, you will need to copy all files to a new location, and then write a script to modify the CKAN database to point to the new location of the files.
- Check your CKAN INI file for the configured local storage directory (ckan.storage_path)
- Recursively copy all files from <storage_path>/storage/uploads/* to your new storage location
  - Your new storage location depends on your selected storage backend and configuration. For example, when using GCP your storage location will be gs://<bucket_name>/<path_prefix>/
  - Note that subdirectories under <storage_path>/storage/uploads/ (e.g. group) should be retained and copied as-is to the new storage path
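The mapping from local files to their new locations can be sketched as follows. This only computes the target object key for each file under <storage_path>/storage/uploads/ and does not upload anything; the actual copy is left to whatever tool you use for your storage backend. The function name plan_upload_keys and the _assets prefix in the demo are illustrative assumptions.

```python
import os
import tempfile


def plan_upload_keys(storage_path, path_prefix=""):
    """Map each file under <storage_path>/storage/uploads to a target object key."""
    uploads_root = os.path.join(storage_path, "storage", "uploads")
    plan = {}
    for dirpath, _dirnames, filenames in os.walk(uploads_root):
        for filename in filenames:
            local_file = os.path.join(dirpath, filename)
            # Keep subdirectories such as "group" as part of the key
            relative = os.path.relpath(local_file, uploads_root)
            key_parts = relative.split(os.sep)
            prefix = path_prefix.strip("/")
            if prefix:
                key_parts.insert(0, prefix)
            # Object keys use forward slashes regardless of the local OS
            plan[local_file] = "/".join(key_parts)
    return plan


# Demo on a throwaway directory that mimics a CKAN storage path
demo_root = tempfile.mkdtemp()
group_dir = os.path.join(demo_root, "storage", "uploads", "group")
os.makedirs(group_dir)
with open(os.path.join(group_dir, "logo.png"), "wb") as handle:
    handle.write(b"")

demo_plan = plan_upload_keys(demo_root, path_prefix="_assets")
```

Reviewing the resulting plan before copying makes it easy to confirm that the directory structure and file names are retained.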
These instructions are generic, as specifics may differ between different CKAN cloud storage extensions.
NOTE: If you plan to use the same bucket / container you have used in the past with ckanext-asset-storage, you may not need to migrate anything as long as you configure ckanext-asset-storage to point to the same location.
- Copy all files from your current storage location to the new location. This is similar in concept to the process described above for vanilla CKAN.
- If your old storage extension stores the full absolute URL of images in the DB, and the URL does not point to the /uploads/... path under your CKAN public URL, you will need to run a script that modifies the URLs stored in the DB to match the new URL pattern for assets
  - You can check if this is the case by running SELECT id, image_url FROM "group" on your DB and looking at the results, but take special care to ignore URLs that point to externally stored images, as these do not need to be migrated.
  - In most cases, stripping image_url to just the subdirectory/filename.ext form (removing any scheme, host, port and path prefix from the URL) will do the trick.
- If your old storage extension stores a relative path to the image of the form subdirectory/filename.ext, you do not need to modify your database. Things should "just work".
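The URL rewrite described above can be sketched as a small function that strips a known old base URL from image_url and leaves everything else untouched. The function name, the old_base_url parameter and the example URLs are assumptions for illustration; applying the result to the database is left to your own migration script.

```python
def to_relative_image_path(image_url, old_base_url):
    """Return 'subdirectory/filename.ext' for URLs under old_base_url.

    Values that do not start with old_base_url (external URLs, paths that are
    already relative, empty values) are returned unchanged.
    """
    base = old_base_url.rstrip("/") + "/"
    if not image_url or not image_url.startswith(base):
        return image_url
    return image_url[len(base):]


# Hypothetical base URL under which the old extension stored uploads
OLD_BASE = "https://old-storage.example.com/ckan/uploads"

migrated = to_relative_image_path(OLD_BASE + "/group/logo.png", OLD_BASE)
external = to_relative_image_path("https://example.org/pic.png", OLD_BASE)
```

Matching on an explicit base URL, rather than stripping every absolute URL, is what keeps externally stored images out of the migration.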
In high-traffic or mission-critical CKAN sites, there is a risk that new assets are uploaded to storage while migration is in progress. If this is a risk you cannot afford, read on:
- In many cases, group and organization images are not mission critical and you can afford to have some of your users visiting your CKAN instance while images are not displayed. If this is the case, it is recommended to switch your CKAN installation to ckanext-asset-storage before migrating the data. This will ensure new assets uploaded while the migration is in progress are saved to the new storage. Once migration is complete, group and organization images will re-appear and everything will be back to normal.
- If you want to make sure images are always displayed even during migration, you have a couple of options:
  - Lock your CKAN instance for changes to organizations and groups until migration is complete (TODO: how?)
  - or, aim for eventual consistency by running migration, switching to ckanext-asset-storage and then running migration again to ensure nothing has been left behind.
A: If you use a CKAN extension that stores resources in cloud storage (such as ckanext-blob-storage), and you already have a cloud container configured for storing assets, it should be very easy to reuse the same container to store assets if that is desired. Simply configure your storage backend to store assets under a path prefix (e.g. see the path_prefix config option for most cloud backends), and use a prefix that will never be used by your resource storage extension to store resources. For example, setting path_prefix to _assets will do the trick in most cases.
To install ckanext-asset-storage for development, do the following:
- Pull the project code from GitHub
git clone https://github.com/datopian/ckanext-asset-storage.git
cd ckanext-asset-storage
- Create a Python 2.7 virtual environment
virtualenv .venv27
source .venv27/bin/activate
Or a Python 3.x virtual environment:
python3 -m venv .venv3
source .venv3/bin/activate
(You can use pyenv to manage multiple versions of Python on your system)
- Run the following command to bootstrap the entire environment
make dev-start
This will pull and install CKAN and all its dependencies into your virtual environment, create all necessary configuration files, launch external services using Docker Compose and start the CKAN development server.
You can repeat the last command at any time to start developing again.
Type make help to get a list of user commands useful for managing the local environment.
- Do not touch *requirements.*.txt files directly. We use pip-tools and custom make targets to manage these files.
- Use make develop to install the right development time requirements into your current virtual environment
- Use make install to install the right runtime requirements into your current virtual environment
- To add requirements, edit requirements.in or dev-requirements.in and run make requirements. This will recompile the requirements file(s) for your current Python version. You may need to do this for the other Python version by switching to a different Python virtual environment before committing your changes.
This project manages requirements in a relatively complex way, in order to seamlessly support Python 2.7 and 3.x.
For this reason, you will see 4 requirements files in the project root:
- requirements.py2.txt - Python 2 runtime requirements
- requirements.py3.txt - Python 3 runtime requirements
- dev-requirements.py2.txt - Python 2 development requirements
- dev-requirements.py3.txt - Python 3 development requirements
These are generated using the pip-compile command (a part of pip-tools) from the corresponding requirements.in and dev-requirements.in files.
To understand why pip-compile is used, read the pip-tools manual. In short, this allows us to pin dependencies of dependencies, thus resolving potential deployment conflicts, without the headache of managing the specific version of each Nth-level dependency.
In order to support both Python 2.7 and 3.x, which tend to require slightly different dependencies, we use requirements.in files to generate major-version specific requirements files. These, in turn, should be used when installing the package.
In order to simplify things, the make targets specified above will automate the process for the current Python version.
Requirements are managed in .in files - these are the only files that should be edited directly.
Take care to specify a version for each requirement, to the level required to maintain future compatibility, but not to specify an exact version unless necessary.
For example, the following are good requirements.in lines:
pyjwt[crypto]==1.7.*
pyyaml==5.*
This allows these packages to be upgraded to a minor version, without the risk of breaking compatibility.
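To make the effect of these wildcard specifiers concrete, here is a simplified sketch of prefix matching for specifiers of the form ==X.Y.*. It handles only plain numeric versions and is not a replacement for the full PEP 440 rules that pip applies; the function name is hypothetical.

```python
def matches_prefix_spec(version, spec):
    """Check a plain numeric version against an '==<prefix>.*' specifier."""
    if not (spec.startswith("==") and spec.endswith(".*")):
        raise ValueError("only '==<prefix>.*' specifiers are handled here")
    prefix = spec[2:-2].split(".")
    parts = version.split(".")
    # Pad short versions with zeros, so that "5" is treated like "5.0"
    parts += ["0"] * (len(prefix) - len(parts))
    return parts[:len(prefix)] == prefix


allowed_patch_upgrade = matches_prefix_spec("1.7.1", "==1.7.*")
blocked_minor_upgrade = matches_prefix_spec("1.8.0", "==1.7.*")
allowed_minor_upgrade = matches_prefix_spec("5.4.1", "==5.*")
```

With ==1.7.* only patch releases within 1.7 are accepted, while ==5.* accepts any 5.x release.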
Developers wanting to add new requirements (runtime or development time) should take special care to update the requirements.txt files for all supported Python versions by running make requirements in a separate virtual environment for each version, after updating the relevant .in file.
You can delete *requirements.*.txt and run make requirements.
TODO: we can probably do this in a better way - create a make target for this.
To run the tests, do:
make test
To run the tests and produce a coverage report, first make sure you have coverage installed in your virtualenv (pip install coverage) then run:
make coverage
ckanext-asset-storage should be available on PyPI as https://pypi.org/project/ckanext-asset-storage. To publish a new version to PyPI follow these steps:
- Update the version number in the ckanext/asset_storage/__init__.py file. See PEP 440 for how to choose version numbers.
- Make sure you have the latest version of necessary packages:
pip install --upgrade setuptools wheel twine
- Create source and binary distributions of the new version:
python setup.py sdist bdist_wheel && twine check dist/*
Fix any errors you get.
- Upload the distributions to PyPI:
twine upload dist/*
- Commit any outstanding changes:
git commit -a
- Tag the new release of the project on GitHub with the version number from the setup.py file. For example, if the version number in setup.py is 0.0.1 then do:
git tag 0.0.1
git push --tags