The HDX Python Library is designed to enable you to easily develop code that interacts with the Humanitarian Data Exchange platform. The major goal of the library is to make pushing and pulling data from HDX as simple as possible for the end user.
For more about the purpose and design philosophy, please visit HDX Python Library.
The API documentation can be found here: http://ocha-dap.github.io/hdx-python-api/. The code for the library is here: https://github.com/ocha-dap/hdx-python-api.
Please note that the library only works on Python 3.
The first task is to create an API key file. By default this is assumed to be called `.hdxkey` and to be located in the current user's home directory (`~`). Assuming you are using a desktop browser, the API key is obtained as follows:
- Browse to the HDX website
- Click LOG IN in the top right of the web page if not logged in, and log in
- Click your username in the top right of the web page and select PROFILE from the drop-down menu
- Scroll down to the bottom of the profile page
- Copy the API key, which will be of the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
- Paste the API key into a text file
- Save the text file with filename `.hdxkey` in the current user's home directory
To include the HDX Python library in your project, pip install the line below or add it to your `requirements.txt` file:

```
git+git://github.com/ocha-dap/hdx-python-api.git#egg=hdx-python-api
```

If you get errors, it is probably because dependencies of the cryptography package are missing, eg. on Ubuntu: python-dev, libffi-dev and libssl-dev. See cryptography dependencies.
Let's start with a simple example that also ensures that the library is working properly. This assumes you are using Linux, but you can do something similar on Windows:

- Create the API key if you haven't already. Look it up on the HDX website as mentioned above, then put it into a file in your home directory:

```
cd ~
echo xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx > .hdxkey
```

- Install virtualenv if not installed:

```
sudo apt-get install virtualenv
```

- Create a Python 3 virtualenv and activate it:

```
virtualenv -p python3 test
source test/bin/activate
```

- Install the HDX Python library:

```
pip install git+git://github.com/ocha-dap/hdx-python-api.git#egg=hdx-python-api
```

- If you get errors, it is probably the dependencies of the cryptography package that are missing (see above)

- Launch python:

```
python
```

- Import the required classes:

```python
from hdx.configuration import Configuration
from hdx.data.dataset import Dataset
```

- Use the configuration defaults and the test HDX site:

```python
configuration = Configuration(hdx_site='test', project_config_dict={})
```

- Read the dataset ACLED Conflict Data for Africa (Realtime - 2016) from HDX and view the date of the dataset:

```python
dataset = Dataset.read_from_hdx(configuration, 'acled-conflict-data-for-africa-realtime-2016')
print(dataset['dataset_date'])
```

- As a test, change the dataset date:

```python
dataset['dataset_date'] = '07/26/2015'
print(dataset['dataset_date'])
dataset.update_in_hdx()
```

- You can view it on HDX before changing it back:

```python
dataset['dataset_date'] = '06/25/2016'
dataset.update_in_hdx()
```

- Exit Python and remove the virtualenv:

```
exit()
deactivate
rm -rf test
```
The easiest way to get started is to use the facades and configuration defaults. The facades set up both logging and HDX configuration.
The default configuration loads an internal HDX configuration located within the library, and assumes that there is an API key file called `.hdxkey` in the current user's home directory (`~`) and a YAML project configuration located relative to your working directory at `config/project_configuration.yml`, which you must create. The project configuration is used for any configuration specific to your project.
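For illustration, such a project configuration might look like the fragment below. The keys shown are hypothetical — the file simply holds whatever settings your own code reads from the configuration:

```yaml
# Example config/project_configuration.yml - keys are specific to your project
base_url: "http://MY_DATA_SOURCE/api/"
countries:
    - SDN
    - KEN
```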
The default logging configuration reads a configuration file internal to the library that sets up a coloured console handler outputting at DEBUG level and a file handler writing to errors.log at ERROR level.
You will most likely just need the simple facade. If you are in the HDX team, you may need to use the ScraperWiki facade, which reports status to that platform (in which case replace `simple` with `scraperwiki` in the code below):
```python
from hdx.facades.simple import facade

def main(configuration):
    ***YOUR CODE HERE***

if __name__ == '__main__':
    facade(main)
```
The configuration is passed to your main function in the `configuration` argument above.
It is possible to pass configuration parameters in the facade call, eg.

```python
facade(main, hdx_site=HDX_SITE_TO_USE, hdx_key_file=LOCATION_OF_HDX_KEY_FILE,
       hdx_config_yaml=PATH_TO_HDX_YAML_CONFIGURATION,
       project_config_dict={'MY_PARAMETER': 'MY_VALUE'})
```
If you do not need a project configuration, you can simply provide an empty dictionary, eg.

```python
facade(main, project_config_dict={})
```
If you do not use the facade, you can use the `Configuration` class directly, passing in appropriate keyword arguments, ie.

```python
from hdx.configuration import Configuration
...
cfg = Configuration(KEYWORD ARGUMENTS)
```

The keyword arguments can be:
| Choose | Argument | Type | Value | Default |
|---|---|---|---|---|
| | hdx_site | Optional[str] | HDX site to use eg. prod, test | test |
| | hdx_key_file | Optional[str] | Path to HDX key file | ~/.hdxkey |
| One of: | hdx_config_dict | dict | HDX configuration dictionary | |
| | hdx_config_json | str | Path to JSON HDX configuration | |
| | hdx_config_yaml | str | Path to YAML HDX configuration | Library's internal hdx_configuration.yml |
| One of: | project_config_dict | dict | Project configuration dictionary | |
| | project_config_json | str | Path to JSON Project configuration | |
| | project_config_yaml | str | Path to YAML Project configuration | config/project_configuration.yml |
If you wish to change the logging configuration from the defaults, you will need to call `setup_logging` with arguments, unless you have used the simple or ScraperWiki facades, in which case you must update the `hdx.facades` module variable `logging_kwargs` before importing the facade.
If not using the facade:

```python
import logging
from hdx.logging import setup_logging
...
logger = logging.getLogger(__name__)
setup_logging(KEYWORD ARGUMENTS)
```

If using the facade:

```python
from hdx.facades import logging_kwargs

logging_kwargs.update(DICTIONARY OF KEYWORD ARGUMENTS)
from hdx.facades.simple import facade
```
The keyword arguments can be:

| Choose | Argument | Type | Value | Default |
|---|---|---|---|---|
| One of: | logging_config_dict | dict | Logging configuration dictionary | |
| | logging_config_json | str | Path to JSON Logging configuration | |
| | logging_config_yaml | str | Path to YAML Logging configuration | Library's internal logging_configuration.yml |
| One of: (if using defaults) | smtp_config_dict | dict | Email Logging configuration dictionary | |
| | smtp_config_json | str | Path to JSON Email Logging configuration | |
| | smtp_config_yaml | str | Path to YAML Email Logging configuration | |
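If you supply your own logging configuration, it follows Python's standard `logging.config.dictConfig` schema. A minimal sketch of a YAML file that could be passed as `logging_config_yaml` (illustrative only, not the library's internal default):

```yaml
version: 1
formatters:
    simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
handlers:
    console:
        class: logging.StreamHandler
        level: DEBUG
        formatter: simple
root:
    level: DEBUG
    handlers: [console]
```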
Do not supply `smtp_config_dict`, `smtp_config_json` or `smtp_config_yaml` unless you are using the default logging configuration!

If you are using the default logging configuration, you have the option of a default SMTP handler that sends an email in the event of a CRITICAL error, by supplying either `smtp_config_dict`, `smtp_config_json` or `smtp_config_yaml`. Here is a template of a YAML file that can be passed as the `smtp_config_yaml` parameter:
```yaml
handlers:
    error_mail_handler:
        toaddrs: EMAIL_ADDRESSES
        subject: "RUN FAILED: MY_PROJECT_NAME"
```
Unless you override it, the mail server `mailhost` for the default SMTP handler is `localhost` and the from address `fromaddr` is `noreply@localhost`.
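To override those defaults, the same YAML file can set them explicitly. The keys mirror the parameters of Python's `logging.handlers.SMTPHandler`; the server and from address below are placeholders:

```yaml
handlers:
    error_mail_handler:
        mailhost: "smtp.example.com"      # placeholder mail server
        fromaddr: "noreply@example.com"   # placeholder from address
        toaddrs: EMAIL_ADDRESSES
        subject: "RUN FAILED: MY_PROJECT_NAME"
```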
To use logging in your files, simply add the lines below to the top of each Python file:

```python
import logging

logger = logging.getLogger(__name__)
```
Then use the logger like this:

```python
logger.debug('DEBUG message')
logger.info('INFORMATION message')
logger.warning('WARNING message')
logger.error('ERROR message')
logger.critical('CRITICAL error message')
```
You can read an existing HDX object with the static `read_from_hdx` method, which takes a configuration and an identifier parameter and returns an object of the appropriate HDX type, eg. `Dataset`, or `None` depending upon whether the object was read, eg.

```python
dataset = Dataset.read_from_hdx(configuration, 'DATASET_ID_OR_NAME')
```
You can create an HDX object, such as a dataset, resource or gallery item, by calling the constructor with a configuration, which is required, and an optional dictionary containing metadata. For example:

```python
from hdx.data.dataset import Dataset

dataset = Dataset(configuration, {
    'name': slugified_name,
    'title': title,
    'dataset_date': dataset_date,  # has to be MM/DD/YYYY
    'groups': iso
})
```
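As noted in the comment, `dataset_date` must be supplied in MM/DD/YYYY form. A small sketch of producing that string from a Python `date` (the helper name `hdx_date` is our own, not part of the library):

```python
from datetime import date

def hdx_date(d):
    # strftime's %m/%d/%Y yields the zero-padded MM/DD/YYYY string HDX expects
    return d.strftime('%m/%d/%Y')

print(hdx_date(date(2016, 6, 25)))  # 06/25/2016
```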
The dataset name should not contain special characters, and hence if there is any chance of that, it needs to be slugified. Slugifying is a way of making a string valid within a URL (eg. `ae` replaces `ä`). There are various packages that can do this, eg. awesome-slugify.
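To illustrate the idea, here is a minimal slugify sketch in pure Python. Note that dedicated packages such as awesome-slugify also apply language-aware transliterations (eg. `ä` → `ae`), which this sketch simplifies to plain ASCII folding (`ä` → `a`):

```python
import re
import unicodedata

def slugify(text):
    # Fold accented characters to their closest ASCII equivalents (ä -> a)
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Lowercase, then replace runs of non-alphanumeric characters with hyphens
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

print(slugify('ACLED Conflict Data for Africa'))  # acled-conflict-data-for-africa
```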
You can add metadata using the standard Python dictionary square brackets, eg.

```python
dataset['name'] = 'My Dataset'
```

You can also do so with the standard dictionary `update` method, which takes a dictionary, eg.

```python
dataset.update({'name': 'My Dataset'})
```
Larger amounts of static metadata are best added from files. YAML is very human readable and recommended, while JSON is also accepted, eg.

```python
dataset.update_yaml([path])
dataset.update_json([path])
```

The default path if unspecified is `config/hdx_TYPE_static.yml` for YAML and `config/hdx_TYPE_static.json` for JSON, where TYPE is an HDX object's type, like dataset or resource, eg. `config/hdx_galleryitem_static.json`. The YAML file takes the following form:
```yaml
owner_org: "acled"
maintainer: "acled"
...
tags:
    - name: "conflict"
    - name: "political violence"
gallery:
    - title: "Dynamic Map: Political Conflict in Africa"
      type: "visualization"
      description: "The dynamic maps below have been drawn from ACLED Version 6."
      ...
```
Notice how you can define a gallery with one or more gallery items (each starting with a dash '-') within the file as shown above. You can do the same for resources.
You can check if all the fields required by HDX are populated by calling `check_required_fields` with an optional list of fields to ignore. This will throw an exception if any fields are missing. Before the library posts data to HDX, it will call this method automatically. An example usage:

```python
resource.check_required_fields(['package_id'])
```
Once the HDX object is ready, ie. it has all the required metadata, you simply call `create_in_hdx`, eg.

```python
dataset.create_in_hdx()
```

You can delete HDX objects using `delete_from_hdx`, and update an object that already exists in HDX with the method `update_in_hdx`. These do not take any parameters or return anything, and throw exceptions for failures like the object to delete or update not existing.
A dataset can have resources and a gallery.
If you wish to add resources or a gallery, you can supply a list and call the appropriate `add_update_*` function, for example:

```python
resources = [{
    'name': xlsx_resourcename,
    'format': 'xlsx',
    'url': xlsx_url
}, {
    'name': csv_resourcename,
    'format': 'zipped csv',
    'url': csv_url
}]
for resource in resources:
    resource['description'] = resource['url'].rsplit('/', 1)[-1]
dataset.add_update_resources(resources)
```
Calling `add_update_resources` creates a list of HDX Resource objects in the dataset, and operations can be performed on those objects.

To see the list of resources or gallery items, use the appropriate `get_*` function, eg.

```python
resources = dataset.get_resources()
```

If you wish to add one resource or gallery item, you can supply a dictionary or an object of the correct type and call the appropriate `add_update_*` function, for example:

```python
dataset.add_update_resource(resource)
```

You can delete a Resource or GalleryItem object from the dataset using the appropriate `delete_*` function, for example:

```python
dataset.delete_galleryitem('GALLERYITEM_TITLE')
```
Here we will create a working example from scratch.
First, pip install the library, or alternatively add it to a `requirements.txt` file as described above.
Next, create a file called `run.py` and copy into it the code below.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Calls a function that generates a dataset and creates it in HDX.
'''
import logging
from hdx.facades.scraperwiki import facade
from my_code import generate_dataset

logger = logging.getLogger(__name__)

def main(configuration: dict):
    '''Generate dataset and create it in HDX'''
    dataset = generate_dataset(configuration)
    dataset.create_in_hdx()

if __name__ == '__main__':
    facade(main, hdx_site='test')
```
The above file will create in HDX a dataset generated by a function called `generate_dataset` that can be found in the file `my_code.py`, which we will now write.
Create a file `my_code.py` and copy into it the code below:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Generate a dataset
'''
import logging
from hdx.data.dataset import Dataset

logger = logging.getLogger(__name__)

def generate_dataset(configuration):
    '''Create a dataset
    '''
    logger.debug('Generating dataset!')
```
You can then fill out the function `generate_dataset` as required.
A complete example can be found here: https://github.com/mcarans/hdxscraper-acled-africa. In particular, take a look at the files `run.py`, `acled_africa.py` and the `config` folder.