This prototype collects data from several APIs, transforms it, and stores it in a data warehouse. The gathered data can then be requested through an API.
To set up and run the project, first install Docker and Docker Compose.
Executing the docker-compose file sets up the PostgreSQL database in one container (including creating the necessary tables), sets up the Python environment in another container, and starts the application.
Clone the project,
git clone https://github.com/yemboo2/gordias.git
change to the gordias folder and run docker-compose.
cd gordias && sudo docker-compose up
Verify the setup:
curl --data "contacts={\"contacts\" : [{\"first_name\" : \"Markus\",\"last_name\" : \"Ehringer\",\"organization\" : \"Technical University of Munich\"}]}" http://localhost:8080/contacts --data "age=1"
The API has two endpoints:
- contact: Enriches a single contact (POST-request). Parameters are first name, last name and organization.
- contacts: Enriches multiple contacts (POST-request). Parameter is a string in JSON format containing a list of basic contact fields; for each contact to be enriched we need first name, last name and organization.
First name, last name and organization must be provided in every case. Both endpoints return a JSON object with the contact fields found.
An additional parameter age can be added to each request to define how old the enriched data may be. If a contact's last synchronization time lies further back than the time of the request minus age, the data is synchronized with the sources again. The value of this parameter is specified in hours.
Note 1: Make sure to use HTML encoding for the data sent to the API.
Note 2: After startup it can take a while before the first request is processed.
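The curl request above can also be built programmatically. The following sketch shows how the contacts payload from the example is assembled in Python; the endpoint URL and field names are taken from the examples above, and the actual request line is commented out so the snippet does not require a running instance:

```python
import json

# Basic contact fields required by both endpoints.
contact = {
    "first_name": "Markus",
    "last_name": "Ehringer",
    "organization": "Technical University of Munich",
}

# The contacts endpoint expects a JSON-formatted string containing a list
# of contacts, plus the optional age parameter (in hours).
payload = {
    "contacts": json.dumps({"contacts": [contact]}),
    "age": 1,
}

# To send it against a live instance (requires the requests library):
# import requests
# resp = requests.post("http://localhost:8080/contacts", data=payload)
# print(resp.json())
```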
- Create a new folder in the sources directory.
- Create a new python file in that folder.
- Create a class which inherits from the class defined in the source_class.py file and implement the get_data() function.
- Add a new entry to the sources_config.json file.
- name: Simply the name of the source (only letters).
- path: Path to the file that contains the class named in class_name.
- class_name: Name of the class that inherits from the abstract source class.
- Add a mapping configuration file mapping.json.
- If needed add an additional file map_functions.py with source-specific map functions.
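An entry in sources_config.json could look like the sketch below. The field names come from the list above; the source name, file path, and class name are hypothetical placeholders, and the exact surrounding structure of the file (e.g. whether entries live in a list) should be checked against the existing sources_config.json:

```json
{
    "name": "examplesource",
    "path": "sources/examplesource/examplesource.py",
    "class_name": "ExampleSource"
}
```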
Importing the abstract class from the folder above is a bit tricky. If you have trouble, check out an existing source (e.g. Twitter (/tw)). If you plan to override the constructor of the abstract class in the subclass, make sure to pass first_name, last_name and organization and set them as class variables (see source_class.py).
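A minimal sketch of a new source class is shown below. The name of the abstract class and its exact constructor are assumptions here (check source_class.py and an existing source for the real definitions); a stand-in base class is included only to make the sketch self-contained:

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Stand-in for the abstract class defined in source_class.py
    (the real name and signature may differ)."""

    def __init__(self, first_name, last_name, organization):
        # The subclass must receive and store these three fields.
        self.first_name = first_name
        self.last_name = last_name
        self.organization = organization

    @abstractmethod
    def get_data(self):
        """Return the contact fields found in this source."""


class ExampleSource(Source):
    """Hypothetical source; replace the body with real fetching logic."""

    def get_data(self):
        # In a real source, query the external API here and map the
        # response into contact fields.
        return {
            "first_name": self.first_name,
            "last_name": self.last_name,
            "organization": self.organization,
        }
```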
The class_name file has to contain the logic to fetch data from the actual source. When an API returns multiple users, matching the right one based on just the name and organization can be difficult. utils.py contains some functions to support the matching process. If in doubt whether you matched the correct person, there are two options: take the contact and risk storing wrong data, or take no contact and risk losing correct data. So far we have stuck to the latter option, and all added sources have to stick to that decision.
After adding a new source to the project locally and testing it extensively, one can create a pull request for the source to be added. If it meets all requirements, it will be merged.
- Markus Ehringer - Initial work - yemboo2
Special thanks to my supervisor Patrick Holl for his help and inspiring ideas.