ETL based on Django model introspection
This reusable Django app provides classes for including lightweight ETL in your project.
The package is not geared toward speed but toward sync'ing with upstream data sources (e.g. for an API).
However, subclassing allows for replacing methods with speedier, simplified or more sophisticated versions.
The package supports data persistence, consistency, normalization, and recreating relationships from flat files.
The app derives most ETL rules from model introspection. These rules can be easily modified and overridden.
The project can be easily used within a parallelization framework such as Celery; thorough checks of the state of the destination avoid race conditions and inconsistencies (at the cost of speed).
The package is in active development toward a release. For evaluation, contribution, and testing, install with:
pip install -e git+ssh://git@github.com/postfalk/django-etl-sync#egg=django-etl-sync
Add etl_sync to INSTALLED_APPS in settings.py.
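For example (the apps other than etl_sync are illustrative, not required by this package):

```python
# settings.py (excerpt); only the etl_sync entry is specific to this package
INSTALLED_APPS = [
    'django.contrib.contenttypes',
    'django.contrib.auth',
    'etl_sync',
]
```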
The app provides two principal ways of usage, on either the file or the record level.

- Use the Mapper class to specify all ETL operations. If you need to change the data between reading it from the file and writing it to the database, create a custom Transformer class (see below).
- Use the Generator class. This class maps dictionary keys to a Django model and returns an instance. In this case, custom transformations need to be performed before using the Generator class. This can be done in any way you like; of course, you can use the Transformer class for this task.
# data.txt
record name
1 one
2 two
3 three
# models.py
from django.db import models

class TestModel(models.Model):
    """
    Example model.
    """
    record = models.CharField(max_length=10)
    name = models.CharField(max_length=10, null=True, blank=True)
# <yourscript>.py
from etl_sync.mappers import Mapper
from <yourproject>.models import TestModel

class YourMapper(Mapper):
    """
    Add your specific settings here.
    """
    filename = 'data.txt'
    model_class = TestModel

if __name__ == '__main__':
    mapper = YourMapper()
    res = mapper.load()
# <yourscript>.py
from etl_sync.generators import BaseInstanceGenerator
from <yourproject>.models import TestModel

dic = {'record': 3, 'name': 'three'}

if __name__ == '__main__':
    # add additional transformations here
    generator = BaseInstanceGenerator(TestModel, dic)
    instance = generator.get_instance()
    print(instance, generator.res)
Unique fields
Before loading a record, it might be necessary to check whether it already exists and whether it needs to be added or updated (persistence). By default, the module inspects the target model and uses model fields with the attribute unique=True as the criteria for persistence. The module first checks whether a record with the given combination of values in its unique fields already exists and, if so, updates that record.
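Conceptually, this persistence check is an update-or-create keyed on the unique fields. The sketch below illustrates the idea with plain dictionaries standing in for model instances; it is a simplified illustration, not the package's actual implementation.

```python
def update_or_create(table, dic, unique_fields):
    """Match on the unique fields, then update the existing
    row or append a new one (simplified illustration)."""
    key = {f: dic[f] for f in unique_fields}
    for row in table:
        if all(row[f] == key[f] for f in unique_fields):
            row.update(dic)  # record exists: update it in place
            return row, False
    table.append(dict(dic))  # no match: create a new record
    return table[-1], True

table = [{'record': '1', 'name': 'one'}]
# same value in the unique field 'record' -> update, not duplicate
row, created = update_or_create(table, {'record': '1', 'name': 'uno'}, ['record'])
```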
WARNING: Do not use the model's internal pk or id field as the identifier for your data! Add an extra record or remote_id field.
Extra arguments
Another method of adding (or overriding) persistence criteria is to pass a list of fields via keyword argument:
generator = InstanceGenerator(TestModel, dic, persistence=['record', 'source'])
Subclassing
You can also subclass InstanceGenerator to create your own generator class:
from etl_sync.generators import InstanceGenerator

class MyGenerator(InstanceGenerator):
    """
    My generator class with custom persistence criterion.
    """
    persistence = ['record', 'source']
etl_persistence key in data dictionary
The last method is to put an extra key-value pair in your data dictionary:
dic = {'record': 6365, 'name': 'john', 'occupation': 'developer', 'etl_persistence': ['record']}
This technique is useful for nested records if the recursive call of InstanceGenerator cannot be accessed directly (see below). However, defining persistence by field attributes and a concise data model is the preferred method. Once the persistence variable is overwritten, the model field attributes will be ignored. Nevertheless, conflicts with your data definition will throw database errors.
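The etl_persistence key is metadata rather than a model field; conceptually it is separated from the actual field values before they reach the model, along these lines (a sketch, not the package's internal code):

```python
dic = {'record': 6365, 'name': 'john', 'occupation': 'developer',
       'etl_persistence': ['record']}

# Separate the persistence criterion from the actual field values,
# so only real model fields remain in the dictionary.
persistence = dic.pop('etl_persistence')
```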
By default, django-etl-sync uses csv.DictReader. Other reader classes can be used or created, as long as they are similar to csv.DictReader.
The package currently contains a reader for ESRI Shapefiles.
from etl_sync.mappers import Mapper
from etl_sync.readers import ShapefileReader

class MyMapper(Mapper):
    reader_class = ShapefileReader
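To illustrate what a csv.DictReader-style reader yields: for a file such as data.txt above (assuming tab-delimited columns with a header row), it produces one dictionary per data row.

```python
import csv
import io

# Stand-in for data.txt; assumes tab-delimited columns with a header row.
content = "record\tname\n1\tone\n2\ttwo\n3\tthree\n"
rows = list(csv.DictReader(io.StringIO(content), delimiter='\t'))
# each row is a dictionary keyed by the header fields
```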
Transformations remap the dictionary from the CSV reader or another reader class to the Django model. We attempt to map the dictionary key to the model field with the matching name. The transformer classes allow for remapping and validation of incoming records.
Instantiate InstanceGenerator with a customized Transformer class:
from etl_sync.mappers import Mapper
from etl_sync.transformers import Transformer

class MyTransformer(Transformer):
    mappings = {'id': 'record', 'name': 'last_name'}
    defaults = {'last_name': 'Doe'}
    forms = []
    blacklist = {'last_name': ['NA', r'unknown']}

class MyMapper(Mapper):
    model_class = {destination model}
    transformer_class = MyTransformer
    filename = 'myfile.txt'

mapper = MyMapper()
mapper.load()
- The mappings property contains a dictionary in the form {'original_fieldname': 'new_fieldname'} which remaps the dictionary keys.
- The defaults property holds a dictionary that gets applied if the value for the dictionary key in question is empty.
- The forms property holds a list of Django forms that get applied to the dictionary. WARNING: old values will not be removed; the cleaned_data keys will be added to the dictionary.
- Finally, the blacklist property holds a list of values for a particular key that will trigger a validation error. The record will be discarded.
WARNING: These methods will be applied in exactly this order. If the dictionary changes in one of these steps, the next step needs to take those changes into account.
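The order of the built-in steps can be sketched as follows. This is a simplified illustration, not the package's code: Django forms are omitted to keep it self-contained, and blacklist matching is reduced to exact values (the package also accepts regular expressions such as r'unknown').

```python
def apply_transformations(dic, mappings, defaults, blacklist):
    """Simplified sketch of the transformation order:
    remap keys, fill empty values, then reject blacklisted values."""
    # 1. remap {'original_fieldname': 'new_fieldname'}
    dic = {mappings.get(key, key): value for key, value in dic.items()}
    # 2. defaults apply only where the value is empty
    for field, value in defaults.items():
        if not dic.get(field):
            dic[field] = value
    # 3. a blacklisted value rejects the whole record
    for field, bad_values in blacklist.items():
        if dic.get(field) in bad_values:
            raise ValueError('record discarded: %s=%r' % (field, dic[field]))
    return dic

result = apply_transformations(
    {'id': '7', 'name': ''},
    mappings={'id': 'record', 'name': 'last_name'},
    defaults={'last_name': 'Doe'},
    blacklist={'last_name': ['NA', 'unknown']},
)
# the empty 'name' value was remapped to 'last_name' and then defaulted
```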
In addition to these built-in transformations, there are two additional methods that can be modified for more thorough changes:
from django.core.exceptions import ValidationError
from etl_sync.transformers import Transformer

class MyTransformer(Transformer):

    def transform(self, dic):
        """Make whatever changes are needed here."""
        return dic

    def validate(self, dic):
        """Raise ValidationErrors here."""
        if dic['last_name'] == 'Bunny':
            raise ValidationError('I do not want to have this record')
Both methods will be applied after the aforementioned built-in methods.
Django-etl-sync fully supports Django forms. You can reuse the Django forms from your project for bulk loading data. See the section "Transformations".
Django-etl-sync will create a log file in the same location as the source file. It will contain the list of rejected records.
source_file.txt
source_file.txt.2014-07-23.log
- Create readers for more source types, especially for comma-delimited data and headerless CSV.
- Add a way to remove data that has been deleted from the source.