ERSATZ
======

A tool for quickly loading CSVs into PostgreSQL databases.

PostgreSQL's COPY command is fine for bulk-loading CSVs into a table, provided those CSVs are already formatted to exactly match the table's columns and data types. It even lets you load a subset of the table's columns from the CSV. Unfortunately, it doesn't let you:

* Load a subset of the CSV's columns
* Transform columns from the CSV before insertion
* Combine columns
* Insert directly into partition tables
* Define default column values
* Load multiple tables from different subsets of a CSV (and define a sequential key for linking those tables later)

An older tool, [pgloader](http://pgfoundry.org/projects/pgloader), will let you do some of these things, but it is somewhat limited and seemed slow in my experience. Ersatz is an attempt to provide all of the functionality above while keeping copy rates fast. It depends on [psycopg2](http://www.initd.org/psycopg/) for its Postgres interface.
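
For context, the fast path Ersatz builds on is psycopg2's COPY support. Here is a minimal sketch of a plain bulk load, with the table and file names invented for illustration:

```python
import psycopg2

conn = psycopg2.connect(dbname='mydb', user='loader')
cur = conn.cursor()

# COPY can already load a subset of the *table's* columns...
with open('people.csv') as f:
    cur.copy_expert(
        "COPY people (email, full_name) FROM STDIN WITH CSV HEADER", f)

# ...but everything in the list above (subsetting or transforming the
# *CSV's* columns, partitions, defaults, multi-table loads) is on you.
conn.commit()
```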

An example of the (admittedly arcane) configuration is included: look in the examples directory for test_part.py. import.py is a script that runs ersatz against that configuration. A few notes on the config:
* Most of the configuration lives in dictionaries, which lets you define default values and then update copies of the default dictionary with table-specific settings.
* univ_settings.py contains the basics: database connection values (user, database name, password). It assumes the existence of a password file containing a passwd value; this is easy to change to an environment variable or whatever password-storage scheme you prefer.
* test_part.py contains the table definitions.
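
As a rough sketch, univ_settings.py might look something like the following. The variable names and the file-based password scheme here are guesses for illustration; the actual module in examples is authoritative:

```python
# univ_settings.py -- illustrative only; names are assumptions, not the
# actual module contents.

# One plausible reading of the password file described above.
with open('passwd_file') as f:
    passwd = f.read().strip()

# Connection values eventually consumed by psycopg2.connect()
db_settings = {
    'user': 'loader',
    'dbname': 'mydb',
    'password': passwd,
}
```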

Table-Specific Configuration Values
-----------------------------------

* dict_reader: whether to reference columns by header name, i.e. whether to use csv.DictReader or csv.reader
* skip_head_lines: how many header lines to skip. Ignored if dict_reader is True
* udcs: a dictionary mapping column names to default column values
* table: the table name
* partitions: an ordered dict of column names to the values to partition on
* partition_table_pattern: the Python format string used to build partition table names
* filename: the CSV to import from
* field_sep: the CSV field separator
* columns: a dict mapping column names (or, in one special case, tuples of column names) to one of the following:
  * an integer: a column index in the CSV (only available if dict_reader is False)
  * a string: a column name in the CSV (only available if dict_reader is True)
  * a dict with the single key 'key', whose value names the externally defined sequential key to use for this field
  * a dict with the following keys (here the column dict key can be a single column name or a tuple of column names, depending on how many values the function returns):
    * 'function': a function returning a tuple of values
    * 'columns': a tuple of integers (dict_reader False) or strings (dict_reader True) indicating which columns of the CSV to pass to the function
    * 'defaults': a dict mapping parameter names to default values to pass to the function
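
Putting these keys together, a hypothetical table definition might look like the sketch below. Every name and value is invented for illustration (the partition pattern syntax in particular is a guess); the authoritative example is examples/test_part.py:

```python
from collections import OrderedDict

def split_name(full_name):
    # Toy transform: turn one CSV column into (first, last).
    first, _, last = full_name.partition(' ')
    return (first, last)

people = {
    'table': 'people',
    'filename': 'people.csv',
    'field_sep': ',',
    'dict_reader': True,            # reference CSV columns by header name
    'skip_head_lines': 0,           # ignored since dict_reader is True
    'udcs': {'source': 'import'},   # constant default for the source column
    'partitions': OrderedDict([('state', ('ny', 'ca'))]),
    'partition_table_pattern': 'people_%(state)s',  # pattern syntax is a guess
    'columns': {
        'email': 'email_address',            # plain header-name mapping
        'person_id': {'key': 'person_key'},  # externally defined sequential key
        ('first_name', 'last_name'): {       # tuple key: function returns 2 values
            'function': split_name,
            'columns': ('full_name',),
        },
    },
}
```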

Universal Configuration Values
------------------------------

* tables: a dict of table names to table definition dicts
* parallel_load: a tuple of dicts, each with keys:
  * 'tables': a tuple of table names to load in parallel
  * 'keys': a dict of key names to key source names
* key_sources: a dict of key source names to initial values
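
Continuing the hypothetical example above (again, purely illustrative), a universal config loading two tables from different subsets of the same CSV might read:

```python
addresses = {
    'table': 'addresses',
    'filename': 'people.csv',        # a second table fed from the same CSV
    'field_sep': ',',
    'dict_reader': True,
    'columns': {
        'person_id': {'key': 'person_key'},  # same key links the two tables
        'street': 'street_address',
    },
}

tables = {'people': people, 'addresses': addresses}

parallel_load = (
    {
        'tables': ('people', 'addresses'),     # loaded together in parallel
        'keys': {'person_key': 'person_seq'},  # key name -> key source name
    },
)

key_sources = {'person_seq': 1}   # the shared sequential key starts at 1
```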

Please see http://neworganizing.com/content/blog/ersatz1 and http://neworganizing.com/content/blog/ersatz2 for more detailed discussion.

About
-----

A version of pgloader stripped down to the components I found essential. It sacrifices a lot of flexibility, but throws out some overhead.
