ERSATZ
A tool for quickly loading CSVs into PostgreSQL databases

PostgreSQL's COPY command is just fine for bulk-loading CSVs into a table when those CSVs are already formatted to match the table's columns and data types exactly. It even lets you load a subset of the table's columns from the CSV.
Unfortunately, it doesn't let you:
Load a subset of the CSV's columns
Transform columns from the CSV before insertion
Combine columns
Insert directly into partition tables
Define default column values
Load multiple tables from different subsets of a CSV (and define a sequential key for linking those tables later)
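
For comparison, here is the plain-COPY baseline that ersatz builds on, sketched with psycopg2 (the database, table, file, and column names below are invented for illustration):

    # Baseline: plain COPY via psycopg2. This only works when the csv columns
    # line up exactly with the listed table columns -- no per-column transforms,
    # combined columns, defaults, or partition routing along the way.
    import psycopg2

    conn = psycopg2.connect(dbname='mydb', user='me')
    with conn, conn.cursor() as cur, open('people.csv') as f:
        cur.copy_expert(
            "COPY people (first_name, last_name, email) FROM STDIN WITH CSV HEADER",
            f,
        )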

An older module, pgloader (http://pgfoundry.org/projects/pgloader), will let you do some of these things, but it is fairly limited and seemed rather slow.
Ersatz is an attempt to provide all of the functionality above while achieving fast copy rates. It depends on psycopg2 (http://www.initd.org/psycopg/) for its Postgres interface.

An example of the (admittedly arcane) configuration is included in the examples directory: test_part.py is the configuration file, and import.py is a script that runs ersatz pointed at that configuration.
Some notes on the configuration:
Most of the configuration lives in dictionaries, which lets you define default values and then update copies of the default dictionary with table-specific settings.
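For instance, a common shape for that pattern (the names here are illustrative, not taken from the example files):

    # Define shared defaults once, then copy and override them per table.
    table_defaults = {
        'dict_reader': True,
        'skip_head_lines': 0,
        'field_sep': ',',
        'udcs': {},
    }

    people_table = dict(table_defaults)   # shallow copy of the defaults
    people_table.update({
        'table': 'people',
        'filename': 'people.csv',
    })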
univ_settings.py contains the basics: database connection values (user, database name, password). It assumes the existence of a password file that provides a passwd value; this can easily be changed to an environment variable or whatever password-storage scheme you prefer.
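A sketch of the kind of thing that file holds (the variable names are a guess for illustration, not the module's actual contents):

    # univ_settings.py-style sketch: connection values plus a password read
    # from a separate file. Swap in an environment variable if you prefer.
    with open('passwd') as f:
        passwd = f.read().strip()

    db_settings = {
        'user': 'loader',
        'dbname': 'warehouse',
        'password': passwd,
    }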
test_part.py contains the table definitions.
Table-specific config values (a combined example follows this list):
dict_reader controls whether columns are referenced by header name (csv.DictReader) or by position (csv.reader)
skip_head_lines is the number of header lines to skip; ignored if dict_reader is True
udcs is a dictionary with column names as keys and default column values as values
table is the table name
partitions is an ordered dict of column names to the values to partition on
partition_table_pattern is the Python format string used to build partition table names
filename is the csv to import from
field_sep is the csv field separator
columns is a dict whose keys are column names (or, in one special case, tuples of column names) and whose values are one of:
    an integer: a column index in the csv (only available if dict_reader is False)
    a string: a column name in the csv (only available if dict_reader is True)
    a dict with the single key 'key', whose value names the externally defined sequential key to use for this field
    a dict with keys 'function' (a function returning a tuple of values), 'columns' (a tuple of integers if dict_reader is False, or strings if dict_reader is True, indicating which csv columns to pass to the function), and 'defaults' (a dict of parameter names and default values to pass to the function). In this case the key in the columns dict can be a single column name or a tuple of column names, depending on how many values are in the function's return tuple.
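
Putting those together, a hypothetical table definition might look roughly like this (the table, column, and function names, the csv layout, and the partition pattern are all invented for illustration, not taken from the example files):

    from collections import OrderedDict

    def split_name(full_name, default_last='unknown'):
        # Return a (first, last) tuple parsed from a 'First Last' string.
        first, _, last = full_name.partition(' ')
        return (first, last or default_last)

    people_table = {
        'table': 'people',
        'filename': 'people.csv',
        'field_sep': ',',
        'dict_reader': True,                      # reference csv columns by header name
        'skip_head_lines': 0,                     # ignored when dict_reader is True
        'udcs': {'source': 'voter_file'},         # default value applied to every row
        'partitions': OrderedDict([('state', ('PA', 'OH'))]),
        'partition_table_pattern': 'people_%(state)s',
        'columns': {
            'email': 'EMAIL',                     # plain csv column, by header name
            'person_id': {'key': 'person_key'},   # externally defined sequential key
            ('first_name', 'last_name'): {        # two table columns from one function
                'function': split_name,
                'columns': ('FULL_NAME',),
                'defaults': {'default_last': 'unknown'},
            },
        },
    }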

Universal config values (again, a sketch follows this list):
tables: a dict of table names to table definition dicts
parallel_load: a tuple of dicts, each dict with keys:
    'tables': a tuple of table names to load in parallel
    'keys': a dict of key names to key source names
key_sources: a dict of key source names to initial values
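
A hypothetical sketch of the universal settings tying these together (all names invented; the real example files may lay this out differently):

    # Abbreviated stand-ins for full table definition dicts like the sketch above.
    people_table = {'table': 'people'}
    addresses_table = {'table': 'addresses'}

    tables = {'people': people_table, 'addresses': addresses_table}

    # Load both tables from subsets of the same csv in one parallel pass, sharing
    # the sequential key 'person_key' drawn from the key source 'person_seq' so
    # the rows can be joined later.
    parallel_load = (
        {'tables': ('people', 'addresses'),
         'keys': {'person_key': 'person_seq'}},
    )

    # Key sources map source names to the initial value of each sequential key.
    key_sources = {'person_seq': 1}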

Please see http://neworganizing.com/content/blog/ersatz1 and http://neworganizing.com/content/blog/ersatz2 for a more detailed discussion.

About

A version of pgloader stripped down to the components I found essential. It sacrifices a lot of flexibility, but throws out some overhead.
