dcp

cp for structured data

dcp is a Python library and command line tool that provides a fast and safe way to copy structured data between any two points, whether copying a CSV file to a MySQL table or an in-memory DataFrame to a JSONL file on S3.

dcp orders.csv mysql://root@localhost:3306/mydb/orders

Fast

dcp uses best-in-class underlying client libraries, employs parallelization and compression where possible, and estimates the memory, CPU, and wire costs of any copy operation to select the lowest-cost copy path across the available storages.

Safe

dcp uses Common Model Schemas under the hood as the "lingua franca" of structured data, allowing for careful preservation of logical data types and values across many formats and storage engines. Error-handling behavior is configurable: when a type conversion error is encountered -- a value is truncated or cannot be cast -- dcp can fail, relax the datatype, or set the value to null, depending on the desired behavior.
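
A minimal sketch of picking an error-handling behavior, assuming the cast_level parameter shown in the usage section below (level names other than 'strict' are an assumption, not confirmed by this README):

import dcp

# Fail the whole copy on any truncated or uncastable value.
# cast_level='strict' appears in the usage example below; other
# level names would depend on dcp's configuration.
dcp.copy(
    from_name='orders',
    from_storage='file:///tmp/dcp',
    to_storage='mysql://root@localhost:3306/mydb',
    cast_level='strict',
)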

Currently supported formats:

  • JSON
  • CSV file
  • Database table
  • Pandas DataFrame
  • Apache Arrow

Currently supported storage engines:

  • Databases: Postgres, MySQL, SQLite (and any database supported by SQLAlchemy)
  • File systems: local, S3 (coming soon)
  • Memory: Python

In addition, dcp supports related operations like inferring the schema of a dataset, conforming a dataset to a schema, and creating empty objects of a specified schema.
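
For example, schema inference is exposed as infer_fields (demonstrated more fully in the usage section below):

import dcp

records = [{'id': 1, 'name': 'widget'}]

# Infer a list of Fields describing the records
fields = dcp.infer_fields(records)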

Usage

pip install datacopy or poetry add datacopy

Command line:

dcp orders.csv mysql://localhost:3306/mydb

This command will load the orders.csv file into a MySQL table (named orders by default, after the file) on the given database, inferring the right schema from the data in the CSV.

dcp mysql://localhost:3306/mydb/orders s3://mybucket.s3/pth/orders.csv

This will export your orders table to a file on S3, in the "default" format for the StorageEngine since none was specified (for S3, a CSV file).

Python library

The Python library gives you a powerful API for more complex operations:

import dcp
from dcp import Storage

records = [{"f1": "hello", "f2": "world"}]
fields = dcp.infer_fields(records)
print(fields)
# >>> [Field(name="f1", field_type=Text), Field(name="f2", field_type=Text)]

dcp.copy(
    from_obj=records,
    to_name='records',
    to_format='csv',
    to_storage='file:///tmp/dcp',
)

assert Storage('file:///tmp/dcp').get_api().exists('records')
with Storage('file:///tmp/dcp').get_api().open('records') as f:
    print(f.read())
    # >>> f1,f2
    # >>> hello,world

dcp.copy(
    from_name='records',
    from_storage='file:///tmp/dcp',
    to_storage='postgresql://localhost:5432/mydb',
)

data_format = dcp.infer_format('records', storage='file:///tmp/dcp')
print(data_format)
# >>> CsvFileFormat

dcp.copy(
    from_name='records',
    from_storage='file:///tmp/dcp',
    to_storage='postgresql://localhost:5432/mydb',
    fields=fields,
    cast_level='strict',
)

assert Storage('postgresql://localhost:5432/mydb').get_api().exists("records")

[WIP] Adding your own Storage Engine or Data Format

dcp can easily be extended with new storage engines or data formats:

class RedisStorageApi(StorageApi):
    # Implement the operations required by the StorageApi base class
    # (the exists signature comes from the original; the other method
    # names here are illustrative)
    def exists(self, name): ...
    def remove(self, name): ...
    def record_count(self, name): ...
    def create_alias(self, name, alias): ...

class RedisStorageEngine(StorageEngine):
    scheme = "redis"
    api = RedisStorageApi

Adding a new format requires adding the handling logic for that format, for each storage class or engine that you want to support.
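
A purely hypothetical sketch of what that might look like (class and method names below are assumptions, not dcp's actual extension API):

# Hypothetical: names are illustrative, not dcp's real extension API
class ParquetFileFormat(DataFormat):
    nickname = 'parquet'

# Handling logic for this format on the local file storage engine;
# a real integration would implement similar handlers (inference,
# reading, writing) for each storage engine it supports
class LocalFileParquetHandler:
    def infer_format(self, name, storage): ...
    def infer_fields(self, name, storage): ...
    def create_empty(self, name, storage, fields): ...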
