Skip to content

walkframe/datafactory

Repository files navigation

image

image

image

image

image

Overview

datafactory makes flexible data according to the given rules.

The features are divided into field, model, container, and formatter. If you compare it to a DB, fields are columns, models are records, and containers are tables.

The great thing about the datafactory is its flexibility in type specification. Containers can also be nested.

formatter supports data formatting and file output.

Requirements

  • Python 3.5 or later.

Install

$ pip install datafactory

Usage

Basic Example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     'id': datafactory.IncrementField(),
   ...:     'x': datafactory.CycleField(['a', 'b', 'c']),
   ...:     # BLANK will be omit.
   ...:     'option': datafactory.ChoiceField([True, False, datafactory.BLANK]),
   ...: })

In [3]: container = datafactory.Container(model, 5, render=True)

In [4]: container
Out[4]:
[{'id': 1, 'x': 'a'},
 {'id': 2, 'x': 'b', 'option': False},
 {'id': 3, 'x': 'c', 'option': True},
 {'id': 4, 'x': 'a'},
 {'id': 5, 'x': 'b'}]

# specify rewrite=True, if file already exists.
In [5]: datafactory.JsonFormatter(container).write('/tmp/test.json', rewrite=True)

In [6]: !cat /tmp/test.json
[
 {
  "x": "a",
  "id": 1
 },
 {
  "x": "b",
  "id": 2,
  "option": false
 },
 {
  "x": "c",
  "id": 3,
  "option": true
 },
 {
  "x": "a",
  "id": 4
 },
 {
  "x": "b",
  "id": 5
 }
]

TSV Example

In [1]: import datafactory

In [2]: model = datafactory.ListModel([
   ...:     datafactory.IncrementField(start=10, step=5),
   ...:     datafactory.HashOfField(2, 'md5'),  # hashing value of the third column.
   ...:     datafactory.ChoiceField(['foo', 'bar', 'baz']),
   ...:     datafactory.CycleField(range(0, 30, 10)),
   ...: ]).ordering(2)  # render at first index:2(third column)

# IterContainer is saving memory, because generating an element each time.
In [3]: container = datafactory.IterContainer(model, 10)  # repeat 10 times.

In [4]: datafactory.CsvFormatter(
   ...:     container,
   ...:     delimiter='\t',
   ...:     header=['id', 'hash-of-name', 'name', 'value']
   ...: ).write('/tmp/test.csv', rewrite=True)

In [5]: !cat /tmp/test.csv
id    hash-of-name    name    value
10    acbd18db4cc2f85cedef654fccc4a4d8    foo 0
15    acbd18db4cc2f85cedef654fccc4a4d8    foo 10
20    73feffa4b7f6bb68e44cf984c85f6e88    baz 20
25    acbd18db4cc2f85cedef654fccc4a4d8    foo 0
30    acbd18db4cc2f85cedef654fccc4a4d8    foo 10
35    73feffa4b7f6bb68e44cf984c85f6e88    baz 20
40    73feffa4b7f6bb68e44cf984c85f6e88    baz 0
45    73feffa4b7f6bb68e44cf984c85f6e88    baz 10
50    37b51d194a7513e45b56f6524f2d51f2    bar 20
55    37b51d194a7513e45b56f6524f2d51f2    bar 0

Custom Example

If object is callable, it stores execution result.

Model

In [1]: import datafactory

In [2]: def square(k, i):
   ...:     return k * i
   ...:

In [3]: container = datafactory.DictContainer(square)

In [4]: container(['a', 'b', 'c', 'd', 'e'])
Out[4]: {'a': '', 'b': 'b', 'c': 'cc', 'd': 'ddd', 'e': 'eeee'}

Field

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:    'col1': (lambda r, i: i),
   ...:    'col2': (lambda r: r['col1'] + 1),
   ...:    'col3': (lambda r: r['col2'] * 2),
   ...:    'col4': 100,  # fixed value
   ...: }).ordering('col1', 'col2', 'col3')

In [3]: container = datafactory.ListContainer(model)

In [4]: container(4)
Out[4]:
[{'col1': 0, 'col2': 1, 'col3': 2, 'col4': 100},
 {'col1': 1, 'col2': 2, 'col3': 4, 'col4': 100},
 {'col1': 2, 'col2': 3, 'col3': 6, 'col4': 100},
 {'col1': 3, 'col2': 4, 'col3': 8, 'col4': 100}]

Limited number of element Example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     # x: a is 1times limited. / b is 2times limited. / c is 3times limited.
   ...:     'x': datafactory.PickoutField({'a': 1, 'b': 2, 'c': 3}, missing=None),
   ...:     # y: a is 2times limited. / b and c is 1times limited.
   ...:     'y': datafactory.PickoutField(['a', 'a', 'b', 'c'], missing='*'),
   ...:     # z: a and b can't be selected. / c is 5times limited.
   ...:     'z': datafactory.PickoutField(['c']*5, missing=None),
   ...: })

In [3]: container = datafactory.ListContainer(model)

In [4]: container(6)
Out[4]:
[{'x': 'a', 'y': 'a', 'z': 'c'},
 {'x': 'c', 'y': 'b', 'z': 'c'},
 {'x': 'c', 'y': 'a', 'z': 'c'},
 {'x': 'b', 'y': 'c', 'z': 'c'},
 {'x': 'c', 'y': '*', 'z': 'c'},
 {'x': 'b', 'y': '*', 'z': None}]

Combination Example

To generate the testdata that combines multiple elements can be achieved by using the repeat-argument of CycleField and SequenceField.

In [1]: import datafactory

In [2]: l0 = ['a', 'b']

In [3]: l1 = ['a', 'b', 'c']

In [4]: l2 = ['a', 'b', 'c', 'd']

In [5]: model = datafactory.ListModel([
   ...:     datafactory.SequenceField(l0, repeat=len(l1)*len(l2), missing=datafactory.ESCAPE),
   ...:     datafactory.CycleField(l1, repeat=len(l2)),
   ...:     datafactory.CycleField(l2),
   ...: ])

In [6]: container = datafactory.Container(model)

# by specifying the ESCAPE to missing-argument
# automatically detect end of elements and escape before reaching 10000.
In [7]: container(10000)
Out[7]:
[['a', 'a', 'a'],
 ['a', 'a', 'b'],
 ['a', 'a', 'c'],
 ['a', 'a', 'd'],
 ['a', 'b', 'a'],
 ['a', 'b', 'b'],
 ['a', 'b', 'c'],
 ['a', 'b', 'd'],
 ['a', 'c', 'a'],
 ['a', 'c', 'b'],
 ['a', 'c', 'c'],
 ['a', 'c', 'd'],
 ['b', 'a', 'a'],
 ['b', 'a', 'b'],
 ['b', 'a', 'c'],
 ['b', 'a', 'd'],
 ['b', 'b', 'a'],
 ['b', 'b', 'b'],
 ['b', 'b', 'c'],
 ['b', 'b', 'd'],
 ['b', 'c', 'a'],
 ['b', 'c', 'b'],
 ['b', 'c', 'c'],
 ['b', 'c', 'd']]

nested example

In [1]: import datafactory

In [2]: model = datafactory.Model({
   ...:     'a': datafactory.ListModel([
   ...:         datafactory.CycleField(['b', 'c']),
   ...:         datafactory.CycleField(['d', 'e']),
   ...:     ]),
   ...:     datafactory.ChoiceField(['f', 'g', 'h']): datafactory.DictContainer(lambda x: x * 2, 5)
   ...: })

In [3]: datafactory.Container(model, 10, render=True)
Out[3]:
[{'a': ['b', 'd'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'g': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'f': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'g': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['b', 'd'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}},
 {'a': ['c', 'e'], 'h': {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}}]
  • b and c appear at first of a list in a key alternatively.
  • d and e appear at second of a list in a key alternatively.

datetime Utility

choice

random choice between start and end.

In [1]: from datafactory.utils.datetime import choice


In [2]: choice(1988, '2015-11-11T11:11:11.111111')
Out[2]: datetime.datetime(2009, 11, 30, 23, 25, 43, 240031)

# tuple: datetime(*tuple), dict: datetime(**dict)
In [3]: choice((1988, 5, 22), {'year': 2015, 'month': 11, 'day': 11})
Out[3]: datetime.datetime(1996, 7, 1, 11, 14, 59, 314809)

In [4]: from datetime import datetime, date

In [5]: choice(date(1988, 5, 22), datetime(2015, 11, 11, 11, 11, 11))
Out[5]: datetime.datetime(2011, 3, 23, 19, 39, 14, 476901)

generator

generator that generate the datetime object at regular intervals.

In [1]: from datetime import timedelta
In [2]: from datafactory.utils.datetime import generator

# if you omit end-argument, then it creates an object infinitely.
In [3]: g = generator(start=2015, interval=timedelta(days=1, hours=12))

In [4]: next(g)
Out[4]: datetime.datetime(2015, 1, 1, 0, 0)

In [5]: next(g)
Out[5]: datetime.datetime(2015, 1, 2, 12, 0)

In [6]: next(g)
Out[6]: datetime.datetime(2015, 1, 4, 0, 0)

In [7]: next(g)
Out[7]: datetime.datetime(2015, 1, 5, 12, 0)

range

generate list object that includes regularly generated datetime objects element.

In [1]: from datetime import timedelta
In [2]: from datafactory.utils.datetime import range

In [3]: range(2015, '2015/2/1')
Out[3]:
[datetime.datetime(2015, 1, 1, 0, 0),
 datetime.datetime(2015, 1, 2, 0, 0),
 datetime.datetime(2015, 1, 3, 0, 0),
 datetime.datetime(2015, 1, 4, 0, 0),
 datetime.datetime(2015, 1, 5, 0, 0),
 datetime.datetime(2015, 1, 6, 0, 0),
 datetime.datetime(2015, 1, 7, 0, 0),
 datetime.datetime(2015, 1, 8, 0, 0),
 datetime.datetime(2015, 1, 9, 0, 0),
 datetime.datetime(2015, 1, 10, 0, 0),
 datetime.datetime(2015, 1, 11, 0, 0),
 datetime.datetime(2015, 1, 12, 0, 0),
 datetime.datetime(2015, 1, 13, 0, 0),
 datetime.datetime(2015, 1, 14, 0, 0),
 datetime.datetime(2015, 1, 15, 0, 0),
 datetime.datetime(2015, 1, 16, 0, 0),
 datetime.datetime(2015, 1, 17, 0, 0),
 datetime.datetime(2015, 1, 18, 0, 0),
 datetime.datetime(2015, 1, 19, 0, 0),
 datetime.datetime(2015, 1, 20, 0, 0),
 datetime.datetime(2015, 1, 21, 0, 0),
 datetime.datetime(2015, 1, 22, 0, 0),
 datetime.datetime(2015, 1, 23, 0, 0),
 datetime.datetime(2015, 1, 24, 0, 0),
 datetime.datetime(2015, 1, 25, 0, 0),
 datetime.datetime(2015, 1, 26, 0, 0),
 datetime.datetime(2015, 1, 27, 0, 0),
 datetime.datetime(2015, 1, 28, 0, 0),
 datetime.datetime(2015, 1, 29, 0, 0),
 datetime.datetime(2015, 1, 30, 0, 0),
 datetime.datetime(2015, 1, 31, 0, 0),
 datetime.datetime(2015, 2, 1, 0, 0)]

# +-3 hour noise, +5 minute noise
In [4]: range(2015, '2015-01-15', hours=3, minutes=(0, 5))
Out[4]:
[datetime.datetime(2015, 1, 1, 3, 1),
 datetime.datetime(2015, 1, 2, 0, 3),
 datetime.datetime(2015, 1, 3, 2, 0),
 datetime.datetime(2015, 1, 3, 22, 2),
 datetime.datetime(2015, 1, 4, 22, 3),
 datetime.datetime(2015, 1, 6, 0, 2),
 datetime.datetime(2015, 1, 7, 0, 4),
 datetime.datetime(2015, 1, 8, 0, 4),
 datetime.datetime(2015, 1, 8, 21, 3),
 datetime.datetime(2015, 1, 9, 22, 0),
 datetime.datetime(2015, 1, 11, 0, 0),
 datetime.datetime(2015, 1, 11, 22, 1),
 datetime.datetime(2015, 1, 12, 22, 5),
 datetime.datetime(2015, 1, 14, 3, 0),
 datetime.datetime(2015, 1, 15, 2, 5)]

# it is able to specify minus direction as interval.
In [5]: range(start='2015-5-22', end='2015-04-22', interval=timedelta(days=-1))
Out[5]:
[datetime.datetime(2015, 5, 22, 0, 0),
 datetime.datetime(2015, 5, 21, 0, 0),
 datetime.datetime(2015, 5, 20, 0, 0),
 datetime.datetime(2015, 5, 19, 0, 0),
 datetime.datetime(2015, 5, 18, 0, 0),
 datetime.datetime(2015, 5, 17, 0, 0),
 datetime.datetime(2015, 5, 16, 0, 0),
 datetime.datetime(2015, 5, 15, 0, 0),
 datetime.datetime(2015, 5, 14, 0, 0),
 datetime.datetime(2015, 5, 13, 0, 0),
 datetime.datetime(2015, 5, 12, 0, 0),
 datetime.datetime(2015, 5, 11, 0, 0),
 datetime.datetime(2015, 5, 10, 0, 0),
 datetime.datetime(2015, 5, 9, 0, 0),
 datetime.datetime(2015, 5, 8, 0, 0),
 datetime.datetime(2015, 5, 7, 0, 0),
 datetime.datetime(2015, 5, 6, 0, 0),
 datetime.datetime(2015, 5, 5, 0, 0),
 datetime.datetime(2015, 5, 4, 0, 0),
 datetime.datetime(2015, 5, 3, 0, 0),
 datetime.datetime(2015, 5, 2, 0, 0),
 datetime.datetime(2015, 5, 1, 0, 0),
 datetime.datetime(2015, 4, 30, 0, 0),
 datetime.datetime(2015, 4, 29, 0, 0),
 datetime.datetime(2015, 4, 28, 0, 0),
 datetime.datetime(2015, 4, 27, 0, 0),
 datetime.datetime(2015, 4, 26, 0, 0),
 datetime.datetime(2015, 4, 25, 0, 0),
 datetime.datetime(2015, 4, 24, 0, 0),
 datetime.datetime(2015, 4, 23, 0, 0),
 datetime.datetime(2015, 4, 22, 0, 0)]

common

noise

It is possible to specify the gap between the actual time as noise parameters. allow to specify the noise parameters are “datetimes.generator” and “datetimes.range” functions.

**noise is specified in the kwargs format and they are not required.

The available keys are same with timedelta-args.

  • days
  • hours
  • minute
  • seconds
  • microseconds

argtype

The acceptable arguments as the other than datetime type are the following.

int

It is evaluated as a year.

str

It is parsed as datetime from the numeric part of the string.

tuple

It will be passed into datetime args.

dict

It will be passed into datetime kwargs.

date

It will be converted datetime type.

history

1.0.x

Initialize.

Releases

No releases published

Packages

No packages published

Languages