notEvil/encprim

encprim

encprim is a serializer for primitive Python objects, similar to cPickle, json or msgpack, but written in pure Python. The motivation is given in the Reasons section below.

Overview

  1. initially designed as an extension of the Python module struct
  2. later inspired by msgpack and pickle
  3. encodes None, bool, int/long (arbitrary size), float, complex, str, unicode, slice and bitarray
  4. supports nesting in tuple, list, set and dict
  5. detects and exploits repeated types

Output Syntax

object: [count]type[data]
tuple / list / set: sequence of objects enclosed by () / [] / <>
dict: sequence of objects, keys first then values, enclosed by {}

Examples with data replaced by "."
(2TF2N) == (True, True, False, None, None)
{(2i..)()d.4s.} == {(3, 7): 1.2, (): 'text'}
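
The count prefix in the examples above behaves like run-length encoding over type letters. The following is a minimal sketch of that idea, assuming only the letters visible in the examples (T, F, N, i, d) and using "." as the data placeholder. `type_code` and `signature` are hypothetical helpers, not part of encprim, and the sketch reproduces the notation only, not the real byte format:

```python
from itertools import groupby

# Type letters copied from the examples above. Types that carry a payload
# get one "." per value as a stand-in for the encoded data (assumption).
HAS_DATA = {'i', 'd'}

def type_code(value):
    # bool is tested before int because bool is a subclass of int
    if value is None:
        return 'N'
    if value is True:
        return 'T'
    if value is False:
        return 'F'
    if isinstance(value, int):
        return 'i'
    if isinstance(value, float):
        return 'd'
    raise TypeError('no type letter known for %r' % (value,))

def signature(values):
    """Render a flat tuple in the '[count]type[data]' notation."""
    parts = []
    for code, run in groupby(type_code(v) for v in values):
        n = len(list(run))
        count = str(n) if n > 1 else ''          # count is omitted for a single value
        data = '.' * n if code in HAS_DATA else ''
        parts.append(count + code + data)
    return '(' + ''.join(parts) + ')'

print(signature((True, True, False, None, None)))  # (2TF2N)
print(signature((3, 7)))                            # (2i..)
```

Strings are left out on purpose: in the example `4s.` the number may denote the string length rather than a repetition count, and the sketch does not guess which.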

Getting Started

Copy the encprim directory to a location on Python's module search path (for example, next to your script).

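import encprim

# encode a single value and decode it again
x = 1
a = encprim.encode( x )
print repr(a)
b = encprim.decodes( a )
assert b == x

# bitarray is a third-party module and has to be installed separately
import bitarray
print repr(encprim.encode( bitarray.bitarray('11011') ))

# container types are disabled by default and have to be enabled explicitly
encprim.enableTypes([tuple, list, set, dict])
print repr(encprim.encode( {(): [set([])]} ))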

encode returns None when the object contains non-encodable types or recursive references. By default all container types (tuple, list, set, dict) are disabled for performance reasons (see below). Use cPickle for these types, or enable them with enableTypes.
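
Because encode signals failure by returning None, a caller can fall back to pickle for anything it rejects. The following is a minimal sketch of that pattern. `encode_with_fallback` and `decode_with_fallback` are hypothetical helpers, not part of encprim, and the tuple tag used to remember which serializer produced the data is an assumption of this sketch:

```python
import pickle

def encode_with_fallback(obj, primary_encode):
    """Try primary_encode first; use pickle if it returns None."""
    data = primary_encode(obj)
    if data is not None:
        return ('primary', data)
    return ('pickle', pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

def decode_with_fallback(tagged, primary_decode):
    """Reverse encode_with_fallback using the stored tag."""
    kind, data = tagged
    if kind == 'primary':
        return primary_decode(data)
    return pickle.loads(data)
```

With the functions shown in the snippet above, this would be called as `encode_with_fallback(obj, encprim.encode)` and `decode_with_fallback(tagged, encprim.decodes)`.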

You can start the test suite by executing the __init__.py file directly. __init__.out contains an example output.
Add the argument "-i" to enter interactive mode, where you can type in Python structures; for each one the encoded value is printed, along with its size in bytes and its size ratio compared to pickle (lower is better).

Reasons

  • why not use pickle

pickle is great, especially its power to serialize nearly everything there is, including functions/classes defined at __main__ level thanks to Oren Tirosh's monkey patch*. But if you find yourself having to serialize a large number of small objects, every byte may count.

  • why not use existing serializers like json, msgpack

Besides adding an external dependency, they were designed for other purposes and are therefore not 100% compliant with Python types. For example** json's dictionary keys need to be strings and msgpack can't distinguish between tuple and list.

  • why not

I found that pickle is quite efficient, with one exception: complex numbers. The output shouldn't be much larger than two doubles, but it somehow is. pickle also seems to store a lot of unnecessary information when it is fed a bitarray. This type is not built in, but I love it and use it extensively.

So I decided to put some effort into this module. I hope you like it or find a use case for it.
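
The point about complex numbers can be checked with the standard library alone. The sketch below packs a complex value as two doubles with struct and compares the size with pickle's output. It does not show how encprim stores complex values, and the little-endian format string `'<dd'` is an assumption chosen for illustration:

```python
import pickle
import struct

z = complex(1.5, -2.25)

# Two IEEE 754 doubles: 8 bytes each, 16 bytes in total.
packed = struct.pack('<dd', z.real, z.imag)
restored = complex(*struct.unpack('<dd', packed))

pickled = pickle.dumps(z)

print(len(packed))    # 16
print(len(pickled))   # larger than 16; the exact value depends on the pickle protocol
print(restored == z)  # True
```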

Performance

The test suite shows that encprim produces output which is, on average, 40% smaller than pickle's. Depending on the data type, the saving can rise above 90% (bitarrays with len < 16) or become insignificant (big integers). By design, the best results are achieved with single values or a flat collection of values of the same type.
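
A size comparison of this kind can be reproduced with a small helper. This is a sketch only: `size_ratio` is a hypothetical name, and the pickle protocol used as the baseline is an assumption, since the text does not say which protocol the test suite compares against:

```python
import pickle

def size_ratio(encoded, obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Return len(encoded) / len(pickle.dumps(obj)); lower is better."""
    baseline = pickle.dumps(obj, protocol)
    return float(len(encoded)) / len(baseline)
```

A ratio of 0.6 corresponds to output that is 40% smaller than pickle's. With encprim, the first argument would be the result of `encprim.encode(obj)`.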

Regarding runtime, the size optimizations come at a price. Compared to pickle, which is a valid baseline because it is pure Python too, encprim is, on average, about twice as fast when encoding and only a few percent faster when decoding. However, en/decoding single values is about 5 times faster than pickle. cPickle is fastest in any case, but encprim comes very close when en/decoding single values :)

* http://code.activestate.com/recipes/572213-pickle-the-interactive-interpreter-state/
** to the best of my knowledge
