Skip to content
This repository has been archived by the owner on Apr 12, 2020. It is now read-only.
/ spavro Public archive
forked from pluralsight/spavro

Spavro is a (sp)eedier avro library -- Spavro is a fork of the official Apache AVRO python 2 implementation with the goal of greatly improving data read deserialization and write serialization performance.

License

Notifications You must be signed in to change notification settings

Igosuki/spavro

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

(Sp)eedier Avro - Spavro

Spavro is a fork of the official Apache AVRO python 2 implementation with the goal of greatly improving data read deserialization and write serialization performance.

Spavro is also python 2/3 compatible (instead of a spearate project / implementation). Currently only tested using py2.7 and py3.5.

Implementation Details

There are three primary differences between the official implementation and Spavro. First, Spavro uses a C extension, created with Cython, to accelerate some of the low level binary serialization logic. Additionally Spavro uses a different model for handling schemas. Spavro attemps to parse the write and read schemas once and only once and creates recursive reader/writer functions from the schema definition. These reader/writer functions encode the type structure of the schema so no additional lookups are necessary while processing data. The last difference is that Spavro has been updated to be both Python 2 and Python 3 compatible using the six library. The official apache AVRO implementation has two separate codebases for Python 2 and Python 3 and spavro only has one.

This has the net effect of greatly improving the throughput of reading and writing individual datums, since the schema isn't interrogated for every datum. This can be especially beneficial for "compatible" schema reading where both a read and write schema are needed to be able to read a complete data set.

Performance

Some (very non-scientific) benchmarks on an arbitray data set (using my development laptop) show a 2-17x improvement in serialization / deserialization throughput. A simple test was run using ~135k relatively large (5k) real world records. YMMV but in all cases spavro was faster than both the default apache implementation and the "fastavro" library.

deserialize 135k avro records

default implementation 360.23 records/s

fastavro library 2003.62 records/s

spavro 6521.34 records/s

Serialize 135k avro records

default implementation 464.61 records / s

fastavro library 686.62 records / s

spavro library 4719.05 records / s

API

Spavro keeps the default Apache library's API. This allows spavro to be a drop-in replacement for code using the existing Apache implementation.

Tests

Since the API matches the existing library, the majority of the existing Apache test suite is used to verify the correct operation of Spavro. Spavro adds some additional correctness tests to compare new vs old behaviors as well as some additional logic tests above and beyond the original library. Some of the java-based "map reduce" tests (specifically the tether tests) were removed because Spavro does not include the java code to implement that logic.

About

Spavro is a (sp)eedier avro library -- Spavro is a fork of the official Apache AVRO python 2 implementation with the goal of greatly improving data read deserialization and write serialization performance.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 80.5%
  • C 18.9%
  • Shell 0.6%