The Python avro package distributed by Apache is incredibly slow. When dealing with very large datasets, its performance is unacceptable, which makes it more or less unfit for production use.
fastavro is less feature-complete than avro; however, it reads and writes Avro files significantly faster.
With the Cython extension modules compiled, fastavro is really fast, almost on par with the native Java implementation of Avro.
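How much faster depends on your data and machine, so it is worth measuring on your own files. The helper below is a minimal sketch of such a measurement (Python 3, standard library only). The name `records_per_second` is hypothetical and not part of fastavro or avro; it simply counts how many records an iterable yields per second.

```python
import time


def records_per_second(open_records, repeats=3):
    """Return the best observed throughput (records/second) over `repeats` runs.

    `open_records` is a zero-argument callable that returns a fresh iterable
    of records on every call (a hypothetical hook, not a fastavro API).
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        # Consume the iterable completely, counting records as we go.
        count = sum(1 for _ in open_records())
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, count / elapsed)
    return best
```

Passing one generator function that opens an Avro file and yields records from fastavro, and another that does the same with the avro package, should give directly comparable numbers.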
The focus of this fork (e-heller/fastavro) is on reimplementing and optimizing fastavro's Cython extension modules in pure Cython code to achieve screaming speed.
I hope these improvements will eventually be pulled into the upstream repo, but that may take some time.
This fastavro fork supports these Python versions:
- Python 2.6
- Python 2.7
- Python 3.4
- Python 3.5
Because fastavro is shipped with Cython extension modules, you will require the following to build and install:
- Cython to generate the C extension files
- Some kind of C compiler like gcc
If you can't compile the extension modules, you can still use the pure Python implementation, but you will be missing out on the significant performance improvements.
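Both prerequisites can be checked before starting a build. This is a minimal sketch using only the standard library (Python 3). The function name `build_prerequisites` is hypothetical, and the compiler names it tries are illustrative guesses, not a list taken from the fastavro build scripts.

```python
import importlib.util
import shutil


def build_prerequisites(compilers=("gcc", "cc", "clang", "cl")):
    """Report whether Cython is importable and which C compiler is on PATH.

    The default compiler names are illustrative guesses only.
    """
    cython_found = importlib.util.find_spec("Cython") is not None
    # Pick the first candidate that resolves to an executable on PATH.
    compiler = next((c for c in compilers if shutil.which(c)), None)
    return {"cython": cython_found, "compiler": compiler}


if __name__ == "__main__":
    status = build_prerequisites()
    print("Cython importable:", status["cython"])
    print("C compiler found:", status["compiler"] or "none on PATH")
```

If either check fails, the pure Python implementation is the fallback.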
First, install Cython:
$ pip install cython
To build the compiled extensions, run the following from the root fastavro source directory:
$ make cfiles
$ python setup.py build
Assuming the build worked, you can now install the package:
$ python setup.py install
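After installing, it can be useful to confirm that compiled extension modules actually made it into the installed package, since without them you are running the slower pure Python code. The sketch below (Python 3, standard library only) lists files in a package's top-level directory that carry a compiled-extension suffix. The helper name `compiled_extensions` is hypothetical, and it assumes the extensions sit directly in the package directory rather than in a subpackage.

```python
import importlib.machinery
import importlib.util
import os


def compiled_extensions(package_name):
    """List compiled extension files found in an installed package's directory.

    An empty list suggests that only pure Python modules are available.
    """
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        spec = None
    # Plain modules and missing packages have no directory to inspect.
    if spec is None or not spec.submodule_search_locations:
        return []
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    found = []
    for directory in spec.submodule_search_locations:
        if not os.path.isdir(directory):
            continue
        for filename in sorted(os.listdir(directory)):
            if filename.endswith(suffixes):
                found.append(os.path.join(directory, filename))
    return found
```

Calling `compiled_extensions("fastavro")` in the environment you installed into should return a non-empty list if the build succeeded.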
# Reading an Avro file
import fastavro

with open('some_file.avro', 'rb') as fo:
    # Create a `Reader` object
    reader = fastavro.Reader(fo)

    # Obtain the writer's schema if required
    schema = reader.schema

    # Iteratively read the records:
    for record in reader:
        process_record(record)

    # Or, instead of iterating, read the records in one shot:
    records = list(reader)
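With very large files, `list(reader)` holds every record in memory at once. Assuming the `Reader` object behaves like an ordinary Python iterator, as the `for` loop above suggests, records can instead be consumed in fixed-size chunks. The helper `batches` below is a hypothetical, generic sketch that works on any iterable; it is not part of fastavro.

```python
from itertools import islice


def batches(records, size):
    """Yield lists of up to `size` items from any iterable of records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls at most `size` items without reading further ahead.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Inside the `with` block, `for chunk in batches(reader, 10000):` would then hand at most 10000 records at a time to your own processing code.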
# Writing an Avro file
import fastavro

# Define an Avro Schema as a dict:
schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}

# Create some records:
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
]

# Open a file, and write the records:
with open('some_file.avro', 'wb') as out:
    fastavro.write(out, schema, records)
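Records that do not match the schema are easier to debug before they reach the writer. The following is a minimal sketch of a pre-check for flat record schemas like the one above (Python 3). It covers only the `string`, `int` and `long` primitives, using the value ranges the Avro specification gives for the two integer types, and silently skips any other type. The name `find_mismatches` is hypothetical; this is not how fastavro validates data.

```python
# Avro's int is a 32-bit signed integer, long is a 64-bit signed integer.
INT_RANGE = (-2**31, 2**31 - 1)
LONG_RANGE = (-2**63, 2**63 - 1)


def _is_integer(value, bounds):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (isinstance(value, int) and not isinstance(value, bool)
            and bounds[0] <= value <= bounds[1])


CHECKS = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: _is_integer(v, INT_RANGE),
    "long": lambda v: _is_integer(v, LONG_RANGE),
}


def find_mismatches(schema, records):
    """Return (record_index, field_name, reason) tuples for a flat record schema."""
    problems = []
    for index, record in enumerate(records):
        for field in schema["fields"]:
            name, avro_type = field["name"], field["type"]
            if name not in record:
                problems.append((index, name, "missing"))
            elif (isinstance(avro_type, str) and avro_type in CHECKS
                  and not CHECKS[avro_type](record[name])):
                problems.append((index, name, "not a valid " + avro_type))
    return problems
```

An empty result from `find_mismatches(schema, records)` means none of the checked fields looked wrong; it does not guarantee that the write will succeed.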
fastavro is missing many of the features of the official Apache avro package. Essentially, fastavro only supports reading and writing in Avro format.
Notably, there is no support for:
- The Protocol Wire Format or any kind of Wire Transmission
There are also some limitations with reading and writing:
- Incomplete support for 'reader's schemas' - i.e., reading an Avro file written with one schema (the 'writer's schema') and interpreting the data with another schema (the 'reader's schema'). See Schema Resolution in the Avro specification for details.
- No support for Aliases - i.e., the aliases attribute in named Avro types. See Aliases in the Avro specification for details.
- No support for Logical Types - i.e., the new logicalType attribute that adds support for derived types in Avro 1.8 like decimal, date, time-millis, timestamp-millis, etc. See Logical Types in the Avro specification for details.
  Note: the ViaSat/fastavro fork currently has support for some Logical Types. I hope to incorporate this work soon.
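Some of these gaps can be worked around by post-processing the plain dicts that come out of the reader. The sketch below (Python 3) is only that, a sketch: `project_record` and `decode_logical` are hypothetical helpers, not part of fastavro. The first applies a small slice of the Schema Resolution rules to flat records (match fields by name or alias, fill in defaults, drop unknown fields) and ignores type promotion, unions and nested records. The second decodes a few logical types from their underlying Avro representation as described in the specification.

```python
import datetime
import decimal

EPOCH = datetime.datetime(1970, 1, 1)
_MISSING = object()


def project_record(record, reader_fields):
    """Reshape a flat record dict to a reader's list of field definitions.

    Each entry of `reader_fields` is a field dict as it appears in a schema:
    'name', and optionally 'aliases' and 'default'.
    """
    result = {}
    for field in reader_fields:
        # Match by name first, then by any alias of the reader's field.
        candidates = [field["name"]] + list(field.get("aliases", []))
        value = next((record[c] for c in candidates if c in record), _MISSING)
        if value is _MISSING:
            if "default" not in field:
                raise KeyError("no value or default for field %r" % field["name"])
            value = field["default"]
        result[field["name"]] = value
    # Fields present in `record` but absent from `reader_fields` are dropped.
    return result


def decode_logical(value, logical_type, scale=0):
    """Convert a raw Avro value to a Python object for a few logical types."""
    if logical_type == "date":  # int: days since 1970-01-01
        return EPOCH.date() + datetime.timedelta(days=value)
    if logical_type == "time-millis":  # int: milliseconds after midnight
        return (EPOCH + datetime.timedelta(milliseconds=value)).time()
    if logical_type == "timestamp-millis":  # long: ms since the epoch (naive UTC)
        return EPOCH + datetime.timedelta(milliseconds=value)
    if logical_type == "decimal":  # bytes: big-endian two's-complement unscaled value
        unscaled = int.from_bytes(value, byteorder="big", signed=True)
        sign = 1 if unscaled < 0 else 0
        digits = tuple(int(d) for d in str(abs(unscaled)))
        return decimal.Decimal((sign, digits, -scale))
    raise ValueError("unsupported logical type: %r" % (logical_type,))
```

For example, a reader field `{'name': 'temperature', 'aliases': ['temp'], 'type': 'int'}` would pick up the `temp` value from the weather records shown earlier.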
If you want to play around and modify any of the Cython .pyx files, you will need to recompile the C module code.
You will require:
- Cython to regenerate the C files
- Some kind of C compiler
The easiest way to recompile the C extensions is as follows. From the root fastavro source directory:
$ make build
The make build command first calls cython to generate the C files, then compiles them via setup.py. (The exact command is python setup.py build_ext -i.)
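If make is not available, the same two steps can be driven from Python. This is a rough sketch of what the description above implies, not a transcription of the Makefile, which may pass extra options. The function names are hypothetical, and you have to supply the paths of the `.pyx` files in your checkout yourself.

```python
import subprocess
import sys


def build_commands(pyx_files):
    """Return the commands for the two build steps as argument lists."""
    # Step 1: translate each .pyx file to C with the cython command-line tool.
    commands = [["cython", path] for path in pyx_files]
    # Step 2: compile the generated C files in place via setup.py.
    commands.append([sys.executable, "setup.py", "build_ext", "-i"])
    return commands


def run_build(pyx_files, cwd="."):
    """Run each step from `cwd`, stopping at the first failure."""
    for command in build_commands(pyx_files):
        subprocess.check_call(command, cwd=cwd)
```

`run_build` should be called with `cwd` set to the root fastavro source directory, so that `setup.py` and relative `.pyx` paths resolve.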
If you are feeling more adventurous, and you're on a Linux-y platform with gcc, you can try to compile the extensions with the GCC settings I personally use:
$ make compile
This will compile the extensions "in place" in the source directory.
To install the package with your modifications, you'll just have to manually copy the fastavro package directory to your Python site-packages directory.
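The manual copy can be scripted so that it always targets the site-packages directory of the interpreter you are running. A minimal sketch (Python 3, standard library only); `copy_package` is a hypothetical helper, and note that it deletes any existing copy of the package at the destination before copying.

```python
import os
import shutil
import sysconfig


def copy_package(package_dir, site_packages=None):
    """Copy a package directory into site-packages, replacing any existing copy."""
    if site_packages is None:
        # 'platlib' is where packages containing compiled extensions belong.
        site_packages = sysconfig.get_paths()["platlib"]
    name = os.path.basename(os.path.normpath(package_dir))
    target = os.path.join(site_packages, name)
    if os.path.isdir(target):
        shutil.rmtree(target)
    shutil.copytree(package_dir, target)
    return target
```

Run from inside an activated virtual environment, the default destination resolves to that environment rather than to the system Python.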
I definitely recommend working inside a virtual environment if you plan to modify and rebuild fastavro. If you are unfamiliar with virtual environments, check out virtualenv or pew. Personally, I recommend pew.
See the ChangeLog