avro-serializer

Serialize in parallel lots of small arbitrary files using the avro format.

The toolkit can be used for data ingestion from and edge node, the resulting avro files are automatically ingested into HDFS.

You can edit the file constants to set the max data size in megabytes to be stored in every avro file (MAX_BATCH_SIZE) and the max number of processes to run in parallel (PROCESSES).

In my usecase I used json further up in the pipeline so I decided to encode the file content in base64, a more standard way of encoding binary values in text/json rather than the raw bytes, but feel free to change the schema provided and the code to accommodate your needs.

(useful posts on base64)

http://mail-archives.apache.org/mod_mbox/avro-dev/200906.mbox/%3C444684557.1243962607373.JavaMail.jira@brutus%3E

http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64

Features:

Native files compression in avro (deflated codec)
Parallel avro file creation with a configurable file size greater that the hdfs block size
Schema and metadata avro capabilities (you can add your own additional fields to the schema provided)
Parallel ingestion into hdfs

Requirements:

Python 2.7
Avro 1.7.4 libraries for Python http://avro.apache.org/docs/current/gettingstartedpython.html

Improvements:

Include a map reduce based version
Improve performance using iterators rather that lists
Add a proper command-line parsing (optparse)
Include snappy compression

How to use it:

find [input_path] -name '.' –print | python serializer.py [output_temp_path] [hdfs_path]

find /mydata/ -name '.' –print | python serializer.py /tmp/ /tmp/out/

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
README.md		README.md
deserializer.py		deserializer.py
hdfs_put.py		hdfs_put.py
schema.avsc		schema.avsc
serializer.py		serializer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

deserializer.py

deserializer.py

hdfs_put.py

hdfs_put.py

schema.avsc

schema.avsc

serializer.py

serializer.py

Repository files navigation

avro-serializer

Features:

Requirements:

Improvements:

How to use it:

About

Releases

Packages

Contributors 2

Languages

cpiva/avro-serializer

Folders and files

Latest commit

History

Repository files navigation

avro-serializer

Features:

Requirements:

Improvements:

How to use it:

About

Resources

Stars

Watchers

Forks

Languages