Serialize in parallel lots of small arbitrary files using the avro format.

cpiva/avro-serializer

avro-serializer

Serialize in parallel lots of small arbitrary files using the avro format.

The toolkit can be used for data ingestion from an edge node; the resulting avro files are automatically ingested into HDFS.

You can edit the constants file to set the maximum data size, in megabytes, stored in each avro file (MAX_BATCH_SIZE) and the maximum number of processes to run in parallel (PROCESSES).
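As a rough illustration of how a size limit like MAX_BATCH_SIZE could drive batching, here is a minimal sketch. The values, the `make_batches` helper and its input format are assumptions made for this example, not the code shipped in the repository.

```python
# Illustrative values only; the real defaults live in the constants file.
MAX_BATCH_SIZE = 128  # megabytes of input data per avro file (assumed value)
PROCESSES = 4         # worker processes to run in parallel (assumed value)


def make_batches(paths_with_sizes, max_batch_mb=MAX_BATCH_SIZE):
    """Group (path, size_in_bytes) pairs into consecutive batches.

    A batch is closed as soon as adding the next file would push it past
    max_batch_mb. A single file larger than the limit gets a batch of its own.
    """
    limit = max_batch_mb * 1024 * 1024
    batches, current, current_size = [], [], 0
    for path, size in paths_with_sizes:
        # Start a new batch when this file would exceed the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be written to one avro file, with up to PROCESSES batches handled at the same time.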

In my use case JSON is used further up the pipeline, so I decided to encode the file content in base64, which is a more standard way of representing binary values in text/JSON than raw bytes. Feel free to change the provided schema and the code to suit your needs.

(useful posts on base64)

http://mail-archives.apache.org/mod_mbox/avro-dev/200906.mbox/%3C444684557.1243962607373.JavaMail.jira@brutus%3E

http://stackoverflow.com/questions/1443158/binary-data-in-json-string-something-better-than-base64
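To make the base64 choice concrete, the sketch below shows binary content surviving a round trip through JSON. The record layout and the field names `filename` and `content` are hypothetical and may differ from the schema provided with the project.

```python
import base64
import json


def encode_record(filename, raw_bytes):
    """Build a JSON-safe record; the binary content is stored as base64 text."""
    return {
        "filename": filename,  # hypothetical field name
        "content": base64.b64encode(raw_bytes).decode("ascii"),  # hypothetical field name
    }


def decode_content(record):
    """Recover the original bytes from a record built by encode_record."""
    return base64.b64decode(record["content"])


record = encode_record("sample.bin", b"\x00\xff\x10")
as_json = json.dumps(record)  # works because every value is plain text
restored = decode_content(json.loads(as_json))
```

Raw bytes such as `\xff` cannot be placed in a JSON string directly, which is why the content is converted to text first, at the cost of roughly a third more space.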

Features:

  • Native file compression in avro (deflate codec)
  • Parallel avro file creation with a configurable file size greater than the HDFS block size
  • Avro schema and metadata capabilities (you can add your own additional fields to the schema provided)
  • Parallel ingestion into HDFS
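As an example of the schema point above, the sketch below extends a record schema with one extra optional field. The record name, the field names and the `source_host` addition are all made up for illustration; the schema provided with the project may look different.

```python
import json

# Hypothetical avro record schema, written as a Python dict.
FILE_SCHEMA = {
    "type": "record",
    "name": "FileRecord",  # assumed record name
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "content", "type": "string"},  # base64-encoded file content
    ],
}


def add_optional_string_field(schema, field_name):
    """Return a copy of the schema with an extra nullable string field."""
    extended = json.loads(json.dumps(schema))  # deep copy via JSON round trip
    extended["fields"].append(
        # A union whose first branch is "null" lets the default be null.
        {"name": field_name, "type": ["null", "string"], "default": None}
    )
    return extended


extended_schema = add_optional_string_field(FILE_SCHEMA, "source_host")
schema_json = json.dumps(extended_schema, indent=2)  # text form of the schema
```

With the Apache Avro Python package, schema text like this is usually parsed and passed to `DataFileWriter`, whose `codec` argument accepts `"deflate"` for the compression mentioned in the first feature.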

Requirements:

Improvements:

  • Include a map reduce based version
  • Improve performance using iterators rather than lists
  • Add proper command-line parsing (optparse)
  • Include snappy compression

How to use it:

find [input_path] -name '*.*' -print | python serializer.py [output_temp_path] [hdfs_path]

find /mydata/ -name '*.*' -print | python serializer.py /tmp/ /tmp/out/
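The commands above pipe a list of paths into the script. A minimal Python 3 sketch of that flow follows: read one path per line, fan the work out to a process pool, then build the HDFS upload command. The helper names are hypothetical and this is not the actual contents of serializer.py; the only external command assumed is the standard `hdfs dfs -put`.

```python
from multiprocessing import Pool


def read_paths(lines):
    """Return the non-empty paths from an iterable of lines such as sys.stdin."""
    paths = (line.rstrip("\n") for line in lines)
    return [path for path in paths if path]


def hdfs_put_command(local_file, hdfs_dir):
    """Build the argument list that copies one local file into HDFS.

    The command is only constructed here; pass it to subprocess to run it.
    """
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


def run_parallel(worker, items, processes):
    """Apply a top-level (picklable) worker function to each item in a pool."""
    with Pool(processes) as pool:
        return pool.map(worker, items)
```

In a full script, `read_paths(sys.stdin)` would supply the file list, and `run_parallel` would be called once to write the avro batches and again to upload them.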
