This repository has been archived by the owner on Dec 11, 2020. It is now read-only.

DaniSancas/soxta


soxta

SOXTA - StackOverflow's XML To AVRO

How to run

First, download and uncompress the data from the Archive.org public dataset: https://archive.org/details/stackexchange. Given the folder structure below, the uncompressed XML files go in the xml_data folder, the avro_schemas folder holds the schema files used to convert from XML to AVRO, and the avro_data folder starts out empty:

.
├── avro_data       # Empty, to store Avro converted files
├── avro_schemas    # Avro schema files
└── xml_data        # Uncompressed data from Archive.org
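The layout above can be prepared and checked with a few lines of Python. This is only a sketch and not part of soxta: the `prepare_layout` helper and the report keys are hypothetical names, and the `*.xml` / `*.avsc` file extensions are an assumption based on the example command in this README.

```python
from pathlib import Path
import tempfile

# Folder names taken from the structure shown above.
EXPECTED_DIRS = ("avro_data", "avro_schemas", "xml_data")


def prepare_layout(root):
    """Create any missing folder under root and summarise what it contains."""
    root = Path(root)
    for name in EXPECTED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return {
        # Assumed extensions: .xml for the dump files, .avsc for the schemas.
        "xml_files": sorted(p.name for p in (root / "xml_data").glob("*.xml")),
        "schema_files": sorted(p.name for p in (root / "avro_schemas").glob("*.avsc")),
        "avro_data_empty": not any((root / "avro_data").iterdir()),
    }


# Demonstration in a throwaway directory; point it at the real project root instead.
with tempfile.TemporaryDirectory() as tmp:
    report = prepare_layout(tmp)
    print(report)
```

An empty `xml_files` list means the Archive.org data has not been uncompressed into xml_data yet.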

Convert from XML to AVRO:

Then run the soxta.py script, passing the XML file to convert, the AVRO schema file, and the output path for the resulting AVRO file:

$ soxta.py xml_data/Posts.xml avro_schemas/Posts.avsc avro_data/Posts.avro
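To give an idea of what such a conversion involves, here is a minimal sketch. It is not the actual soxta.py code, which this README does not show. It assumes the Stack Exchange dump format, where each record is a `<row>` element whose attributes are all strings, and it uses a trimmed-down, hypothetical schema in place of the real files in avro_schemas. The helper names (`field_type`, `rows_to_records`, `write_avro`) are illustrative, and `write_avro` relies on the third-party fastavro package, which is an assumption about tooling rather than something soxta is known to use.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema; the real schemas live in avro_schemas/.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Post",
  "fields": [
    {"name": "Id", "type": "int"},
    {"name": "Score", "type": ["null", "int"], "default": null},
    {"name": "Title", "type": ["null", "string"], "default": null}
  ]
}
"""

# XML attributes are always strings, so cast them to the Avro primitive type.
CASTS = {"int": int, "long": int, "float": float, "double": float, "string": str}


def field_type(avro_type):
    """Return the first non-null primitive type name of a field."""
    if isinstance(avro_type, list):  # union such as ["null", "int"]
        return next(t for t in avro_type if t != "null")
    return avro_type


def rows_to_records(xml_source, schema):
    """Yield one dict per <row> element, cast to the schema's field types."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        record = {}
        for field in schema["fields"]:
            raw = elem.attrib.get(field["name"])
            cast = CASTS[field_type(field["type"])]
            record[field["name"]] = None if raw is None else cast(raw)
        yield record
        elem.clear()  # drop the element's data so large dumps stay cheap to stream


def write_avro(records, schema, out_path):
    """Write records to an Avro file (assumes fastavro is installed)."""
    from fastavro import parse_schema, writer
    with open(out_path, "wb") as out:
        writer(out, parse_schema(schema), records)


# Small in-memory demonstration that needs no data download.
sample = b'<posts><row Id="4" Score="12" Title="Hello" /><row Id="7" /></posts>'
schema = json.loads(SCHEMA_JSON)
records = list(rows_to_records(io.BytesIO(sample), schema))
print(records)
```

Streaming with `iterparse` matters here because files such as Posts.xml can be far larger than available memory; loading the whole tree at once would not be practical.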
