A collection of tools to download, parse, and standardize sequence metadata from NCBI databases.
Written by Remi Marchand between May 13, 2016 and August 26, 2016.
By default, these tools operate on data from the Sequence Read Archive (SRA) database.
The database can be found here: http://www.ncbi.nlm.nih.gov/sra
Main program (metadata.py) that queries the database and downloads XML files based on organism name and date.
Usage: metadata.py options (run python metadata.py -h to see options)
Download in Bulk: bash download.sh organism start_date end_date
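As a rough illustration of querying by organism name and date, the sketch below builds a search URL for the SRA database through NCBI's E-utilities esearch endpoint. Whether metadata.py uses this endpoint is an assumption; the function name build_sra_query_url and the YYYY/MM/DD date format are hypothetical choices made for this example. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; metadata.py may query differently).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_query_url(organism, start_date, end_date, retmax=100):
    """Build an esearch URL for SRA records of one organism in a date range.

    Hypothetical helper for illustration; dates are given as YYYY/MM/DD.
    """
    params = {
        "db": "sra",                          # search the Sequence Read Archive
        "term": '"%s"[Organism]' % organism,  # restrict the search to one organism
        "datetype": "pdat",                   # filter on publication date
        "mindate": start_date,
        "maxdate": end_date,
        "retmax": retmax,                     # maximum number of IDs returned
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_sra_query_url("Salmonella enterica", "2016/05/13", "2016/08/26"))
```

The returned URL can be fetched with any HTTP client; the response lists record IDs, which would then need a second request to retrieve the XML metadata itself.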
Main program (standardize.py) that standardizes the relevant columns of the input CSV files.
Usage: standardize.py csv_files
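One way to standardize a column is to map each free-text value to the closest entry in a list of accepted terms by edit distance. The sketch below assumes that approach; it is not taken from standardize.py. The names levenshtein and closest_term, the max_distance threshold, and the example vocabulary are all hypothetical, and the distance is computed in pure Python so the example runs without any third-party package.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_term(value, vocabulary, max_distance=3):
    """Return the vocabulary entry nearest to value, or value unchanged if none is close."""
    cleaned = value.strip().lower()
    best = min(vocabulary, key=lambda term: levenshtein(cleaned, term.lower()))
    if levenshtein(cleaned, best.lower()) <= max_distance:
        return best
    return value


if __name__ == "__main__":
    platforms = ["Illumina", "PacBio", "Ion Torrent"]  # hypothetical accepted terms
    print(closest_term("ilumina ", platforms))
```

Values that are not within max_distance of any accepted term are returned unchanged, so unexpected entries stay visible for manual review instead of being silently rewritten.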
- lxml (install via pip as: python -m pip install lxml)
- Levenshtein (install from: https://pypi.python.org/pypi/python-Levenshtein/0.12.0)
- If on a Mac, add the path to Standardize_Metadata to PYTHONPATH: export PYTHONPATH="${PYTHONPATH}:Path_to_Standardize_Metadata"