Command-line tools for manipulating FASTA files, developed for use at the Ferris Lab, Children's Hospital New Orleans

theJohnnyBrown/ferristools

FerrisTools

FerrisTools is a collection of small command-line tools for manipulating FASTA files, developed for use by researchers at Children's Hospital New Orleans. Some of the scripts perform small general-purpose operations; others are intended to integrate our workflow with QIIME more smoothly.

create_mapping.py

USAGE:

python create_mapping.py -m MappingFile.txt -o NewMapping.txt

Suggested workflow:

  • run this script as shown above

  • run check_id_map.py:

      macqiime check_id_map.py -m NewMapping.txt -o checkmap -j run_prefix
    
  • check log of check_id_map.py for relevant errors:

      grep -v "Removed bad chars" checkmap/NewMapping.log
    
  • take subsets of the .fna and .qual files as necessary so that both contain the same set of samples (see seq_subset.py and fnaview.py below)

  • run split_libraries.py:

      macqiime split_libraries.py -e 0 -m checkmap/NewMapping_corrected.txt -f MySeqs.fna -q MyQual.qual -o splib-out -j run_prefix -b 8
    

create_mapping.py adds a 'run_prefix' column to the mapping file, allowing QIIME's split_libraries.py to demultiplex reads by an already determined sample name as well as by barcode. This is useful when the sequencing facility has already labeled the reads by sample.
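As a rough illustration of what adding a 'run_prefix' column involves, here is a minimal sketch. It is not the code in create_mapping.py: the function name add_run_prefix and the prefix_for callback are hypothetical, and using the sample ID itself as the run prefix is an assumption. The column is inserted before the last column because QIIME expects Description to come last.

```python
def add_run_prefix(mapping_text, prefix_for=lambda sample_id: sample_id):
    """Return tab-separated mapping text with a run_prefix column added.

    prefix_for is a hypothetical hook that turns a sample ID into its run
    prefix; by default (an assumption) the sample ID is used unchanged.
    """
    out = []
    for line in mapping_text.splitlines():
        fields = line.split("\t")
        if line.startswith("#SampleID"):
            # Header row: add the new column name before the last column.
            fields.insert(len(fields) - 1, "run_prefix")
        elif line.startswith("#") or not line.strip():
            # Other comment lines and blank lines pass through untouched.
            out.append(line)
            continue
        else:
            # Data row: add the run prefix for this sample.
            fields.insert(len(fields) - 1, prefix_for(fields[0]))
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

Calling add_run_prefix on the contents of MappingFile.txt and writing the result to NewMapping.txt would mirror the command shown in USAGE.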

fasta.py :

USAGE:

python fasta.py keyfile.txt fastafile.fna [--liberal | -l]

Runs QA steps to remove primers, barcodes, homopolymers and chimeras from the data.

ARGS:

  • keyfile.txt: the "keyfile" or mapping file.
  • fastafile.fna: the file to be preprocessed
  • --liberal or -l: see odd cases below. If this flag is not provided, the default is 'conservative' mode.

Odd cases:

  • sequence id not in keyfile: throw out sequence, warn
  • sequence does not start with barcode: ignore
  • sequence does not match primer (maybe primer was already stripped):
    • conservative mode - throw out
    • liberal mode - ignore (whole operation is idempotent in liberal mode)
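The odd cases above can be sketched as a single decision function. This is only an illustration under stated assumptions, not the logic of fasta.py itself: the name clean_sequence is hypothetical, the key is assumed to map a sequence ID directly to a (barcode, primer) pair, and "ignore" is read as "leave the sequence as it is and carry on". Homopolymer and chimera removal are not covered.

```python
import sys

def clean_sequence(seq_id, seq, key, liberal=False):
    """Return the trimmed sequence, or None if the read should be thrown out.

    key is assumed (hypothetically) to map seq_id -> (barcode, primer).
    """
    if seq_id not in key:
        # Sequence ID not in keyfile: throw out the sequence and warn.
        print("warning: %s not in keyfile, discarding" % seq_id, file=sys.stderr)
        return None
    barcode, primer = key[seq_id]
    if seq.startswith(barcode):
        seq = seq[len(barcode):]
    # If the barcode is absent, the sequence is left as is.
    if seq.startswith(primer):
        return seq[len(primer):]
    # Primer not found: keep the read in liberal mode, drop it otherwise.
    return seq if liberal else None
```

Under this reading, running an already-trimmed read through liberal mode again usually returns it unchanged, which is the sense in which the operation is idempotent; conservative mode would discard the same read.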

seq_subset.py :

USAGE:

python seq_subset.py <fastafile> namestems

Where namestems is a list in quotes, with entries separated by commas, e.g. "JN031811-1, JN031811-2, JN031811-3".

--or--

python seq_subset.py <fastafile> -f stemsfile

Where stemsfile is the path to a file containing one list entry per line.

Any sequence in the FASTA file whose name begins with one of the entries in the list will be printed.
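The prefix matching described above can be sketched as follows. The function names parse_stems and subset are hypothetical, and case-sensitive matching is an assumption; the actual script may differ.

```python
def parse_stems(arg):
    """Split a quoted, comma-separated list of name stems into a tuple."""
    return tuple(s.strip() for s in arg.split(",") if s.strip())

def subset(lines, stems):
    """Yield the lines of every FASTA record whose name starts with a stem."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # str.startswith accepts a tuple and tests each prefix in turn.
            keep = line[1:].startswith(stems)
        if keep:
            yield line
```

Because the match is by prefix, a stem such as "JN031811-1" would also select a sequence named "JN031811-10" in this sketch.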

fnaview.py :

USAGE:

python fnaview.py fastafile.fna

Each sample ID in fastafile is printed once. To count the number of samples in a file, pipe the output to wc:

python fnaview.py fastafile.fna | wc -l

To check whether two files (or a fasta and a qual file) contain the same sample names, do this:

python fnaview.py fastafile.fna | sort > f1.fna
python fnaview.py other_fastafile.fna | sort > f2.fna
diff f1.fna f2.fna

If the diff command produces no output, the two files contain the same set of samples.
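The same comparison can be sketched in Python. How fnaview.py derives a sample ID from a sequence name is not stated here, so the default sample_of rule below (everything before the last underscore of the first header token) is purely an assumption, and the function names are hypothetical.

```python
def sample_ids(lines, sample_of=lambda name: name.rsplit("_", 1)[0]):
    """Return the set of sample IDs found in the FASTA header lines.

    sample_of is a hypothetical rule mapping a sequence name to a sample ID.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(sample_of(tokens[0]))
    return ids

def sample_diff(lines_a, lines_b):
    """Return the sample IDs present in only one of the two inputs, sorted."""
    return sorted(sample_ids(lines_a) ^ sample_ids(lines_b))
```

An empty list from sample_diff corresponds to diff producing no output.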

checkseqs.py :

USAGE:

python checkseqs.py seqs.fna

Where seqs.fna is the fasta file generated by split_libraries.py.

The script checks that the sample ID assigned by QIIME matches the original, and prints any mismatches. You may wish to redirect the output to a file, because where there is one mismatch there are likely to be many.

The script prints no output if it finds no mismatched sequences.

To save the mismatches to a file, where they can be inspected or counted, use the script like so:

python checkseqs.py seqs.fna > mismatches.fna
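A check of this kind might look like the sketch below. It relies on the header layout that split_libraries.py writes (the assigned name, such as SampleID_0, followed by the original read name), and it assumes, as a guess rather than something stated here, that a read "matches" when its original name begins with the assigned sample ID. The function name mismatches is hypothetical.

```python
def mismatches(lines):
    """Yield header lines whose assigned sample ID disagrees with the original.

    Assumed rule: the original read name should start with the sample ID
    that split_libraries.py assigned.
    """
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if len(fields) < 2:
            continue  # no original name to compare against
        assigned = fields[0].rsplit("_", 1)[0]  # drop the read counter
        original = fields[1]
        if not original.startswith(assigned):
            yield line.rstrip("\n")
```

Counting the mismatches is then a matter of taking the length of list(mismatches(lines)).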
