Manage a ribo-seq experiment from raw data
Download with fastq-dump
Clip adapter
gunzip -c data.fastq.gz | fastq_quality_filter -Q33 -q N -v -o data.fastq.filter_Nv
Adapter: CTGTAGGCACCATCAAT
fastx_clipper -Q33 -a CTGTAGGCACCATCAAT -l 25 -c -n -v -i data.fastq -o data.fastq.clipper
- -Q33: read qualities as Phred+33, which avoids the unrecognized character error
- -c: discard sequences missing the adapter
- -n: keep sequences with unknown nucleotides such as N
- -l 25: discard sequences shorter than 25 nt
clipper:CTGTAGGCACCATCAAT trimmer:2 minlength:25
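The effect of the -c and -l 25 options can be illustrated with a minimal Python sketch. It is not how fastx_clipper is implemented: it only looks for an exact, full-length copy of the adapter, and the function name clip_read is hypothetical.

```python
ADAPTER = "CTGTAGGCACCATCAAT"  # adapter sequence used in the clipping step
MIN_LENGTH = 25                # same threshold as -l 25


def clip_read(seq, adapter=ADAPTER, min_length=MIN_LENGTH):
    """Return the read without the adapter and everything after it,
    or None when the read would be discarded."""
    pos = seq.find(adapter)  # simplification: exact match only
    if pos == -1:
        return None  # like -c: the adapter is missing
    clipped = seq[:pos]
    if len(clipped) < min_length:
        return None  # like -l 25: too short once clipped
    return clipped  # reads containing N are kept, like -n


if __name__ == "__main__":
    print(clip_read("A" * 30 + ADAPTER + "GG"))
```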
Remove rRNA/tRNA
gunzip -c {clip.gz} |
hisat2 \
-x ../../../assembly/Hsapiens/hg19/tRNA_rRNA \
--phred33 \
-U - \
--un-gz noRNA.fastq.gz \
-p 8 \
-t \
-S RNA.sam
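If this step has to be launched from a script, the command can be assembled as an argument list. The sketch below is only one possible way to do it: the function names are hypothetical, and the "-U -" argument (read unpaired reads from standard input) is an assumption added so that hisat2 can receive the piped reads.

```python
import subprocess


def build_hisat2_command(index, unaligned_gz, sam_path, threads=8):
    """Return the hisat2 argument list for the rRNA/tRNA filtering step."""
    return [
        "hisat2",
        "-x", index,              # basename of the tRNA/rRNA index
        "--phred33",              # qualities are Phred+33 encoded
        "-U", "-",                # assumption: unpaired reads come from stdin
        "--un-gz", unaligned_gz,  # reads that do not align, gzip-compressed
        "-p", str(threads),       # number of threads
        "-t",                     # print timing information
        "-S", sam_path,           # alignments against the tRNA/rRNA index
    ]


def filter_rna(clipped_gz, index, unaligned_gz="noRNA.fastq.gz", sam_path="RNA.sam"):
    """Pipe the decompressed reads into hisat2 (needs gunzip and hisat2 on PATH)."""
    gunzip = subprocess.Popen(["gunzip", "-c", clipped_gz], stdout=subprocess.PIPE)
    result = subprocess.run(
        build_hisat2_command(index, unaligned_gz, sam_path),
        stdin=gunzip.stdout,
    )
    gunzip.stdout.close()
    gunzip.wait()
    return result.returncode
```

The reads written to noRNA.fastq.gz are the ones that did not match tRNA or rRNA, so they are the input of the next mapping step.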
Map to genome/transcriptome
- Find the GSE
- Download the experiment description from the GSE with GSE.py
- Choose the experiment(s) from the GSE.sample file
- Download SRA files with SRA.py
usage: GSE.py -g GSEnumber
option: -w workdir
output:
+GSEnumber.xml
+GSEnumber.sample
A GSE number is a GEO database identifier. It is associated with the list of samples
used in a study. This number can be found in the NCBI GEO database
and is often cited in the publication.
From a GSE number, the script downloads the xml file containing the description of
the experiment. After parsing the xml file GSEnumber.xml, it extracts the name of
each sample, the species, and the GSM and SRX numbers. These numbers are required
to download the associated SRA files.
At the end, the user obtains a GSEnumber.xml file and a GSEnumber.sample file.
- GSEnumber.xml is the raw file of the experiment. It contains all the details, protocols, etc.
- GSEnumber.sample summarizes the samples by giving the GSM, the sample names, the SRR and the genome assembly. This file is required to download SRA data.
Output files are created in the current directory unless the user specifies the -w option.
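The command-line behaviour described above can be sketched with argparse. This is not the code of GSE.py: the function names are hypothetical, and GSE00000 in the usage line is a placeholder, not a real series.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the -g and -w options described in the usage line."""
    parser = argparse.ArgumentParser(prog="GSE.py")
    parser.add_argument("-g", dest="gse", required=True, help="GSE number")
    parser.add_argument("-w", dest="workdir", default=os.getcwd(),
                        help="output directory (default: current directory)")
    return parser.parse_args(argv)


def output_paths(gse, workdir):
    """Return the paths of the two files produced for one GSE number."""
    return (
        os.path.join(workdir, gse + ".xml"),
        os.path.join(workdir, gse + ".sample"),
    )


if __name__ == "__main__":
    args = parse_args(["-g", "GSE00000"])  # placeholder GSE number
    print(output_paths(args.gse, args.workdir))
```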
usage: SRA.py -g GSE.sample
options: -w workdir
output:
+workdir directory if the user uses -w
+GSM directories containing SRR.fastq files (inside workdir if -w is used)
+GSE.sra
By default, workdir is the current directory.
This script needs SRA-Toolkit and EUTILS installed to work.
The GSE.sample file is generated by GSE.py and contains, in the first column of each line, the GSM number required to
download SRR files. SRA.py skips lines starting with #, so users can select only the samples they want.
Users can also provide their own file with a GSM or SRX number in the first column of each line.
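The selection rule (skip lines starting with #, keep the first column) can be sketched as follows. The function name is hypothetical, and the columns are assumed to be separated by whitespace, which the description above does not state.

```python
def read_accessions(lines):
    """Return the first column of every non-empty, uncommented line."""
    accessions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out: sample not selected
        accessions.append(line.split()[0])  # assumption: whitespace-separated
    return accessions


if __name__ == "__main__":
    # Placeholder accessions, for illustration only.
    example = ["GSM0000001 sampleA", "#GSM0000002 sampleB", "SRX0000003"]
    print(read_accessions(example))
```

With a real file, the function can be called on the open file object, for example read_accessions(open("GSE.sample")).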
From each GSM or SRX number, the script retrieves the SRR data and downloads the files. For each GSM or SRX,
a new directory is created and the SRR files are downloaded into it. GSE.sra lists the GSM directories and
the SRR files they contain.
For each uncommented line (each GSM or SRX number), a directory named after the GSM or SRX number is created and contains
the SRR files. A GSE.sra file is created where the script is launched and contains the GSM directories and SRR file names.
+GSM directories with SRR.fastq
+GSE.sra, the list of GSM directories and SRR files
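The resulting layout can be sketched without any download. The sketch below takes an already resolved mapping from GSM or SRX number to SRR names, creates one directory per sample and writes the index file. The function name, the tab-separated format of GSE.sra and the choice to write it inside workdir are assumptions, not the actual behaviour of SRA.py.

```python
import os


def write_layout(runs_by_sample, workdir="."):
    """Create one directory per sample and write the GSE.sra index."""
    index_path = os.path.join(workdir, "GSE.sra")
    with open(index_path, "w") as index:
        for sample, runs in runs_by_sample.items():
            # One directory per GSM or SRX number; SRR files would go inside.
            os.makedirs(os.path.join(workdir, sample), exist_ok=True)
            # Assumed format: sample directory, then its SRR names, tab-separated.
            index.write("\t".join([sample] + list(runs)) + "\n")
    return index_path
```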
SRA-Toolkit: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Add the bin directory path to the $PATH or edit the path in the dependency.config
To avoid caching downloads in the dedicated ncbi directory, launch from the bin directory:
./vdb-config -i
- Unselect Enable Local File Caching
- Save
- Exit
EUTILS: http://www.ncbi.nlm.nih.gov/books/NBK179288/
Simply execute the following code where you want to install EUTILS:
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login;
$ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh
Add the bin directory path to the $PATH or edit the path in the dependency.config
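Whether both dependencies are reachable through $PATH can be checked with a short sketch. It assumes that fastq-dump (SRA-Toolkit) and esearch (EUTILS) are the executables to look for. The function name is hypothetical, and dependency.config is not read here.

```python
import shutil


def missing_tools(tools=("fastq-dump", "esearch")):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All dependencies found")
```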
usage: ExtractSeq.py -g genbank
option: -r output
output: +genbank.tRNA_rRNA or output
GenBank is a file format. GenBank files contain annotations for a set of sequences or a whole genome. You can download them from the NCBI website. Make sure to take the full version.
With the -r option, you ask the script to extract all tRNA and rRNA sequences into a FASTA file.
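The filtering step can be sketched on features that have already been parsed. The sketch does not read a GenBank file and is not the code of ExtractSeq.py. The feature tuples, the 0-based end-exclusive coordinates and the FASTA header format are all assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def features_to_fasta(genome, features, wanted=("tRNA", "rRNA")):
    """Build FASTA text for the features whose type is in `wanted`.

    Each feature is a tuple (type, start, end, strand, name) with
    0-based, end-exclusive coordinates (an assumption of this sketch).
    """
    records = []
    for ftype, start, end, strand, name in features:
        if ftype not in wanted:
            continue  # skip CDS, gene, and other feature types
        seq = genome[start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        records.append(">%s|%s\n%s\n" % (ftype, name, seq))
    return "".join(records)


if __name__ == "__main__":
    toy_genome = "AAACCCGGGTTT"
    toy_features = [
        ("tRNA", 0, 3, "+", "t1"),
        ("CDS", 3, 6, "+", "c1"),
        ("rRNA", 6, 9, "-", "r1"),
    ]
    print(features_to_fasta(toy_genome, toy_features), end="")
```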