broadinstitute/PatHCap_PL
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This pipeline is used to generate read counts and metrics from fastq reads. The code was optimized for infrastructure of Broad Institute. However, I think it can be adapted to other platform if required. The pipeline can be started by calling the toplevel Python script called PipelineMain.py. From the help message: usage: PipelineMain.py [-h] --config_file CONFIG_FILE [--key_id KEY_ID] [--key_path KEY_PATH] [--project_ids PROJECT_IDS] [--seq_id SEQ_ID] [--raw_seq_path RAW_SEQ_PATH] [--temp_path TEMP_PATH] [--bam_path BAM_PATH] [--results_path RESULTS_PATH] [--do_patho] [--do_host] [--remove_splitted] [--gzip_merged] [--host_aligner HOST_ALIGNER] [--read_counter READ_COUNTER] [--no_p7] [--use_p5] [--use_lane] [--use_seq_path] [--no_qsub] [--no_split] [--no_merge] [--no_align] [--no_count] [--no_umi_count] [--no_metrics] [--no_umi_metrics] [--no_replace_refname] [--no_expand] [--no_bc_split] [--no_ref] [--Suffix_s1 SUFFIX_S1] [--Suffix_s2 SUFFIX_S2] [--Suffix_ne SUFFIX_NE] [--ADD5 ADD5] [--ADD3 ADD3] [--MOC_id MOC_ID] [--trim_rs_5p TRIM_RS_5P] [--trim_rs_3p TRIM_RS_3P] [--keep_rs_5p KEEP_RS_5P] [--keep_rs_3p KEEP_RS_3P] [--trim_r1_5p TRIM_R1_5P] [--trim_r1_3p TRIM_R1_3P] [--trim_r2_5p TRIM_R2_5P] [--trim_r2_3p TRIM_R2_3P] [--keep_r1_5p KEEP_R1_5P] [--keep_r1_3p KEEP_R1_3P] [--keep_r2_5p KEEP_R2_5P] [--keep_r2_3p KEEP_R2_3P] [--MOC_id_ref MOC_ID_REF] [--AllSeq_con_A ALLSEQ_CON_A] [--AllSeq_trim_len ALLSEQ_TRIM_LEN] [--no_login_name] [--min_resource] [--count_strand_rev COUNT_STRAND_REV] [--do_bestacc] [--use_sample_id] [--bwa_mem] [--rm_rts_dup] [--do_rerun] [--paired_only_patho] [--paired_only_host] [--ucore_time UCORE_TIME] Process the options. optional arguments: -h, --help show this help message and exit --config_file CONFIG_FILE, -c CONFIG_FILE Path to main config file --key_id KEY_ID Key file would be "key_id"_key.txt --key_path KEY_PATH Key file path (absolute) --project_ids PROJECT_IDS Project ids --seq_id SEQ_ID Sequencing id used to make raw_seq_path --raw_seq_path RAW_SEQ_PATH Directory for raw sequence files (absolute) --temp_path TEMP_PATH Will contain the temporary results --bam_path BAM_PATH Will contain the aligned sorted bam files --results_path RESULTS_PATH Will contain the path to the results --do_patho Do the patho --do_host Do the host --remove_splitted Remove the splitted files --gzip_merged Compress the merged files to gzip format --host_aligner HOST_ALIGNER Aligner for host (BBMap) --read_counter READ_COUNTER Tool for counting reads (default: Counter written by JLivny) --no_p7 Use if P7 index is not used. --use_p5 Use if P5 index is used. --use_lane Use if lane specific merging is required. --use_seq_path Use if direct mapping of sample to raw seq path is required. --no_qsub Does not submit qsub jobs. --no_split Does not split the fastq files. --no_merge Does not merge the split fastq files. --no_align Does not align. --no_count Does not count. --no_umi_count Does not collapse/compute UMI. --no_metrics No metrics generation. --no_umi_metrics Does not compute UMI metrics. --no_replace_refname Does not replace reference name in fna --no_expand Does not expand p7 and barcode entries in the key file if required. --no_bc_split Does not run the barcode splitter, instead create softlinks of the raw seq files to the split directory. --no_ref Does not generate patho ref. --Suffix_s1 SUFFIX_S1 Update the value of Suffix_s1 --Suffix_s2 SUFFIX_S2 Update the value of Suffix_s2 --Suffix_ne SUFFIX_NE Update the value of Suffix_ne --ADD5 ADD5 ADD5 for gff parser --ADD3 ADD3 ADD3 for gff parser --MOC_id MOC_ID Provide MOC string for adding MOC hierarchy --trim_rs_5p TRIM_RS_5P 5p trim count for read-single --trim_rs_3p TRIM_RS_3P 3p trim count for read-single --keep_rs_5p KEEP_RS_5P 5p keep count for read-single --keep_rs_3p KEEP_RS_3P 3p keep count for read-single --trim_r1_5p TRIM_R1_5P 5p trim count for read1 --trim_r1_3p TRIM_R1_3P 3p trim count for read1 --trim_r2_5p TRIM_R2_5P 5p trim count for read2 --trim_r2_3p TRIM_R2_3P 3p trim count for read2 --keep_r1_5p KEEP_R1_5P 5p keep count for read1 --keep_r1_3p KEEP_R1_3P 3p keep count for read1 --keep_r2_5p KEEP_R2_5P 5p keep count for read2 --keep_r2_3p KEEP_R2_3P 3p keep count for read2 --MOC_id_ref MOC_ID_REF Adds MOC_id_ref to Bacterial_Ref_path --AllSeq_con_A ALLSEQ_CON_A Minimum number of consecutive A required to trim AllSeq read. --AllSeq_trim_len ALLSEQ_TRIM_LEN Minimum length of a read required to keep it after AllSeq trim --no_login_name Generate results in a username specific directory --min_resource Request for minimum resource during unicore allocation --count_strand_rev COUNT_STRAND_REV Reverse the counting mechanism by making it "forward" --do_bestacc Run bestacc after splitting and merging --use_sample_id Use sample id for sample id mapping --bwa_mem Use bwa mem instead of bwa backtrack --rm_rts_dup Remove PCR duplicates from RNA-tagseq --do_rerun Rerun pipeline after a previous incomplete run --paired_only_patho Align/count only the reads when both ends are mapped on pathogen side --paired_only_host Align/count only the reads when both ends are mapped on host side --ucore_time UCORE_TIME Timelimit of unicore for this run Here are brief descriptions of other directories of the pipeline. Directory "other" --------------------------------- The "other" folder contains several components of the pipelines. BarcodeSplitter subfolder ........................... (1) BarcodeSplitter: This remove the inline barcodes from the reads (both RNATagSeq and AllSeq protocols) and bin them to several files. Each of the binned files contains reads with only same inline barcodes. To execute barcode splitter for RNA-tagseq, bc_splitter_rts -d <dict_file> --file1 <file1> --file2 <file2> -p <prefix_str> -o <outdir> where, dict_file - the dictionary file file1 - first fastq file (first of the pair, read 1) file2 - second fastq file (second of the pair, read 2) prefix - prefix string used to prepend to all output files output - output directory The help message provides optional parameters of the executable. -h [ --help ] produce help message -d [ --dict-file ] arg Dictionary file --file1 arg First file --file2 arg Second file -p [ --prefix ] arg Prefix string -o [ --outdir ] arg Output directory -k [ --keep_last ] Optional/Do use last base of barcode (RNATag-Seq) --ha Optional/Highlight all barcodes --bc-all arg Optional/File of all barcodes --bc-used arg Optional/File of used barcodes, one number per line -m [ --mismatch ] arg (=1) Optional/Maximum allowed mismatches. --allowed-mb arg (=2048) Optional/Estimated memory requirement in MB. (2) To execute barcode splitter for the AllSeq protocol the executable bc_splitter is similar as that of bc_splitter_rts in terms of mandatory parameters. However, the optional parameters are different due to starting location and length of UMI and barcode. From help message of bc_splitter, ./bc_splitter -h -h [ --help ] produce help message -d [ --dict-file ] arg Dictionary file --file1 arg First file --file2 arg Second file -p [ --prefix ] arg Prefix string -o [ --outdir ] arg Output directory -t [ --type ] arg (=allseq) Optional allseq/rnatagseq --ha Optional/Highlight all barcodes --bc-all arg Optional/File of all barcodes --bc-used arg Optional/File of used barcodes, one number per line -m [ --mismatch ] arg (=1) Optional/Maximum allowed mismatches. --bc-start arg (=6) Optional/Barcode start position --bc-size arg (=6) Optional/Barcode size --umi-start arg (=0) Optional/Umi start position --umi-size arg (=6) Optional/Umi size --allowed-mb arg (=2048) Optional/Estimated memory requirement in MB. Usage: bc_splitter -d <dict_file> --file1 <file1> --file2 <file2> -p <prefix_str> -o <outdir> (3) index_splitter is used to remove P7 index barcode from fastq reads. This program can be run by calling a python script called IndexSplitterMain.py, IndexSplitterMain.py -d <dict_file> -i <input_dir> -o <output_dir> where, dict_file - dictionary file containing P7 indices input_dir - input fastq files output_dir - results of the run -------------------------------------- Sub-directory "barcodes" -------------------------------------- Contains barcodes for the pipeline. Nextera_i5_indices.txt - list of i5 barcodes used for dual index splitter Nextera_i7_indices.txt - list of i7 barcodes used for dual index splitter allseq_bcs.txt - list of AllSeq inline barcode p7_barcodes.txt - list of P7 barcodes used for index splitter rts_bcs.txt - inline barcodes for RNA-tagseq Sub-directory "configs" --------------------------- A sample configuration file used to set environment variables before the pipeline is submitted. Sub-directory "dual_indexed_bc" ------------------------------- Dual index barcode splitter used to remove both i7 and i5 barcodes from the reads. This can be executed by calling the top level python script such as: dual_indexed_bc/IndexSplitterMain.py -i <input dir> -o <output dir> --dict1 <dictionary file i7> --dict2 <dictionary file i5> From the help message, usage: IndexSplitterMain.py [-h] --indir INDIR --outdir OUTDIR --dict1_file DICT1FILE --dict2_file DICT2FILE [--suffix_s1 SUFFIX_S1] [--suffix_s2 SUFFIX_S2] [-m MEMORY] [--no_qsub] [--total_run TOTAL_RUN] [--run_time_ind RUN_TIME_IND] Process inputs for index splitter optional arguments: -h, --help show this help message and exit --indir INDIR, -i INDIR Input directory (default: None) --outdir OUTDIR, -o OUTDIR Output directory (default: None) --dict1_file DICT1FILE index 1 dict file (default: None) --dict2_file DICT2FILE index 2 dict file (default: None) --suffix_s1 SUFFIX_S1 --suffix_s2 SUFFIX_S2 -m MEMORY, --memory MEMORY Amount of memory to allocate (default: 8000) --no_qsub Does not submit qsub jobs. (default: True) --total_run TOTAL_RUN Run for top read_count number of reads (if -1, run for all reads) (default: -1) --run_time_ind RUN_TIME_IND Run time for individual jobs (default: 24 hours) (default: 24) Sub-directory "read_trimmer" ------------------------------- It contains a script called "read_trimmer" used to trim fastq reads on 5-prime and 3-prime sides. Usage: combine_lanes -i <infile> -o <outfile> --l <logfile> (optional) --trim_5p <number of bases to trim on 5p side> --trim_3p <number of bases to trim on 3p side> --keep_5p <number of bases to keep from 5p side> --keep_3p <number of bases to keep from 3p side> From the help message, read_trimmer -h -h [ --help ] generate help message -i [ --infile ] arg path to the input file -o [ --outfile ] arg path to the output file - [ --logfile ] arg Path to the logfile --trim_5p arg (=0) number of bases to trim from 5p --trim_3p arg (=0) number of bases to trim from 3p --keep_5p arg (=-1) number of bases to keep from the 5p --keep_3p arg (=-1) number of bases to keep from the 3p
About
No description, website, or topics provided.
Resources
License
Code of conduct
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published