Software:
- bcl2fastq
- FastQC
- Trim Galore
- Bismark
R packages:
- ggplot2
- reshape
- grid
- xtable
Important note: Most Python scripts have a help page. If you are unsure how to use a script, or need information about its default and additional arguments, check the help page. For example:
$ analysis_info.py -h
1. Go to your home folder. For example:
$ cd /mnt/research/mjg225/
2. Create a directory where you will do the analysis and go to that folder. For example:
$ mkdir -p methylSeq/pipelineTest/heroG/aug2016
$ cd methylSeq/pipelineTest/heroG/aug2016
3. Create a bin/ folder. This folder will have the scripts you need to run the analysis:
$ mkdir bin/
4. Go to https://github.com/CGSbioinfo/MethylSeq. Click the green "Clone or download" button at the top right and download the repository as a ZIP file. The scripts will be saved to the Downloads/ folder in your home directory (for example: /home/mjg225/Downloads/). Make sure the download exists by typing the following (note that /home/mjg225/ should be replaced with your own home directory):
$ ls /home/mjg225/Downloads/MethylSeq-master.zip
This should print the path of the downloaded MethylSeq-master.zip file on your terminal screen.
5. From your current directory, unzip the file (note that /home/mjg225/ should be replaced with your own home directory):
$ unzip /home/mjg225/Downloads/MethylSeq-master
A folder named "MethylSeq-master" should appear. Check by just typing ls.
6. Move the contents of MethylSeq-master/bin to bin/
$ mv MethylSeq-master/bin/* bin/
At this point you should have a directory where you will do the analysis and a bin/ folder in such directory with the analysis scripts copied from the github download.
7. Run the analysis_info.py script
$ python bin/analysis_info.py
A file named 'analysis_info.txt' will be created in the folder. Open it in a text editor (for example, vi) and fill it in. For example:
Working directory = /mnt/cgs-fs2/Bioinfo_pipeline/MethylSeq/test/aug2016/heroG/
run_folder = /mnt/cgs-fs3/Sequencing/NextSeq_Output/160711_NS500125_0298_AHFW35BGXY/
run_samplesheet = /mnt/cgs-fs3/Sequencing/NextSeq_Output/160711_NS500125_0298_AHFW35BGXY/SampleSheet.csv
bcl2fastq_output = /mnt/cgs-fs2/Bioinfo_pipeline/MethylSeq/test/aug2016/heroG/fastq/
readType = pairedEnd
reference_genome =
bismark_params = --bowtie2; --bam; --directional; -N 0; -L 20; --no-mixed; --no-discordant; -D 15; -R 2; --score_min L,0,-0.2;
methyl_extract_params= --bedGraph; --gzip; --merge_non_CpG;
target_regions_bed =
ncores = 8
Explanation of 'analysis_info.txt':
Working directory = <path to directory of the analysis>
run_folder = <path to the run folder>
run_samplesheet = <sample sheet used to generate fastq files. This is created using the Illumina Experiment Manager>
bcl2fastq_output = <path to the desired output of bcl2fastq. The default is fastq/ and the folder will be created automatically>
readType = <either pairedEnd or singleEnd>
reference_genome = <path to the bismark reference genome that will be used at the mapping step>
bismark_params = <parameters passed to the bismark script. Leave the default, and you can add additional parameters separated by ";">
methyl_extract_params= <parameters passed to the methyl extractor script. Leave the default, and you can add additional parameters separated by ";">
target_regions_bed = <path to bedfile>
ncores = <number of cores to use to parallelize the analysis>
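As an illustration of the file layout described above, here is a minimal Python sketch that reads 'key = value' lines and splits the ";"-separated parameter lists. It is not the pipeline's own parser; the function name parse_analysis_info is hypothetical, and the sketch assumes one 'key = value' pair per line.

```python
def parse_analysis_info(text):
    """Parse 'key = value' lines into a dict; split ';'-separated parameter lists."""
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key in ("bismark_params", "methyl_extract_params"):
            # Parameters are separated by ';'; drop the empty piece left by a trailing ';'
            info[key] = [p.strip() for p in value.split(";") if p.strip()]
        else:
            info[key] = value
    return info


# A shortened version of the example file shown earlier
example = """Working directory = /mnt/cgs-fs2/Bioinfo_pipeline/MethylSeq/test/aug2016/heroG/
readType = pairedEnd
reference_genome =
methyl_extract_params= --bedGraph; --gzip; --merge_non_CpG;
ncores = 8
"""
info = parse_analysis_info(example)
print(info["methyl_extract_params"])
print(info["ncores"])
```

All values are kept as strings, so a field such as ncores would still need converting with int() before use. An unfilled field such as reference_genome comes back as an empty string, which makes it easy to check for missing entries before starting the analysis.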
All the analysis scripts are wrapped in the main python script runMethylationAnalysis.py. This main script takes an argument --run, which is used to indicate which section of the main script to run. The following commands are used to run the analysis with the main script:
1. Using the command --run step1_prepare_analysis, the main script will read the analysis_info_file and run bcl2fastq, create a sample names file, and organize the working directory.
$ python bin/runMethylationAnalysis.py --run step1_prepare_analysis
Once it finishes, there will be a folder named rawReads with fastq files sorted according to sample names. There will also be a sample_names.txt file with a list of sample names, one per line.
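Since later steps depend on sample_names.txt, it can be worth checking it before moving on. The following Python sketch reads a one-name-per-line list and rejects duplicates. The function name read_sample_names and the sample names are hypothetical, and this check is not part of the pipeline itself.

```python
def read_sample_names(lines):
    """Return stripped, non-empty sample names; raise ValueError on duplicates."""
    names = [line.strip() for line in lines if line.strip()]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError("duplicated sample names: " + ", ".join(duplicates))
    return names


# With a real file:
#     with open("sample_names.txt") as fh:
#         names = read_sample_names(fh)
names = read_sample_names(["sampleA\n", "sampleB\n", "\n"])
print(names)
```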
2. Using the command --run step2_qc_and_trimming, the main script will read the analysis_info_file, the sample_names file and run fastqc, create a folder Report/figure/rawQC with plots, create a folder Report/figure/data with tables, run trim galore, run fastqc on the trimmed reads, and create a folder Report/figure/trimmedQC with plots:
$ python bin/runMethylationAnalysis.py --run step2_qc_and_trimming
3. Using the command --run step3_mapping_and_deduplication, the main script will read the analysis_info_file and the sample_names file, and run the bismark and deduplication scripts. The output includes the bam files (original and deduplicated) and the log files of each sample, and will be saved in a folder alignedReads/ (created automatically).
$ python bin/runMethylationAnalysis.py --run step3_mapping_and_deduplication
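The three commands above can also be chained from a small driver. The sketch below is one possible way to do it in Python. The helper names build_command and run_steps are hypothetical, and it assumes that runMethylationAnalysis.py exits with a non-zero status when a step fails, which the source does not state.

```python
import subprocess

# Step names as passed to --run in the commands above
STEPS = [
    "step1_prepare_analysis",
    "step2_qc_and_trimming",
    "step3_mapping_and_deduplication",
]


def build_command(step, script="bin/runMethylationAnalysis.py"):
    """Return the argument list for one step of the main script."""
    return ["python", script, "--run", step]


def run_steps(steps=STEPS, runner=subprocess.check_call):
    """Run each step in order.

    With the default runner, subprocess.check_call raises CalledProcessError
    on a non-zero exit status, so later steps are not started after a failure.
    """
    for step in steps:
        runner(build_command(step))


# To launch the pipeline from the analysis directory, call:
#     run_steps()
print(build_command(STEPS[0]))
```

Passing a different runner (for example, print) gives a dry run that only shows the commands.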
4. Bismark outputs a bam file with the mapped reads and a report about the alignment.
The following command uses an executable Rscript which summarises mapping QC metrics.
Arguments:
Rscript bin/mappingQC.R <input folder containing bam files> <sample names file> <suffix pattern of report files output of bismark> <outdir>
Example:
$ Rscript bin/mappingQC.R /mnt/research/jb393/MethylSeq_Pilot/Aligned_data/Raw_bam/ sample_names_all.txt _bismark_bt2_PE_report.txt Report/figure/mappingQC/
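Before running mappingQC.R it may help to confirm that every sample has a report file with the given suffix. The Python sketch below matches file names against sample names. It assumes that report file names start with the sample name and end with the suffix, which is an assumption about naming rather than something taken from mappingQC.R. The function name and the file names in the example are hypothetical.

```python
def match_reports(sample_names, filenames, suffix):
    """Map each sample name to the files that start with it and end with the suffix."""
    matches = {}
    for sample in sample_names:
        matches[sample] = sorted(
            f for f in filenames if f.startswith(sample) and f.endswith(suffix)
        )
    return matches


# Hypothetical directory listing
files = [
    "sampleA_R1_bismark_bt2_PE_report.txt",
    "sampleA_R1_bismark_bt2_pe.bam",
    "sampleB_R1_bismark_bt2_PE_report.txt",
]
reports = match_reports(["sampleA", "sampleB", "sampleC"], files, "_bismark_bt2_PE_report.txt")
missing = [s for s, found in reports.items() if not found]
print(missing)
```

Prefix matching is deliberately simple: a sample named sample1 would also match files of sample10, so sample names that are prefixes of one another need a stricter rule.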
5. Run the methylation extraction. You can change the parameters in analysis_info.txt. Consider particularly --no_overlap for paired-end reads.
$ python bin/methylationExtraction.py --out_dir alignedReads/
This creates 3 output files per sample: bedGraph.gz, bismark.cov.gz, and M-bias.txt.
There is also a log file per sample: methylExtract_log_sampleName.txt.
The M-bias.txt file will be used in the next step to detect any bias in the %Methylation across the reads' positions.
6. Run the mbias plot
Arguments:
Rscript bin/methylExtractQC_mbias_plot.R <input folder containing .M-bias.txt files> <sample names file> <suffix pattern of M-bias.txt output of bismark> <name of output file>
Example:
$ Rscript bin/methylExtractQC_mbias_plot.R alignedReads/ sample_names.txt .M-bias.txt Report/figure/methExtractQC/Mbias_plot.pdf
This creates a plot, saved under the specified output file name, showing the %Methylation across the reads' positions.
Based on this plot, we need to decide whether or not to trim bases from 5p and 3p for each sample.
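The decision is made by eye from the plot, but a rough numerical starting point can be useful. The Python sketch below counts how many positions at each end of a read deviate from the median %Methylation by more than a tolerance. The function name, the tolerance of 5 percentage points, and the example values are all hypothetical, and the result should be checked against the plot rather than used on its own.

```python
from statistics import median


def suggest_clipping(pct_methylation, tolerance=5.0):
    """Count biased positions at each read end.

    pct_methylation: %Methylation values ordered by read position.
    Returns (clip_5prime, clip_3prime): the number of consecutive positions
    at each end that differ from the median by more than `tolerance`.
    """
    if not pct_methylation:
        return 0, 0
    centre = median(pct_methylation)
    biased = [abs(v - centre) > tolerance for v in pct_methylation]

    clip5 = 0
    for flag in biased:
        if not flag:
            break
        clip5 += 1

    # Degenerate case: every position is flagged, so no sensible suggestion
    if clip5 == len(biased):
        return 0, 0

    clip3 = 0
    for flag in reversed(biased):
        if not flag:
            break
        clip3 += 1
    return clip5, clip3


# Made-up values for a 10-base read: low at the start, a dip at the last base
values = [40.0, 62.0, 75.0, 76.0, 74.5, 75.5, 75.0, 76.0, 74.0, 60.0]
print(suggest_clipping(values))
```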
7. Create and fill a file mbias_remove_bases.txt with information about which bases to clip from reads.
Arguments:
python bin/remove_bases_file_info.py --outfile <name of output txt file>
Example:
$ python bin/remove_bases_file_info.py --outfile mbias_remove_bases.txt
A file with the specified name will be created. Open the file and fill it with the following information (one sample per line):
sample: <sample name>
5R1: <number of bases to clip from the 5 prime end of read 1 (forward read)>
3R1: <number of bases to clip from the 3 prime end of read 1 (forward read)>
5R2: <number of bases to clip from the 5 prime end of read 2 (reverse read)>
3R2: <number of bases to clip from the 3 prime end of read 2 (reverse read)>
8. Run the methyl extraction again, removing biased bases from reads with information in mbias_remove_bases.txt.
Arguments:
python bin/methylationExtraction_removeBases.py --in_dir <path to folder containing bam files from bismark> --sample_names_file <file with sample names> --out_dir <output directory for the methylation extracted data> --remove_bases_file <file with bases to be clipped from reads (created in step 7)>
Example:
$ python bin/methylationExtraction_removeBases.py --in_dir alignedReads/ --sample_names_file sample_names_test.txt --out_dir alignedReads/clipped/ --remove_bases_file mbias_remove_bases.txt
Confirm that the clipping of bases worked:
$ Rscript bin/methylExtractQC_mbias_plot.R alignedReads/clipped/ sample_names.txt .M-bias.txt Report/figure/methExtractQC/Mbias_plot_clipped.pdf
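For reference, the Bismark methylation extractor has four options for ignoring bases at read ends: --ignore, --ignore_3prime, --ignore_r2 and --ignore_3prime_r2. The Python sketch below shows how the four fields of the remove-bases file could map onto them. Whether methylationExtraction_removeBases.py builds its command this way is an assumption, and the function name clipping_options is hypothetical.

```python
# Field names as used in the remove-bases file; flags are Bismark extractor options
FLAG_FOR_FIELD = {
    "5R1": "--ignore",
    "3R1": "--ignore_3prime",
    "5R2": "--ignore_r2",
    "3R2": "--ignore_3prime_r2",
}


def clipping_options(clip, paired_end=True):
    """Build extractor options from a dict such as {'5R1': 2, '3R1': 0, ...}.

    Fields that are missing or set to 0 add no option; read 2 fields are
    skipped for single-end data.
    """
    options = []
    for field in ("5R1", "3R1", "5R2", "3R2"):
        if field.endswith("R2") and not paired_end:
            continue
        n = int(clip.get(field, 0))
        if n > 0:
            options += [FLAG_FOR_FIELD[field], str(n)]
    return options


# Made-up clipping values for one sample
print(clipping_options({"5R1": 2, "3R1": 0, "5R2": 3, "3R2": 1}))
```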
Script: bin/coverage_methExtractedData.R
Arguments:
/usr/bin/Rscript bin/coverage_methExtractedData.R <sample_details.txt file> <folder containing bedgraph files (full path)> <output plot name: coverage whole genome> <output plot name: coverage in target regions> <bed file with annotation of target regions>
Example:
$ /usr/bin/Rscript bin/coverage_methExtractedData.R /mnt/research/jb393/MethylSeq_Pilot/Aligned_data/Raw_bam/methylationExtraction_clean/ Report/figure/methExtractQC/coverage_rawData_wg.pdf Report/figure/methExtractQC/coverage_rawData_tr.pdf /mnt/research/jb393/MethylSeq_Pilot/Kit_annotation/S03770311_Covered_GR38.bed
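To make the idea of coverage concrete, the Python sketch below computes per-site coverage from lines in the Bismark coverage format (the bismark.cov.gz files from step 5), where the tab-separated columns are chromosome, start, end, methylation percentage, count methylated and count unmethylated. It is not a reimplementation of coverage_methExtractedData.R. The function names, the threshold of 10 reads, and the example lines are hypothetical.

```python
def site_coverage(cov_line):
    """Return (chromosome, position, coverage) for one Bismark coverage line."""
    chrom, start, _end, _pct, meth, unmeth = cov_line.rstrip("\n").split("\t")
    # Coverage at a site is the number of methylated plus unmethylated calls
    return chrom, int(start), int(meth) + int(unmeth)


def fraction_covered(cov_lines, min_coverage=10):
    """Fraction of reported sites with coverage >= min_coverage."""
    depths = [site_coverage(line)[2] for line in cov_lines if line.strip()]
    if not depths:
        return 0.0
    return sum(d >= min_coverage for d in depths) / len(depths)


# Made-up lines in the coverage format
lines = [
    "chr1\t10469\t10469\t80.0\t8\t2\n",
    "chr1\t10471\t10471\t50.0\t2\t2\n",
]
print(site_coverage(lines[0]))
print(fraction_covered(lines))
```

For a gzipped file, the lines can be read with gzip.open(path, "rt") from the standard library.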