Analyzing & Tagging GoNL Deletions

This repository contains the code/scripts for analyzing and tagging deletions for the Genome of the Netherlands (GoNL). The code serves two purposes:

exploring the extent to which linkage disequilibrium (correlation) between GoNL deletions and GWAS SNPs occurs, and
finding appropriate tag SNPs for the deletions found.

The code is written in both C and Python.

Installation

Dependencies

Compiling the C code requires the following to be installed:

The GNU scientific library (GSL - see http://www.gnu.org/software/gsl)
CMake (see http://www.cmake.org)

The Python code depends on the following packages:

PyVCF (see https://github.com/jamescasbon/PyVCF) for working with VCF files
snakemake (see https://bitbucket.org/johanneskoester/snakemake/wiki/Home) for running the pipeline; see the file Snakefile in the main directory

Installation instructions

In order to compile the C code in the folder src/, type in the main directory:

	$ cmake . 
	$ make
	$ make install

The executables gonl_create_pairs, gonl_imputation and gonl_tag_deletions will be placed in the bin/ folder together with the Python scripts.

Directory structure

The repository consists of the following directories:

back-up/ - contains some older versions of the project.
bin/ - contains the Python scripts and the compiled executables.
data/ - contains a few of the raw data files used in the project. Some data files (e.g., those in the folders gonl-deletions and gonl-snps) are not included in the repository due to their size. Given the original VCF files, all the data can be reproduced.
include/ - contains the header files for the C code.
matlab/ - contains some Matlab code for plotting the results.
results/ - contains various (intermediate) results. The actual results are not in the repository (due to their size) but can be reproduced with the original VCF files and the present scripts/code.
src/ - contains the C code.

File formats

This section describes the three (novel) file formats used in this project: .haplotypes, .pairs and .tagsnps.

.haplotypes

This file format is used to compress the relevant haplotype data from a VCF file; see bin/extract-haplotypes.py. Each variant is stored on two lines. The first line contains the data on the variant (type, reference and alternative allele, etc.). The second line is a sequence of binary values, one per haplotype, representing the presence of the reference allele (1) or the alternative allele (0).

The first line is structured as follows (space-separated):

<chr> <type> <pos> <length> <ref> <alt> <n> <af>

where <chr> is the chromosome on which the variant resides. <type> is + for an insertion, - for a deletion, * for a SNP and ! for a GWAS SNP. <pos> is the position of the variant as given in the VCF file. The field <length> is only used for indels and gives the length of the insertion or deletion. <ref> and <alt> are the reference and the alternative alleles (for a deletion/insertion, we just use .'s). <n> is the number of haplotypes present (996 in our case, since we consider the parents). <af> is the allele frequency of the reference allele. For a deletion, <af> is the frequency with which the deletion occurs.

The second line is a sequence of <n> binary values without any separators. A 1 denotes that the reference allele (or, for a deletion, the deletion) is present.
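As a rough illustration of the two-line layout described above, the following Python sketch reads such records from an iterable of lines. It is not the code in bin/extract-haplotypes.py. The function names, the dictionary keys and the example record are made up for illustration, and the sketch assumes that every header line carries all eight fields (including a placeholder in <length> for SNPs).

```python
from typing import Dict, Iterable, Iterator

# Field names follow the header line described above.
FIELDS = ("chr", "type", "pos", "length", "ref", "alt", "n", "af")


def read_haplotypes(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per variant from the lines of a .haplotypes file."""
    stripped = (line.strip() for line in lines)
    nonblank = (line for line in stripped if line)  # assumption: blank lines carry no data
    for header in nonblank:
        bits = next(nonblank, None)  # second line of the two-line record
        if bits is None:
            raise ValueError("header line without a haplotype line: " + header)
        values = header.split()
        if len(values) != len(FIELDS):
            raise ValueError("expected 8 header fields, got: " + header)
        record: Dict[str, object] = dict(zip(FIELDS, values))
        record["pos"] = int(values[2])
        record["n"] = int(values[6])
        record["af"] = float(values[7])
        if len(bits) != record["n"] or set(bits) - {"0", "1"}:
            raise ValueError("haplotype line does not match <n> or is not binary")
        record["haplotypes"] = bits
        yield record


def fraction_of_ones(record: Dict[str, object]) -> float:
    """Fraction of haplotypes marked 1; comparable to <af> under the stated reading."""
    bits = str(record["haplotypes"])
    return bits.count("1") / len(bits)


# Made-up record (not real GoNL data): a deletion carried by 1 of 4 haplotypes.
example = ["1 - 10500 120 . . 4 0.25", "0100"]
for record in read_haplotypes(example):
    print(record["chr"], record["type"], record["pos"], fraction_of_ones(record))
```

Comparing fraction_of_ones with the stored <af> is a cheap consistency check when reading a file, provided the encoding is as described in this section.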

.pairs

This file format is used for representing SNP-deletion pairs. The first line always represents a SNP; the lines that follow contain the data on deletions in the vicinity of the SNP. The number of deletions that follow the SNP may vary. The data on the SNP is structured as (space-separated):

<chr> <type> <position> <ref> <alt> <hit-allele> <dist-tss> <region>

where <chr> is the chromosome on which the SNP resides. <type> is either * for a regular SNP or ! for a known GWAS SNP. <position> is the position of the SNP as given in the VCF file. <ref> and <alt> are the reference and the alternative alleles. If the SNP is a GWAS SNP, i.e., <type> is equal to !, then <hit-allele> is the allele associated with the disease. For a regular SNP, this field simply contains a single dot (.). The field <dist-tss> denotes the distance (in bp) to the closest transcription start site. The field <region> is either intronic, exonic or intergenic, depending on the location of the SNP.

The lines that follow the SNP are the deletions. These lines always start with -. A line like this is structured as follows:

- <pos> <length> <R> <p> <A> <B> <C> <D>

where <pos> is the position of the deletion and <length> is its length. The last four columns denote the following 2x2 contingency table:

	reference allele	alternative allele	total
deletion	A	C	A + C
no deletion	B	D	B + D
total	A + B	C + D	A+B+C+D

<R> is the Pearson R and <p> is the p-value found when applying Fisher's two-sided exact test on the table.
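The sketch below shows one way to read a .pairs file in Python and to recompute both statistics from the counts A, B, C and D. It is an illustration under stated assumptions, not the implementation in src/. All function names and the example lines are hypothetical, SNP fields are kept as strings, and the sign convention of R (a positive value when the deletion co-occurs with the reference allele) is an assumption. It needs Python 3.8 or later for math.comb.

```python
from math import comb, sqrt
from typing import Dict, Iterable, List, Tuple

SNP_FIELDS = ("chr", "type", "position", "ref", "alt",
              "hit_allele", "dist_tss", "region")


def read_pairs(lines: Iterable[str]) -> List[Tuple[Dict[str, str], List[dict]]]:
    """Group the lines of a .pairs file into (SNP, [deletions]) tuples."""
    groups: List[Tuple[Dict[str, str], List[dict]]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines carry no data
        if fields[0] == "-":  # deletion line, belongs to the last SNP seen
            if not groups:
                raise ValueError("deletion line before any SNP line")
            groups[-1][1].append({
                "pos": int(fields[1]),
                "length": int(fields[2]),
                "r": float(fields[3]),
                "p": float(fields[4]),
                "table": tuple(int(x) for x in fields[5:9]),  # (A, B, C, D)
            })
        else:  # SNP line; values are kept as strings
            groups.append((dict(zip(SNP_FIELDS, fields)), []))
    return groups


def pearson_r(a: int, b: int, c: int, d: int) -> float:
    """Pearson correlation (phi coefficient) of the 2x2 table; NaN if a margin is zero."""
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return float("nan")
    return (a * d - b * c) / sqrt(denom)


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test: sum over tables no more likely than the observed one."""
    row_del, row_no_del, col_ref = a + c, b + d, a + b
    total = comb(row_del + row_no_del, col_ref)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in cell A with fixed margins.
        return comb(row_del, x) * comb(row_no_del, col_ref - x) / total

    observed = prob(a)
    lowest = max(0, col_ref - row_no_del)
    highest = min(row_del, col_ref)
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-9))  # small tolerance for float ties
    return min(1.0, p)


# Made-up example (not real GoNL data): one GWAS SNP with one nearby deletion.
example = ["1 ! 12345 A G G 1500 intronic",
           "- 12000 300 0.5 0.485714 3 1 1 3"]
for snp, deletions in read_pairs(example):
    for deletion in deletions:
        print(snp["position"], deletion["pos"],
              pearson_r(*deletion["table"]), fisher_two_sided(*deletion["table"]))
```

Recomputing <R> and <p> from the stored counts in this way can serve as a sanity check on a .pairs file. Small differences from the stored values may arise from rounding or from a different sign convention in the original code.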

.tagsnps

Every line represents one deletion-tag SNP pair and consists of 14 columns in total (space-delimited). The first ten columns are, in order:

the chromosome on which the deletion resides.
the deletion's position as given in the VCF file.
the length of the deletion.
the position of the tag SNP with the maximum R-squared.
the reference allele of the SNP.
the alternative allele of the SNP.
the Pearson R-value.
the p-value (Fisher's two-sided exact test).
the conditional probability of having the deletion given the presence of the reference allele.
the conditional probability of having the deletion given the presence of the alternative allele.

Columns 11-14 provide the counts A, B, C and D of the contingency table. See the 2x2 contingency table of the previous section.
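To make the column order concrete, here is a small Python sketch that splits one .tagsnps line into named fields and derives the two conditional probabilities from the counts. The key names, the function names and the example line are invented for illustration. The formulas A / (A + B) and C / (C + D) follow from the layout of the 2x2 table in the previous section and are an assumption about how the original code computes columns 9 and 10.

```python
from typing import Dict, Tuple

# (name, type) for each of the 14 space-delimited columns; names are hypothetical.
COLUMNS = (
    ("chr", str), ("del_pos", int), ("del_length", int), ("snp_pos", int),
    ("ref", str), ("alt", str), ("r", float), ("p", float),
    ("p_del_given_ref", float), ("p_del_given_alt", float),
    ("a", int), ("b", int), ("c", int), ("d", int),
)


def parse_tagsnp_line(line: str) -> Dict[str, object]:
    """Split one .tagsnps line into a dict with typed values."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError("expected 14 columns, got %d" % len(values))
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, values)}


def conditional_probabilities(rec: Dict[str, object]) -> Tuple[float, float]:
    """Return (P(deletion | reference allele), P(deletion | alternative allele))."""
    a, b, c, d = rec["a"], rec["b"], rec["c"], rec["d"]
    given_ref = a / (a + b) if a + b else float("nan")
    given_alt = c / (c + d) if c + d else float("nan")
    return given_ref, given_alt


# Made-up line (not real GoNL data).
rec = parse_tagsnp_line("1 12000 300 12345 A G 0.5 0.485714 0.75 0.25 3 1 1 3")
print(rec["del_pos"], rec["snp_pos"], conditional_probabilities(rec))
```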

Contact

Louis Dijkstra

E-mail: louisdijkstra (at) gmail.com
