This tool filters ANNOVAR-annotated variant files. ANNOVAR is a wonderful tool for annotating genome-wide variants. You can specify filtering rules and apply different disease models while filtering. Author: Hanqing Liu, Zhejiang University, liuhanqing93@gmail.com
The input file (-I) contains the variants. If you only have a VCF file, annotate it with ANNOVAR or wANNOVAR (a web interface to ANNOVAR). See test/test.annovar.txt for an example, which was generated by wANNOVAR, with the suffix ".annovar".
The sample file (-SI) contains the samples' information and should include at least the columns FamilyID, SampleID, Gender, Type, Father, Mother. See test/sample_info.txt for an example.
Filter variants by sample ID (-S); separate multiple samples with spaces. A variant is kept if at least one of the given samples carries it.
Filter variants by gene name (-G); separate multiple genes with spaces.
Filter variants by region (-R), in the format "chromosome start end", e.g. "chr2 1000 100000". Separate multiple regions with spaces.
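The region format can be illustrated with a minimal sketch. It assumes that the bounds are inclusive and that a variant passes when it falls in any of the given regions; neither is confirmed by this README, and `parse_regions` and `in_region` are hypothetical helper names, not functions from main.py.

```python
def parse_regions(tokens):
    """Group a flat token list like ['chr2', '1000', '100000'] into (chrom, start, end) triples."""
    if len(tokens) % 3 != 0:
        raise ValueError("regions must be given as 'chromosome start end' triples")
    return [(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
            for i in range(0, len(tokens), 3)]


def in_region(chrom, pos, region):
    """Return True if chrom:pos lies inside the region (inclusive bounds are an assumption)."""
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and r_start <= pos <= r_end


# Two regions passed the way they would appear after -R on the command line.
regions = parse_regions(["chr1", "215800000", "216200000", "chr2", "1000", "100000"])
hit = any(in_region("chr1", 216000000, r) for r in regions)
```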
Filter variants by the values in one column of the input file (-CF), in the format "column_name logic query_value na_remain", e.g. "SIFT_score '>' 0.5 T". To filter on several columns, specify this flag multiple times.
- column_name: the name of the column as it appears in your input file.
- logic: one of ['>', '<', '=', '!=', '>=', '<='] when the query value is a number (remember to quote these, otherwise the shell interprets '>' and '<' as redirections), or one of [in, !in, include, !include, is, !is] when the query value is a string.
- query_value: the value to compare the column against.
- na_remain: 'T' to keep variants whose value is NA (information not provided in the input file, usually left blank or '.'), or 'F' to exclude them.
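The four fields of a column filter can be read as one small predicate per cell. The sketch below is an illustration under assumptions, not the code in main.py: `column_filter` is a hypothetical name, 'is' / '!is' are assumed to mean exact string (in)equality, and the other string operators are left out because their exact semantics are not described here.

```python
import operator

# Numeric comparison operators accepted by the "logic" field.
NUMERIC_OPS = {
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, "<=": operator.le,
}
NA_VALUES = {"", "."}  # blank or '.' is treated as NA, as described above


def column_filter(cell, logic, query, na_remain):
    """Return True if the variant should be kept according to this one column."""
    cell = cell.strip()
    if cell in NA_VALUES:
        return na_remain == "T"  # NA handling is decided by na_remain alone
    if logic in NUMERIC_OPS:
        return NUMERIC_OPS[logic](float(cell), float(query))
    if logic == "is":            # assumed: exact string match
        return cell == query
    if logic == "!is":
        return cell != query
    raise ValueError("logic not covered by this sketch: " + logic)


keep = column_filter("0.7", ">", "0.5", "T")
```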
If you specify -CF more than once, you should also state the overall logic (-TL) used to combine the column filters. It should be one of ['ALL_TRUE', 'NOT_ALL_TRUE', 'ALL_FALSE', 'NOT_ALL_FALSE', 'N_TRUE', 'N_FALSE'], where N is the number of column filters that evaluate to true/false.
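One way to picture the overall logic is as a function over the list of per-column results for a variant. This is a sketch only: `combine` is a hypothetical name, and it assumes that 'N_TRUE' is written with a concrete number (for example '2_TRUE') and means that exactly N filters are true, which this README does not spell out.

```python
def combine(results, overall):
    """Combine per-column filter results (a list of bools) under the overall logic."""
    n_true = sum(results)
    n_false = len(results) - n_true
    if overall == "ALL_TRUE":
        return n_false == 0
    if overall == "NOT_ALL_TRUE":
        return n_false > 0
    if overall == "ALL_FALSE":
        return n_true == 0
    if overall == "NOT_ALL_FALSE":
        return n_true > 0
    # Assumed spelling for N_TRUE / N_FALSE: '<number>_TRUE' or '<number>_FALSE',
    # interpreted as "exactly N filters are true/false".
    count, _, kind = overall.partition("_")
    if count.isdigit() and kind in ("TRUE", "FALSE"):
        return (n_true if kind == "TRUE" else n_false) == int(count)
    raise ValueError("unknown overall logic: " + overall)


both_pass = combine([True, True], "ALL_TRUE")
```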
Apply Mendelian inheritance models to your input file (-M), using the sample information provided by the sample file. The value should be one or more of [Dom, ResHom, ResComp]; separate multiple models with spaces. For every family:
- 'Dom': Dominant; all patients carry the variant and no healthy individual does.
- 'ResHom': Recessive homozygous; all patients are homozygous for the variant and no healthy individual is.
- 'ResComp': Recessive compound heterozygous; for at least two variants in the same gene, all patients are heterozygous for each of them (but not homozygous, which is covered by ResHom) and no healthy individual is.
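The first two models can be expressed as per-variant checks over one family. The sketch below assumes genotypes are encoded as 'ref', 'het', or 'hom' and that affected status is a boolean per sample; both encodings and the function names are illustrative assumptions, not the representation used by main.py. ResComp is omitted because it needs gene-level grouping of variants.

```python
def fits_dominant(genotypes, affected):
    """Dom: every patient carries the variant and no healthy sample does."""
    return all((genotypes[s] != "ref") == affected[s] for s in genotypes)


def fits_recessive_hom(genotypes, affected):
    """ResHom: every patient is homozygous and no healthy sample is."""
    return all((genotypes[s] == "hom") == affected[s] for s in genotypes)


# A hypothetical trio: affected father and child, healthy mother.
affected = {"father": True, "mother": False, "child": True}
genotypes = {"father": "het", "mother": "ref", "child": "het"}

is_dom = fits_dominant(genotypes, affected)
is_res_hom = fits_recessive_hom(genotypes, affected)
```

Under these assumptions the trio above is consistent with Dom but not with ResHom, since neither patient is homozygous.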
Specify the filename for the filter results (-O).
- Load the input file and the sample file, and write the output to a file.
python main.py -I ./test/test.annovar -SI ./test/sample_info.txt -O filter_result
- Filter by genes, samples, regions.
python main.py -I ./test/test.annovar -SI ./test/sample_info.txt -O filter_result -S 1 2 3 -G USH2A -R chr1 215800000 216200000
- Use two column filters: one keeps variants whose allele frequency in 1000G_ALL (1000 Genomes, all populations) is at most 0.10; the other keeps variants whose Polyphen2 prediction is not B (Benign). The overall logic is ALL_TRUE, which means both filters must evaluate to TRUE.
python main.py -I ./test/test.annovar -SI ./test/sample_info.txt -O filter_result -CF 1000G_ALL '<=' 0.10 F -CF Polyphen2_HDIV_pred '!is' B F -TL ALL_TRUE
- Select samples from the same family and apply the Dominant model to them.
python main.py -I ./test/test.annovar -SI ./test/sample_info.txt -O filter_result -S 1 2 3 -G USH2A -M Dom