Skip to content

geighei/PRScs

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PRS-CS

PRS-CS is a Python based command line tool that infers posterior SNP effect sizes under continuous shrinkage (CS) priors using GWAS summary statistics and an external LD reference panel. Details of the method are described in the article:

T Ge, CY Chen, Y Ni, YCA Feng, JW Smoller. Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors. Nature Communications, 10:1776, 2019.

Recent Version History

Sept 10, 2020: Fixed a bug in strand flip when there are non-ATGC alleles (e.g., indels) in the GWAS summary statistics. Previous versions erroneously remove variants that can be matched across GWAS summary statistics, the reference panel and the validation bim file via strand flip, which reduces the number of SNPs used in prediction and may affect prediction accuracy.

Apr 24, 2020: Accounted for a rare ZeroDivisionError in MCMC sampling.

Apr 20, 2020: Added non-ATGC allele check.

Apr 11, 2020: Added strand flip check.

Mar 25, 2020: Minor changes to make the software Python 2 and 3 compatible.

Oct 20, 2019: Added --seed, which can be used to seed the random number generator using a non-negative integer.

Jun 6, 2019: Fixed a bug in --beta_std. If you explicitly specified --beta_std=False, the output was actually standardized beta (in contrast to desired per-allele beta) and we recommend redoing the analysis. If you left --beta_std as default or used --beta_std=True, the results were not affected.

Getting Started

  • Clone this repository using the following git command:

    git clone https://github.com/getian107/PRScs.git

    Alternatively, download the source files from the github website (https://github.com/getian107/PRScs)

  • Download the LD reference computed using the 1000 Genomes samples, and extract files:

    EUR reference (~4.56G); tar -zxvf ldblk_1kg_eur.tar.gz

    EAS reference (~4.33G); tar -zxvf ldblk_1kg_eas.tar.gz

    AFR reference (~4.44G); tar -zxvf ldblk_1kg_afr.tar.gz

  • PRScs requires Python packages scipy (https://www.scipy.org/) and h5py (https://www.h5py.org/) installed.

  • Once Python and its dependencies have been installed, running

    ./PRScs.py --help or ./PRScs.py -h

    will print a list of command-line options.

Using PRS-CS

python PRScs.py --ref_dir=PATH_TO_REFERENCE --bim_prefix=VALIDATION_BIM_PREFIX --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --out_dir=OUTPUT_DIR [--a=PARAM_A --b=PARAM_B --phi=PARAM_PHI --n_iter=MCMC_ITERATIONS --n_burnin=MCMC_BURNIN --thin=MCMC_THINNING_FACTOR --chrom=CHROM --beta_std=BETA_STD --seed=SEED]

  • PATH_TO_REFERENCE (required): Full path (including folder name) to the directory (ldblk_1kg_eur, ldblk_1kg_eas or ldblk_1kg_afr) that contains information on the LD reference panel (snpinfo_1kg_hm3 and ldblk_1kg_chr*.hdf5). Note that the reference panel should match the ancestry of the GWAS sample (not the testing sample).

  • VALIDATION_BIM_PREFIX (required): Full path and the prefix of the bim file for the validation set.

  • SUM_STATS_FILE (required): Full path and the file name of the GWAS summary statistics. The summary statistics file must have the following format (including the header line):

    SNP          A1   A2   BETA      P
    rs4970383    C    A    -0.0064   4.7780e-01
    rs4475691    C    T    -0.0145   1.2450e-01
    rs13302982   A    G    -0.0232   2.4290e-01
    ...

Or:

    SNP          A1   A2   OR        P
    rs4970383    A    C    0.9825    0.5737                 
    rs4475691    T    C    0.9436    0.0691
    rs13302982   A    G    1.1337    0.0209
    ...

where SNP is the rs ID, A1 is the reference/effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the reference allele, P is the p-value of the effect. In fact, BETA/OR is only used to determine the direction of an association, and therefore if z-scores or even +1/-1 indicating effect directions are presented in the BETA column, the algorithm should still work properly.

  • GWAS_SAMPLE_SIZE (required): Sample size of the GWAS.

  • OUTPUT_DIR (required): Output directory and output filename prefix of the posterior effect size estimates.

  • PARAM_A (optional): Parameter a in the gamma-gamma prior. Default is 1.

  • PARAM_B (optional): Parameter b in the gamma-gamma prior. Default is 0.5.

  • PARAM_PHI (optional): Global shrinkage parameter phi. If phi is not specified, it will be learnt from the data using a fully Bayesian approach. This usually works well for polygenic traits with large GWAS sample sizes (hundreds of thousands of subjects). For GWAS with limited sample sizes (including most of the current disease GWAS), fixing phi to 1e-2 (for highly polygenic traits) or 1e-4 (for less polygenic traits), or doing a small-scale grid search (e.g., phi=1e-6, 1e-4, 1e-2, 1) to find the optimal phi value often improves perdictive performance.

  • MCMC_ITERATIONS (optional): Total number of MCMC iterations. Default is 1,000.

  • MCMC_BURNIN (optional): Number of burnin iterations. Default is 500.

  • MCMC_THINNING_FACTOR (optional): Thinning of the Markov chain. Default is 5.

  • CHROM (optional): The chromosome(s) on which the model is fitted, separated by comma, e.g., --chrom=1,3,5. Parallel computation for the 22 autosomes is recommended. Default is iterating through 22 autosomes (can be time-consuming).

  • BETA_STD (optional): If True, return standardized posterior SNP effect sizes (i.e., effect sizes corresponding to standardized genotypes with zero mean and unit variance across subjects). If False, return per-allele posterior SNP effect sizes, calculated by properly weighting the posterior standardized effect sizes using allele frequencies estimated from the reference panel. Default is False.

  • SEED (optional): Non-negative integer which seeds the random number generator.

Output

PRS-CS writes posterior SNP effect size estimates for each chromosome to the user-specified directory. The output file contains chromosome, rs ID, base position, A1, A2 and posterior effect size estimate for each SNP. An individual-level polygenic score can be produced by concatenating output files from all chromosomes and then using PLINK's --score command (https://www.cog-genomics.org/plink/1.9/score). If polygenic scores are generated by chromosome, use the 'sum' modifier so that they can be combined into a genome-wide score.

Test Data

The test data contains GWAS summary statistics and a bim file for 1,000 SNPs on chromosome 22. An example to use the test data:

python PRScs.py --ref_dir=path_to_ref/ldblk_1kg_eur --bim_prefix=path_to_bim/test --sst_file=path_to_sumstats/sumstats.txt --n_gwas=200000 --chrom=22 --phi=1e-2 --out_dir=path_to_output/eur

Support

Please direct questions or bug reports to Tian Ge (tge1@mgh.harvard.edu).

About

Polygenic prediction via continuous shrinkage priors

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%