Research Summary

Statistical Methods to Enhance the Power of Discovering Associations from Genome-wide Data.


Complex diseases such as diabetes, cardiovascular disease and cancer pose a significant public health burden in India and across the world. Over the last decade, Genome-wide Association Studies (GWAS), conducted to identify genetic variants (Single Nucleotide Polymorphisms or SNPs) contributing to these diseases, have identified a large number of associated SNPs for many complex diseases. However, a large part of the genetic component of these diseases is still unexplained. A significant part of this ‘missing heritability’ may be due to undiscovered SNPs that fell below the stringent detection threshold in GWAS. Typically, a GWAS tests millions of SNPs individually for association with the disease and then applies a very stringent significance threshold to each association to correct for the number of tests (the multiple-testing problem). This keeps false positives to a minimum but severely reduces power. Many associated SNPs with modest effects are possibly lost due to this ‘curse of dimensionality’. An obvious way to recover these SNPs is to increase sample sizes or to meta-analyze multiple studies. In this project we explore complementary approaches that can further enhance the power of discovery.
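
The trade-off between the number of tests and power can be illustrated with a minimal sketch. It assumes a Bonferroni correction and a two-sided z-test; the family-wise error rate of 0.05, the test counts and the effect size (expressed as a non-centrality parameter of 4) are illustrative assumptions, not values from this project.

```python
from statistics import NormalDist


def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-SNP significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


def z_test_power(ncp: float, alpha: float) -> float:
    """Power of a two-sided z-test at level alpha.

    ncp is the non-centrality parameter: the expected value of the
    z-statistic for a truly associated SNP (assumed, not estimated).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Probability that |Z + ncp| exceeds the critical value.
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)


if __name__ == "__main__":
    ncp = 4.0  # illustrative modest effect
    for n_tests in (1_000_000, 100_000, 1_000):
        alpha = bonferroni_threshold(0.05, n_tests)
        power = z_test_power(ncp, alpha)
        print(f"{n_tests:>9} tests: threshold={alpha:.1e}, power={power:.3f}")
```

For a fixed effect size, the per-SNP threshold relaxes and the power rises as the number of SNPs tested falls, which is the motivation for narrowing the search space.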

Common SNPs with modest effects are likely to be regulatory in nature. Using this fact, it may be possible to narrow the search space in GWAS by prioritizing potential regulatory SNPs. The space can be narrowed further by prioritizing SNPs that regulate the expression of relevant genes, e.g., those that are differentially expressed in patients. For this purpose, Expression-QTL (eQTL) studies can be used to empirically link genotypes with gene expression. Through this project we aim to build and release an analysis pipeline that will enable scientists to combine information across GWAS, gene-expression studies, eQTL studies and biological databases for more powerful discovery of disease-associated SNPs. At the same time, our approach will provide pointers to the genes that are likely to mediate the causal action of these SNPs. We will develop novel statistical methodologies for multiple testing and for the incorporation of prior knowledge, which will serve as building blocks for linking information across omics studies. The methodology and pipeline released through this project will facilitate the identification of novel genetic variants and causal mechanisms, thus helping to elucidate the biological pathways and processes involved in complex diseases.
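
One established way to incorporate prior knowledge into multiple testing is a weighted Bonferroni procedure, in which each SNP's threshold is scaled by a prior weight and the weights average to one. The sketch below is only an illustration of that general idea, not the methodology to be developed in this project; the SNP names, p-values and weights are made-up toy values, and the rule "up-weight SNPs that are eQTLs for differentially expressed genes" is an assumption.

```python
def normalize_weights(raw_weights: dict[str, float]) -> dict[str, float]:
    """Rescale non-negative prior weights so that their mean is 1."""
    mean_w = sum(raw_weights.values()) / len(raw_weights)
    if mean_w <= 0:
        raise ValueError("weights must have a positive mean")
    return {snp: w / mean_w for snp, w in raw_weights.items()}


def weighted_bonferroni(
    p_values: dict[str, float],
    raw_weights: dict[str, float],
    family_alpha: float = 0.05,
) -> list[str]:
    """Return SNPs with p <= family_alpha * weight / m (weights have mean 1)."""
    weights = normalize_weights(raw_weights)
    m = len(p_values)
    return [
        snp for snp, p in p_values.items()
        if p <= family_alpha * weights[snp] / m
    ]


if __name__ == "__main__":
    # Toy GWAS p-values (hypothetical).
    p_values = {"snp_a": 0.02, "snp_b": 0.02, "snp_c": 0.5, "snp_d": 0.9}
    # Hypothetical prior: snp_a is an eQTL for a differentially expressed gene.
    informed = {"snp_a": 3.0, "snp_b": 1.0, "snp_c": 1.0, "snp_d": 1.0}
    uniform = {snp: 1.0 for snp in p_values}

    print("uniform weights: ", weighted_bonferroni(p_values, uniform))
    print("informed weights:", weighted_bonferroni(p_values, informed))
```

Because the weights average to one, the overall error budget is unchanged: it is redistributed towards SNPs with supporting regulatory evidence, at the cost of a stricter threshold for the remaining SNPs.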

Figure Legend: The statistical power of discovery from GWAS can be enhanced significantly by borrowing information from experiments such as transcriptomic and eQTL (expression-QTL) studies. SNPs showing evidence of regulating differentially expressed genes can be prioritized to conduct an ‘informed GWAS’ over a reduced search space of SNPs that are more likely to be truly associated.