W-test: dataset adaptive association test for main and interaction effects in GWAS data

  JC School of Public Health and Primary Care (JC SPHPC)  |  Faculty of Medicine |  The Chinese University of Hong Kong (CUHK)  |  Hong Kong SAR

 

 

 


 

 

The Method 

 

The W-test is designed to test for distributional differences between cases and controls for a set of categorical variables, which can be a single SNP, a SNP-SNP pair, or a SNP-environment pair. The statistic takes a combined log odds ratio form, calculated from the contingency table of the variable set. It follows a chi-squared distribution with dataset-adaptive degrees of freedom f, which are estimated from smaller bootstrapped samples of the data. This flexible, data-corrected probability distribution allows the W-test to give more accurate p-values and to achieve good power while controlling the type I error rate. The calculation is fast: genome-wide epistasis testing can be performed in about 2 hours, including the bootstrap time.

 

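The test statistic takes the following form:

$$W = h \sum_{i=1}^{k} \left[ \frac{\log\left( \dfrac{p_{1i}/(1-p_{1i})}{p_{0i}/(1-p_{0i})} \right)}{SE_i} \right]^2 \sim \chi^2_f$$

where p1i is the proportion of case subjects in cell i, p0i is the corresponding proportion in the control group, k is the number of categorical combinations, and SE_i is the standard error of the log odds ratio for cell i. The scalar h and the degrees of freedom f are estimated from the data. For details of the expression, please refer to our paper (Wang, Sun et al. 2016).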
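As an illustration of the definitions above, the following minimal Python sketch computes the statistic from the case and control counts of one contingency table. It is not the code of the R or C++ packages. It assumes that the statistic is h times the sum over the k cells of the squared ratio of the log odds ratio to its standard error, that SE is the usual large-sample standard error of a log odds ratio from a 2x2 table, and that 0.5 is added to a 2x2 table containing a zero count. The function name w_statistic and its arguments are hypothetical.

```python
import math

def w_statistic(case_counts, control_counts, h=None):
    """Return the W statistic for one contingency table with k cells.

    case_counts[i] and control_counts[i] are the numbers of cases and
    controls falling in cell i. If h is None, the default h = (k - 1) / k
    quoted on this page is used.
    """
    k = len(case_counts)
    if k < 2 or k != len(control_counts):
        raise ValueError("need two count vectors of equal length k >= 2")
    n1, n0 = sum(case_counts), sum(control_counts)
    if h is None:
        h = (k - 1) / k
    total = 0.0
    for a, c in zip(case_counts, control_counts):
        b, d = n1 - a, n0 - c  # cases / controls outside cell i
        if min(a, b, c, d) == 0:
            # Assumption: 0.5 continuity correction for zero counts.
            a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        # (p1i / (1 - p1i)) / (p0i / (1 - p0i)) simplifies to (a*d) / (b*c).
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        total += (log_or / se) ** 2
    return h * total

# Example: a SNP-SNP pair with up to 9 genotype combinations would pass
# two vectors of length 9; here a small table with k = 3 is used.
w = w_statistic([40, 35, 25], [30, 30, 40])
```

When the case and control proportions are identical in every cell, each log odds ratio is zero and the statistic is zero, which is a quick sanity check for any implementation.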

 

  h and f

h and f are parameters estimated from bootstrapped samples. N is the total number of subjects, and P is the total number of variables. The suggested settings are: number of bootstrap replicates B > 200, number of subjects per bootstrap sample NB = min(1000, N), and number of pairs PB = min(1000, P). Empirical studies give h ≈ (k − 1)/k and f ≈ k − 1. Convergence simulations can be found here.
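One way to picture the bootstrap estimation is the Python sketch below. It is an assumption-laden illustration, not the packaged procedure: it imposes the null hypothesis by permuting the case/control labels within each bootstrap sample, and it obtains h and f by matching the first two moments of a chi-squared distribution (if hX follows a chi-squared distribution with f degrees of freedom, then h·mean(X) = f and h²·var(X) = 2f). The helper names unscaled_sum, moment_match and estimate_h_f, the 0.5 added to every count, and the averaging over replicates are all hypothetical choices.

```python
import math
import random
import statistics

def unscaled_sum(codes, labels, k):
    """Sum over k cells of (log OR / SE)^2, before scaling by h."""
    case, ctrl = [0] * k, [0] * k
    for g, y in zip(codes, labels):
        (case if y == 1 else ctrl)[g] += 1
    n1, n0 = sum(case), sum(ctrl)
    total = 0.0
    for a, c in zip(case, ctrl):
        # Assumption: add 0.5 to every count so that no cell is zero.
        a, b, c, d = a + 0.5, n1 - a + 0.5, c + 0.5, n0 - c + 0.5
        total += math.log(a * d / (b * c)) ** 2 / (1 / a + 1 / b + 1 / c + 1 / d)
    return total

def moment_match(xs):
    """Solve h*mean = f and h^2*var = 2f for (h, f)."""
    mu, var = statistics.fmean(xs), statistics.variance(xs)
    return 2 * mu / var, 2 * mu * mu / var

def estimate_h_f(variables, labels, k, B=200, NB=1000, PB=1000, seed=0):
    """variables[j][s] is the category code (0..k-1) of subject s for
    variable (or pair) j; labels[s] is 1 for a case and 0 for a control."""
    rng = random.Random(seed)
    n, p = len(labels), len(variables)
    nb, pb = min(NB, n), min(PB, p)
    hs, fs = [], []
    for _ in range(B):
        rows = [rng.randrange(n) for _ in range(nb)]  # bootstrap subjects
        y = [labels[r] for r in rows]
        rng.shuffle(y)                                # permute labels: null
        cols = rng.sample(range(p), pb)               # subset of variables
        xs = [unscaled_sum([variables[j][r] for r in rows], y, k) for j in cols]
        h, f = moment_match(xs)
        hs.append(h)
        fs.append(f)
    return statistics.fmean(hs), statistics.fmean(fs)
```

Under these assumptions, estimates obtained on independent null data should land in the neighbourhood of the empirical values h ≈ (k − 1)/k and f ≈ k − 1 quoted above.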

 

  Application note

  1. Simulation studies show that the power of the W-test is greatest when the number of categorical combinations k is between 4 and 9. When k is small, such as 2 or 3, the dependency among the odds ratios is very high, and the power of the W-test is only slightly higher than that of the chi-squared test and logistic regression (see simulation studies here). When k is large and the sample size is sufficient, the dependency is moderate to low, and the h and f estimates come very close to the default values of h ≈ (k − 1)/k and f ≈ k − 1.

 

  2. For each dataset, first estimate h and f, then use the estimates to calculate the pairwise or main effect statistics. The default h and f can be used in the exploration step without changing the ranking of the returned pairs too much. One set of h and f is needed for each dataset. If a sub-population is drawn, the genetic structure changes and the parameters need to be estimated again. In the R package, we provide diagnostic plots of the probability distribution under the null hypothesis for each k.

 

  3. For very large values of the statistic, the p-value computed for the gamma-distributed statistic can differ slightly between the R environment, Excel, and the C++ software. We recommend using the p-value returned by R as the standard.
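To make the notes above concrete, the sketch below shows the default parameters for a given k and one way to evaluate the upper-tail probability of a chi-squared distribution with non-integer degrees of freedom f. It uses only the Python standard library and the standard series and continued-fraction expansions of the regularized incomplete gamma function. It is not the routine used by R, Excel, or the C++ software, and the function names default_h_f and chi2_log_sf are hypothetical. The p-value is returned on the log scale so that very large statistics remain representable instead of underflowing to zero.

```python
import math

def default_h_f(k):
    """Default parameters quoted on this page: h = (k-1)/k, f = k-1."""
    return (k - 1) / k, float(k - 1)

def chi2_log_sf(w, f, eps=1e-13, max_iter=10000):
    """Natural log of P(X > w) for X ~ chi-squared with f degrees of freedom."""
    if w <= 0:
        return 0.0
    a, x = f / 2.0, w / 2.0
    log_prefix = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma P(a, x).
        term = 1.0 / a
        total = term
        n = a
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return math.log1p(-total * math.exp(log_prefix))
    # Continued fraction (modified Lentz) for the upper tail Q(a, x).
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    frac = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        frac *= delta
        if abs(delta - 1.0) < eps:
            break
    return log_prefix + math.log(frac)

# Example: p-value of W = 25.0 for k = 9 using the default f = 8.
h, f = default_h_f(9)
p_value = math.exp(chi2_log_sf(25.0, f))
```

Comparing log p-values rather than p-values is one simple way to check how closely two software environments agree in the extreme tail.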

 

 

Updates

 

4 Aug 2017 – Howey and Cordell published a follow-up paper on the W-test [4]:

In this paper, Howey and Cordell further investigated the theoretical properties of the W-test and applied the method to additional simulated and real datasets. In particular, the authors compared the method to logistic regression with 8 degrees of freedom. Their simulation studies showed that the W-test has good or conservative type I error rates and better power than the chi-squared test and logistic regression under certain scenarios. The paper concluded that the W-test is a useful and practical method for real GWAS data analysis, especially for low-frequency and rare variants. Howey and Cordell also pointed out a data quality control problem in the original paper and raised concerns about the reported real biomarkers. Dr. Cordell kindly provided us with a list of SNPs to exclude because of high genotyping error, and we have redone the real data analysis with these SNPs excluded. The real data part of the W-test manuscript has been revised accordingly: PDF (with revised real data analysis part – 4 Aug 2017), Supplementary Materials (4 Aug 2017).

 

 

 

References:

 

 1. Maggie Haitian Wang, Rui Sun (Co-first authors), Junfeng Guo, Haoyi Weng, Jack Lee, Inchi Hu, Pak Sham and Benny Chung-Ying Zee (2016).  A fast and powerful W-test for pairwise epistasis testing. Nucleic Acids Research. doi:10.1093/nar/gkw347.  PDF (with revised real data analysis part – 4 Aug 2017), Supplementary Materials (4 Aug 2017).

 

 

 2. Rui Sun, Haoyi Weng (Co-first), Inchi Hu, Junfeng Guo, William K.K. Wu, Benny Chung-Ying Zee, Maggie Haitian Wang* (2016) A W-test collapsing method for rare variant association testing in exome sequencing data. Genetic Epidemiology.

 

 3. Rui Sun, Maggie Haitian Wang. R package: wtest function for main and epistasis effect evaluation. arXiv preprint arXiv:1610.03182

 

 4. Howey R and Cordell HJ. Further investigations of the W-test for pairwise epistasis testing, Wellcome Open Res 2017, 2:54 (doi: 10.12688/wellcomeopenres.11926.1)