CCRB – Statgene (Statistical Genetics)

  JC School of Public Health and Primary Care (JC SPHPC)  |  Faculty of Medicine |  The Chinese University of Hong Kong (CUHK)  |  Hong Kong SAR













About us









W-test: inherits a data-set adaptive probability distribution for efficient and powerful testing in case-control genetic data sets




Epistasis network identified by the W-test: red lines indicate significant interactions found in the WTCCC bipolar data set, and blue lines indicate significant interactions found in the American GAIN bipolar data set. The purple circle marks the gene that is replicated in the two independent data sets.



R package: wtest on CRAN





The W-test: 


We introduce a W-test to measure the distribution differences between the case and control groups. The measure is model-free and is constructed from contingency tables formed by one or more variables. The statistic follows a chi-squared distribution with data-set adaptive degrees of freedom, estimated from small bootstrapped samples drawn from the data under the null hypothesis. The advantages of the test are as follows:


(1) it inherits a probability distribution, so p-values can be calculated efficiently and the method scales to genome-wide evaluation;


(2) it uses bootstrapped samples to estimate data-corrected degrees of freedom and offers a more accurate probability distribution under complicated data structures, such as low-frequency minor allele variables or mild population stratification; it remains robust as the sample size decreases;


(3) the test statistic is constructed from an integrated odds ratio and takes a retrospective design, so it is suitable for case-control data sets.
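
The construction described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation in the wtest package: the exact form of the statistic (a sum of squared standardized log odds ratios over table cells), the 0.5 continuity correction, the label shuffle used to impose the null hypothesis, the moment-matching estimates of the scale h and degrees of freedom f, and all function names are assumptions made for this sketch.

```python
import math
import random


def tabulate(genotypes, labels, k):
    """Build a 2 x k contingency table: counts per cell for cases and controls."""
    cases, controls = [0] * k, [0] * k
    for g, y in zip(genotypes, labels):
        (cases if y == 1 else controls)[g] += 1
    return cases, controls


def log_or_sum(cases, controls):
    """Sum over cells of squared standardized log odds ratios (needs k >= 2).

    A 0.5 continuity correction is added to every count (sketch assumption)
    so that empty cells do not cause a division by zero.
    """
    n1 = [c + 0.5 for c in cases]
    n0 = [c + 0.5 for c in controls]
    total1, total0 = sum(n1), sum(n0)
    total = 0.0
    for a, b in zip(n1, n0):
        log_or = math.log((a / (total1 - a)) / (b / (total0 - b)))
        se = math.sqrt(1 / a + 1 / (total1 - a) + 1 / b + 1 / (total0 - b))
        total += (log_or / se) ** 2
    return total


def estimate_h_f(genotypes, labels, k, n_rep=200, subsample=200, seed=0):
    """Estimate scale h and degrees of freedom f from small null bootstraps.

    If h * X follows a chi-squared distribution with f degrees of freedom,
    then E[X] = f / h and Var[X] = 2 f / h**2, which gives
    h = 2 * mean / var and f = 2 * mean**2 / var.
    """
    rng = random.Random(seed)
    n = len(labels)
    m = min(subsample, n)
    stats = []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(m)]  # resample with replacement
        g = [genotypes[i] for i in idx]
        y = [labels[i] for i in idx]
        rng.shuffle(y)  # break any genotype-phenotype link (null hypothesis)
        stats.append(log_or_sum(*tabulate(g, y, k)))
    mean = sum(stats) / len(stats)
    var = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return 2 * mean / var, 2 * mean ** 2 / var


def w_statistic(cases, controls, h):
    """Scaled statistic to be compared with a chi-squared(f) distribution."""
    return h * log_or_sum(cases, controls)
```

Given h and f, a p-value would be the upper-tail probability of a chi-squared distribution with f degrees of freedom evaluated at the scaled statistic, for example with `scipy.stats.chi2.sf(w, f)`. Because h and f are estimated once per data set rather than per marker, this step does not have to be repeated for every test in a genome-wide scan.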


Ref: Wang, Sun et al. (2016) Nucleic Acids Research.


Extensions:


W-test collapsing for rare variant association tests: This work extends the W-test into a rare variant association tool by collapsing single-marker contingency tables within a genomic region; it therefore belongs to the burden test category. The test has good power for rare variants and is very efficient. (Sun, Weng et al. 2016, Genetic Epidemiology)


W-pure for detecting pure SNP-SNP interactions: work in progress.
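
For the rare-variant extension, the collapsing step might look like the Python sketch below. It assumes that each single-marker table is a 2 x 2 table of major and minor allele counts by case-control status and that collapsing means summing these tables over the variants in a region; the published method may define the cells differently, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def collapse_tables(
    genotypes: Sequence[Sequence[int]], labels: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Sum single-marker allele-count tables over all variants in a region.

    genotypes[s][v] is the minor-allele count (0, 1 or 2) of subject s at
    variant v; labels[s] is 1 for a case and 0 for a control.
    Returns (cases, controls), each as [major allele count, minor allele count].
    """
    cases, controls = [0, 0], [0, 0]
    for row, y in zip(genotypes, labels):
        target = cases if y == 1 else controls
        for g in row:
            target[0] += 2 - g  # major alleles carried at this variant
            target[1] += g      # minor alleles carried at this variant
    return cases, controls
```

The collapsed table has a fixed number of cells regardless of how many variants the region contains, so a single contingency-table test can be applied to the whole region.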




Rare variant analysis: locating the optimal testing region via a Zoom-Focus Algorithm (ZFA) to enhance the power of any rare variant association test




Computational complexity: full version O(P), where P is the number of variants; fast version O(log2(P)).

Software available: zfa in R project.



ZFA (Zoom-Focus Algorithm) Performance:


Zooming: Locate the binary partition with maximized signals

Focusing: Refine the boundaries
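
The two steps can be illustrated with a minimal Python sketch. It is one reading of the short description above, not the algorithm as published or as implemented in the zfa software: the halving rule, the one-variant boundary moves, the half-open region convention, and the `score` callback (any function returning a larger value for a stronger association signal on variants `start` to `end - 1`) are all assumptions of this sketch.

```python
import math
from typing import Callable, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) of variant indices
Score = Callable[[int, int], float]


def zoom(p: int, score: Score) -> Region:
    """Zooming: repeatedly halve the region, keeping the stronger half.

    Stops when neither half scores higher than the current region, so it
    takes at most about log2(p) halving steps.
    """
    lo, hi = 0, p
    best = score(lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left, right = score(lo, mid), score(mid, hi)
        if max(left, right) <= best:
            break
        if left >= right:
            hi, best = mid, left
        else:
            lo, best = mid, right
    return lo, hi


def focus(region: Region, p: int, score: Score) -> Region:
    """Focusing: move either boundary by one variant while the score improves."""
    lo, hi = region
    best = score(lo, hi)
    improved = True
    while improved:
        improved = False
        for new_lo, new_hi in ((lo - 1, hi), (lo + 1, hi), (lo, hi - 1), (lo, hi + 1)):
            if 0 <= new_lo < new_hi <= p:
                s = score(new_lo, new_hi)
                if s > best:
                    lo, hi, best, improved = new_lo, new_hi, s, True
                    break
    return lo, hi


# Toy signal (hypothetical): variants 5, 6 and 7 out of 16 carry the effect.
SIGNAL = [1.0 if 5 <= i <= 7 else 0.0 for i in range(16)]


def toy_score(start: int, end: int) -> float:
    """Total signal in the region, penalized by the square root of its width."""
    return sum(SIGNAL[start:end]) / math.sqrt(end - start)
```

On the toy signal, `zoom(16, toy_score)` narrows the region to `(4, 8)` and `focus((4, 8), 16, toy_score)` trims it to `(5, 8)`, which is exactly the three signal-carrying variants. In practice the score would come from a rare variant association test applied to the candidate region.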

Reference: Maggie Haitian Wang*, Haoyi Weng, Rui Sun, Jack Lee, William Ka Kei Wu, Ka Chun Chong, Benny Chung-Ying Zee* (2017) A Zoom-Focus Algorithm (ZFA) to locate the optimal testing region for rare variant association tests. Bioinformatics.


Complex disease classification from genetic data sets:  Polygenic stratified risk prediction model








This work proposed a stratified variant evaluation strategy to be applied in complex disease prediction.


Complex disorder: Bipolar

Type of data: Whole exome sequencing data set

Core statistics: W-test

Prediction accuracy: 40% on independent test


Reference: MH Wang, B Chang, R Sun, I Hu, KC Chong and BCY Zee (2017). A Stratified Polygenic Risk Prediction Model with Application on CAGI Bipolar Disorder Sequencing Data. Human Mutation.



Interaction-Based Feature Selection and Classification for High-Dimensional Biological Data 


This work incorporates high order interaction gene sets in prediction modeling. 


Complex disorder: Breast cancer

Type of data: Microarray data set

Core statistics: BDA

Prediction accuracy: 0-6%


Reference: MH Wang, SH Lo, T Zheng and I Hu (2012). A Classification Method Incorporating Interactions among Variables for High-dimensional Data. Bioinformatics 28(21): 2834-2842.




Wang et al. (2015) Two screening methods for genetic association study with application on psoriasis microarray data sets. IEEE International Congress on Big Data.



Virus genetic evolution:   Mapping epidemics via viral mutation prevalence




Figure 3

Genetic Evolution of Human Enterovirus A71 Subgenotype C4 in Shenzhen, China, 1998-2013

We provide a new angle from which to view infectious disease epidemics: mutation prevalence, which faithfully reflected population-level infection-positive rates in different geographical areas. Two amino acid mutations on the EV-A71 virus, Q22H and A289T, are identified as having driven the 2007 hand-foot-and-mouth disease epidemic in China. The mutation trends of early-identified amino acids offer important targets to include in the next year's vaccine formulation.


Ref: He, Zou, Chong, , Maggie Wang 2016, Journal of Infection

Biomedical project collaborations



Epidemiology study design
GWAS and NGS data analysis

Rare variant association tests
Interaction effect methods

Survival analysis