CCRB – Statgene (Statistical Genetics)

  JC School of Public Health and Primary Care (JC SPHPC)  |  Faculty of Medicine |  The Chinese University of Hong Kong (CUHK)  |  Hong Kong SAR

 

 


 

 

   Research  

 

W-test: inherits data-set adaptive probability distribution for efficient and powerful testing in case-control genetic data sets

 

Figure 4b. Epistasis network identified by the W-test. Red lines indicate significant interactions found in the WTCCC bipolar data set, and blue lines indicate significant interactions found in the American GAIN bipolar data set. The purple circle marks the gene that is replicated in the two independent data sets.

 

 

R package: wtest, available on CRAN

 

 

 

 

The W-test: 

 

We introduce the W-test to measure distributional differences between the case and control groups. The measure is model-free and is constructed from contingency tables formed by one or more variables. The statistic follows a chi-squared distribution with data-set adaptive degrees of freedom, estimated from small bootstrapped samples drawn from the data under the null hypothesis. The advantages of the test are as follows:

 

 (1) it inherits a probability distribution, so p-values can be calculated efficiently and the method scales to genome-wide evaluation;

 

 (2) it uses bootstrapped samples to estimate data-corrected degrees of freedom, giving a more accurate probability distribution under complicated data structures, such as variables with low minor allele frequency or mild population stratification; it remains robust as the sample size decreases;

 

(3) the test statistic is constructed from an integrated odds ratio and takes a retrospective design, so it is well suited to case-control data sets.
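As a rough illustration of the construction described above, the following Python sketch sums squared standardized log odds ratios over the cells of a 2 x k contingency table and then rescales the sum so that it matches a chi-squared distribution. It is not the implementation in the wtest R package: the function names, the 0.5 continuity correction, the resampling scheme, and the moment-matching formulas h = 2*mean/variance and f = 2*mean^2/variance are all assumptions made for this example.

```python
import math
import random


def w_raw(cases, controls, k):
    """Sum over the k table cells of (log odds ratio / standard error)^2.

    cases, controls: lists of cell indices in 0..k-1, one entry per subject.
    Assumption: 0.5 is added to every count so that no cell is empty.
    """
    total = 0.0
    for cell in range(k):
        a = cases.count(cell) + 0.5                      # cases in this cell
        c = len(cases) - cases.count(cell) + 0.5         # cases elsewhere
        b = controls.count(cell) + 0.5                   # controls in this cell
        d = len(controls) - controls.count(cell) + 0.5   # controls elsewhere
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        total += (log_or / se) ** 2
    return total


def w_test(cases, controls, k, n_boot=200, seed=0):
    """Return (scaled statistic, h, f) with h and f estimated by resampling.

    Assumption: the null distribution is approximated by drawing both groups
    with replacement from the pooled data, so the groups share one distribution.
    """
    rng = random.Random(seed)
    pooled = cases + controls
    null_stats = []
    for _ in range(n_boot):
        boot_cases = rng.choices(pooled, k=len(cases))
        boot_controls = rng.choices(pooled, k=len(controls))
        null_stats.append(w_raw(boot_cases, boot_controls, k))
    mean = sum(null_stats) / n_boot
    var = sum((x - mean) ** 2 for x in null_stats) / (n_boot - 1)
    # Match the first two moments of h * X to a chi-squared with f degrees of freedom.
    h = 2 * mean / var
    f = 2 * mean ** 2 / var
    return h * w_raw(cases, controls, k), h, f


if __name__ == "__main__":
    # Hypothetical genotype cells (0, 1, 2) for 100 cases and 100 controls.
    example_cases = [0] * 60 + [1] * 30 + [2] * 10
    example_controls = [0] * 30 + [1] * 30 + [2] * 40
    print(w_test(example_cases, example_controls, k=3))
```

Given h and f, a p-value would be the upper tail of a chi-squared distribution with f degrees of freedom evaluated at the scaled statistic, for example with scipy.stats.chi2.sf(w, f).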

 

Ref: Wang, Sun et al. 2016, Nucleic Acids Research

 

Extensions:

 

W-test collapsing for rare variant association testing: This work extends the W-test into a rare variant association tool by collapsing single-marker contingency tables within a genomic region; it therefore belongs to the burden test category. The test has good power for rare variants and is very efficient. (Sun, Weng et al. 2016, Genetic Epidemiology)
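One possible reading of the collapsing step is sketched below in Python: per-marker 2 x 2 tables (carrier vs. non-carrier, case vs. control) within a region are added cell by cell, and a squared standardized log odds ratio is computed on the collapsed table. The table layout, the function names, and the 0.5 continuity correction are assumptions for illustration; the published method may differ in its details.

```python
import math


def collapse_tables(tables):
    """Add per-marker 2x2 tables cell by cell.

    Each table is [[case_carriers, case_noncarriers],
                   [control_carriers, control_noncarriers]].
    """
    collapsed = [[0, 0], [0, 0]]
    for table in tables:
        for i in range(2):
            for j in range(2):
                collapsed[i][j] += table[i][j]
    return collapsed


def squared_log_odds_z(table):
    """(log odds ratio / standard error)^2 for one 2x2 table.

    Assumption: 0.5 is added to each cell so sparse rare-variant counts
    never produce log(0) or a division by zero.
    """
    a, c = (x + 0.5 for x in table[0])   # case carriers, case non-carriers
    b, d = (x + 0.5 for x in table[1])   # control carriers, control non-carriers
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (log_or / se) ** 2


if __name__ == "__main__":
    # Hypothetical region with three rare markers, 200 cases and 200 controls each.
    region = [[[3, 197], [0, 200]], [[2, 198], [1, 199]], [[4, 196], [0, 200]]]
    print(collapse_tables(region), squared_log_odds_z(collapse_tables(region)))
```

In this toy region each marker alone carries little evidence, while the collapsed table pools the carrier counts. Note that summing tables counts each subject once per marker.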

 

W-pure for detecting pure SNP-SNP interactions: work in progress.

 

 

 

Rare variant analysis:  locating optimal testing region via a Zoom-Focus algorithm (ZFA) to enhance power of any rare variant association test

 

Algorithm:

 

Computational complexity: full version, O(P), where P is the number of variants; fast version, O(log2(P)).

Software available: the zfa package for R.

 

 

ZFA (Zoom-Focus Algorithm) Performance:

 

Zooming: Locate the binary partition with maximized signals

Focusing: Refine the boundaries
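The two steps can be sketched in Python as follows. This is a greedy illustration in the spirit of the fast version, not the code of the zfa package: the function names, the stopping rules, and the toy score function are assumptions, and score(lo, hi) stands in for any rare variant association statistic computed on variants lo to hi-1.

```python
import math


def zoom(score, lo, hi):
    """Zooming: halve [lo, hi) repeatedly, keeping the better-scoring half
    for as long as that half scores at least as well as the current region."""
    best = score(lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left, right = score(lo, mid), score(mid, hi)
        if left >= right:
            cand, cand_lo, cand_hi = left, lo, mid
        else:
            cand, cand_lo, cand_hi = right, mid, hi
        if cand < best:
            break                      # neither half beats the current region
        best, lo, hi = cand, cand_lo, cand_hi
    return lo, hi, best


def focus(score, lo, hi, n_variants, best):
    """Focusing: move one boundary by one variant at a time while the score improves."""
    improved = True
    while improved:
        improved = False
        for new_lo, new_hi in ((lo - 1, hi), (lo + 1, hi), (lo, hi - 1), (lo, hi + 1)):
            if 0 <= new_lo < new_hi <= n_variants:
                s = score(new_lo, new_hi)
                if s > best:
                    best, lo, hi, improved = s, new_lo, new_hi, True
                    break              # restart from the new boundaries
    return lo, hi, best


if __name__ == "__main__":
    # Hypothetical per-variant signals; the toy score rewards short, signal-dense regions.
    signal = [0, 0, 0, 0, 0, 3, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    toy_score = lambda lo, hi: sum(signal[lo:hi]) / math.sqrt(hi - lo)
    lo, hi, best = zoom(toy_score, 0, len(signal))
    print(focus(toy_score, lo, hi, len(signal), best))
```

The zooming loop evaluates two halves per level, so it calls the score function about 2*log2(P) times, which is where a logarithmic cost would come from under these assumptions.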

Reference: Maggie Haitian Wang*, Haoyi Weng, Rui Sun, Jack Lee, William Ka Kei Wu, Ka Chun Chong, Benny Chung-Ying Zee* (2017) A Zoom-Focus Algorithm (ZFA) to locate the optimal testing region for rare variant association tests. Bioinformatics.

 

Complex disease classification from genetic data sets:  Polygenic stratified risk prediction model

 

 

Diagram:

 

 

 

 

This work proposes a stratified variant evaluation strategy for complex disease prediction.

 

Complex disorder: Bipolar disorder

Type of data: Whole exome sequencing data set

Core statistic: W-test

Prediction accuracy: 40% on an independent test set

 

Reference: MH Wang, B Chang, R Sun, I Hu, KC Chong and BCY Zee (2017). A Stratified Polygenic Risk Prediction Model with Application on CAGI Bipolar Disorder Sequencing Data. Human Mutation.

 


Interaction-Based Feature Selection and Classification for High-Dimensional Biological Data 

 

This work incorporates high-order interaction gene sets into prediction modeling.

 

Complex disorder: Breast cancer

Type of data: Microarray data set

Core statistic: BDA

Prediction accuracy: 0-6%

 

Reference: MH Wang, SH Lo, T Zheng and I Hu (2012). A Classification Method Incorporating Interactions among Variables for High-dimensional Data. Bioinformatics 28(21): 2834-2842.

 

Extensions:

 

Wang et al. (2015) Two screening methods for genetic association study with application on psoriasis microarray data sets. IEEE International Congress on Big Data.

 

 

Virus genetic evolution:   Mapping epidemics via viral mutation prevalence

 

            

 

Figure 3

Genetic Evolution of Human Enterovirus A71 Subgenotype C4 in Shenzhen, China, 1998-2013

We provide a new angle from which to view infectious disease epidemics: mutation prevalence, which faithfully reflected population-level infection positive rates in different geographical areas. Two amino acid mutations on the EV-A71 virus, Q22H and A289T, are identified as having driven the 2007 Hand-Foot-Mouth Disease epidemic in China. The mutation trends of amino acids identified early offer important targets to include in the next year’s vaccine formulation.

 

Ref: He, Zou, Chong, , Maggie Wang 2016, Journal of Infection

Biomedical project collaborations

 

 

Biostatistics
Epidemiology study design
GWAS and NGS data analysis

Rare variant association tests
Interaction effect methods

Survival analysis