## KBAC Statistic Implementation

### Methodology

This program implements the KBAC statistic in Liu and Leal 20101). It carries out case-control association testing for rare variants for whole exome association studies. Briefly, consider a gene of length $N$ which harbors $M$ rare variants. Genotype on the $M$ variant sites & the disease status (case/control) are known for each individual. The program takes as input the $M$-site genotype and disease status (case/control) data files, and computes a $p$ value indicating the significance of association. In order to speed up permutation testing we use an “adaptive” approach to obtain $p$ values.

You can install the latest version of KBAC via the R script below:

Click to display ⇲

Click to hide ⇱

install_kbac.R
if (!require("devtools", character.only=TRUE, quietly=TRUE)) {
install.packages("devtools")
}
devtools::install_github("gaow/kbac")

### ChangeLog

• 2016.04.09 Move source code to github. No updates made to the package.
• 2013.02.22 Output the weights for the original genotype patterns, in response to user requests
• 2011.05.27 GSL portability: I keep updating the package trying to port the GSL libraries into the program for a speedy hypergeometric routine. Complexities thus arise (various dependency problems) that I keep resolving. Please let me know if you fail to compile the package on your computer
• 2011.05.18 Improved R/CPP interface
• 2011.04.12 A more proper implementation of two-sided test
• 2010.12.02 Initial release

## Documentation

### Note

In this R implementation

• Only hyper-geometric kernel implemented
• Only single candidate region analysis is supported (no covariates allowed)
• The length of candidate region reported in KBAC log is the number of variant sites analyzed (excluding sites with zero MAF or MAF above given threshold)
• Allows for testing of both protective and deleterious mutations (set alternative = 2); Genotypes codings are 0 or 1 or 2. Invalid codings will be re-coded as 0 (wild-type). Phenotype codings are 0 for ctrls, 1 for cases
• Genotype should consist only of the rare variants of interest – synonymous, non-polymophic sites and common variants (MAF > 0.01) must be excluded;
• Variant sites must be SNPs with numeric coding, no missing data allowed.

The purpose of this R packages to demonstration the KBAC methodology. It is not designed for analysis of real-world next-generation sequencing data, for which I suggest VAT as a resort.

### Input

Genotype for each locus are coded as 0 for wild type, 1 for heterozygotes and 2 for homozygotes, e.g.

STATUS M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0

### Example

#### Getting started

Install the package, start R, and type the following to load the package and read the usage

library("KBAC")
?KbacTest

#### A quick demonstration

Click to display ⇲

Click to hide ⇱

kbac_example.R
casectrl.dat <- read.table("phengen.dat", skip = 1)
# Set parameters and use the KbacTest() function to obtain p-value
alpha <- 0.05
num.permutation <- 3000
quiet <- 1
alternative <- 1
maf.upper.bound <- 0.05
kbac.pvalue <- KbacTest(casectrl.dat, alpha, num.permutation, quiet, maf.upper.bound, alternative)
print(kbac.pvalue)
# To evaluate test at small alpha we need huge number of permutations. Adaptive approach is thus necessary.
kbac.pvalue <- KbacTest(casectrl.dat, 0.00001, 1000000, quiet, maf.upper.bound, alternative)
print(kbac.pvalue)
# Not using adaptive p-value calculation; will take longer time
kbac.pvalue <- KbacTest(casectrl.dat, 9, 1000000, quiet, maf.upper.bound, alternative)
print(kbac.pvalue)

##### Result

If you set quiet = F then you will see the screen output like this:

Click to display ⇲

Click to hide ⇱

Number of each unique individual genotype patterns (totaling 16 patterns excluding wildtype):
3, 4, 4, 1, 29, 1, 16, 6, 40, 19, 1, 1, 51, 1, 1, 1,

Unique genotype patterns weights (model 1):
0.124812 0.687688 0.312312 1 0.98844 0.5 0.962177 0.656485 0.925138 0.821517 0.5 0.5 0.922296 0.5 0.5 1

Take the first genotype pattern for example. There are 3 individuals having this genotype pattern in the input data, all of whom are controls.

Click to display ⇲

Click to hide ⇱

casectrl.dat[c(507,525,549),]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
507  0  0  1  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
525  0  0  1  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
549  0  0  1  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
V22 V23 V24 V25 V26 V27 V28
507   0   0   0   0   0   0   0
525   0   0   0   0   0   0   0
549   0   0   0   0   0   0   0

To see how the weight 0.124812 is tabulated, consider the phyper(x,m,n,k) notation in R where

• x: number of cases (for model 1) having this genotype pattern
• m: total number of cases
• n: total number of ctrls
• k: total number of samples having this genotype pattern

Then the weight is computed by phyper(0,1000,1000,3) which is 0.124812

#### Compare with CMC method

An R script for the CMC method Li and Leal 20082) via Fisher's exact test is provided for comparison purpose:

Click to display ⇲

Click to hide ⇱

cmc.R
arg <- commandArgs()

Cmc <- function(fn) {
pgdata <- as.matrix(read.table(fn, as.is=T, skip = 1))
y <- pgdata[,1];
x <- ((rowSums(as.matrix(pgdata[,-1])))>0);
m <- matrix(nrow=2,ncol=2);
m[1,1] <- sum(x==1 & y==1);
m[1,2] <- sum(x==0 & y==1);
m[2,1] <- sum(x==1 & y==0);
m[2,2] <- sum(x==0 & y==0);
print(m);
stat <- fisher.test(m)\$p.value;
return(stat);
}

Cmc(arg[3])

To use this script:

R --no-save phengen.dat < cmc.R