These are my notes from the class Statistics for Genomic Data Science, offered by Johns Hopkins University on Coursera. The class covers four parts and provides a good starting point for exploring statistics and R programming in genomic science. The online resource is at http://jtleek.com/genstats_site/.
1. Exploratory analysis
1). central dogma of statistics: population --probability--> sample --inference--> population.
2). reproducible research!!
- data sharing plan: raw data, tidy data set, code book, and explicit recipe.
3). three tables in genomics:
- sample data (the features x samples measurement matrix)
- feature data (one row per feature, e.g., gene annotation)
- phenotype data (one row per sample, e.g., covariates)
4). var(genomic measurement) = phenotypic variability + measurement error + natural biological variation.
5). power: the probability of discovering a real signal if it is there (conventionally set at 80%).
6). batch effects: the most common confounder in genomics.
e.g., in one well-known study, 78% of genes appeared differentially expressed between groups, an artifact driven largely by batch effects.
7). randomization: without randomization, confounding variables can differ systematically among treatment groups.
8). exploratory analysis
- check for missing values in a variety of ways
- M-A plot: x-axis ~ sum (or average) of the log values (A), y-axis ~ their difference (M); see the sketch below.
9). Data transforms: on one hand, most statistical methods are designed to work better on non-skewed data; on the other hand, most genomic data is skewed in reality. One common fix is a transformation such as log2(x + 1), as sketched below.
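A minimal sketch of both ideas, assuming edata is a genes-by-samples expression matrix as in the course examples:
log_edata = log2(edata + 1) # adding 1 avoids log(0); the log compresses the right skew
hist(log_edata[,1], col = 2) # compare with hist(edata[,1]) to see the effect
# M-A plot comparing the first two samples
aa = (log_edata[,1] + log_edata[,2]) / 2 # A: average of the log values (x-axis)
mm = log_edata[,1] - log_edata[,2] # M: difference of the log values (y-axis)
plot(aa, mm, pch = 19, xlab = "A (average)", ylab = "M (difference)")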
10). More transforms, depending on the type of data:
- variance stabilizing transform: removes the mean-variance relationship in the data
- Box-Cox transform: makes the data approximately normally distributed
- rlog transform: specific to genomic count data; a regularized version of the log transform that shrinks differences at low count levels (see the sketch below).
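A minimal rlog sketch with DESeq2, assuming edata is a matrix of raw integer counts and pdata has a strain column (as in the course data):
library(DESeq2)
dds = DESeqDataSetFromMatrix(countData = edata, colData = pdata, design = ~ strain)
rld = rlog(dds) # regularized log; shrinks differences for low-count genes
plot(assay(rld)[,1], assay(rld)[,2], pch = 19) # compare two samples on the rlog scale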
11). Clustering:
library(dendextend)
(a). Hierarchical Cluster
First, compute the distance matrix between samples:
dist1 = dist(t(edata)) # dist() works on rows, so transpose so that samples are the rows
Then we can use heatmap to visualize the distance matrix.
Finally, perform hierarchical clustering and plot the dendrogram:
hclust1 = hclust(dist1)
plot(hclust1)
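Since dendextend is loaded above, we can also color the dendrogram labels by cluster; a sketch assuming we want 4 clusters:
dend = as.dendrogram(hclust1)
dend = color_labels(dend, k = 4, col = 1:4) # cut the tree into 4 clusters, one color each
plot(dend)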
(b). K-Means Cluster
In R, kmeans() clusters the rows by default:
kmeans1 = kmeans(edata,centers=3)
We can reorder the rows by cluster assignment and plot:
colramp = colorRampPalette(c("blue", "white", "red"))(9) # a color palette for the heatmap
heatmap(as.matrix(edata)[order(kmeans1$cluster),], col = colramp, Colv = NA, Rowv = NA)
Note: kmeans() starts from random centers, so you will get a different answer if you run it again (see the sketch below).
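A quick sketch of that point: fixing the seed makes the clustering reproducible.
set.seed(678910)
kmeans1 = kmeans(edata, centers = 3)
set.seed(678910)
kmeans2 = kmeans(edata, centers = 3)
table(kmeans1$cluster, kmeans2$cluster) # same seed => identical assignments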
2. Normalization and preprocessing
1). Dimension reduction: PCA and SVD.
- SVD: find patterns that explain the variation:
edata_centered = edata - rowMeans(edata) # center each gene (row) first
svd1 = svd(edata_centered)
plot(svd1$d^2/sum(svd1$d^2), ylab = "Percent variance explained", col = 2)
A very common plot is to plot PC1 versus PC2 to see if we can see any "clusters" or "groups":
plot(svd1$v[,1],svd1$v[,2],ylab="2ndPC",xlab="1stPC",col=as.numeric(pdata$study))
Another common plot is a boxplot comparing a PC across levels of a known covariate:
boxplot(svd1$v[,1]~pdata$study, border=c(1,2))
points(svd1$v[,1]~jitter(as.numeric(pdata$study)),col=as.numeric(pdata$study))
- PCA: prcomp() calculates the variation across columns (it centers each column by default):
pc1 = prcomp(edata)
plot(pc1$rotation[,1],svd1$v[,1])
In order to get the actual PC1 and PC2, we need to center the data by column, instead of by row.
edata_centered2 = t( t(edata)-colMeans(edata) )
svd2 = svd(edata_centered2)
plot(pc1$rotation[,1], svd2$v[,1], col = 2) # now the PCA and SVD results agree
- We need to center the dataset; otherwise the first singular vector will just reflect the overall mean level (sketch after this list).
- watch out for outliers and outlying genes, which can dominate the decomposition.
- SVD/PCA are widely used to detect batch effects (refer to lecture).
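A tiny check of the centering point above, assuming edata as in the earlier code: without centering, the leading right singular vector mostly tracks per-sample mean levels.
svd0 = svd(as.matrix(edata)) # SVD without any centering
plot(svd0$v[,1], colMeans(edata)) # nearly a straight line: the first "pattern" is mean level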
2). Quantile Normalization
- preprocessing: try to remove technological artifacts.
- normalization: remove technological biases, make the samples comparable.
- quantile normalization: the most common technique; it "forces the distributions to be the same" across samples. It removes technical variability but should be used with care when large global biological differences are expected.
library(preprocessCore)
norm_edata = normalize.quantiles(as.matrix(edata))
- Preprocessing and normalization are highly application specific: Affymetrix (affy), Illumina (lumi), gene-count based models (Rsubread), peaks (DiffBind).
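To see what quantile normalization does, a sketch comparing per-sample densities before and after (assuming edata has at least 5 samples):
library(preprocessCore)
norm_edata = normalize.quantiles(as.matrix(edata))
par(mfrow = c(1, 2))
plot(density(log2(edata[,1] + 1)), col = 1, main = "raw")
for (i in 2:5) { lines(density(log2(edata[,i] + 1)), col = i) }
plot(density(log2(norm_edata[,1] + 1)), col = 1, main = "quantile normalized")
for (i in 2:5) { lines(density(log2(norm_edata[,i] + 1)), col = i) } # curves now coincide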
3). Linear Models
- linear models with categorical covariates
-- additive model
-- dominant model
-- recessive model
-- two degrees of freedom model
- adjust for covariates
-- equivalent to fitting multiple lines
-- interpretation of interactions changes
- many regressions simultaneously: genomics measures many variables, and one of the goals is to find out which variables relate to the outcome.
library(limma)
library(edge)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number)) # design adjusted for lane
fit_adj = lm.fit(mod_adj, t(edata)) # base R: fits every gene at once
fit_limma = lmFit(edata, mod_adj) # the limma version of the same fit
edge_study = build_study(data = edata, grp = pdata$strain, adj.var = as.factor(pdata$lane.number))
fit_edge = fit_models(edge_study) # the edge version
summary(fit_edge)
-- linear models for microarray data: http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf
- batch effects and confounders: if each biological group is run in its own batch, we cannot tell the biological effect apart from the batch effect.
- In general we'd like the residuals to look symmetrically (e.g., approximately normally) distributed, so data transforms are often applied before regression.
- Also, we need to be careful when two variables in the regression model are highly correlated.
library(sva)
library(snpStats)
mod = model.matrix(~as.factor(cancer) + as.factor(batch), data=pheno)
fit = lm.fit(mod,t(edata))
ComBat: directly adjust for a known batch variable via empirical Bayes (a sketch, using the pheno columns from above):
combat_edata = ComBat(dat = edata, batch = pheno$batch, mod = model.matrix(~cancer, data = pheno))
- surrogate variable analysis
-- estimate surrogate variables (batch-like effects) from the data itself.
mod = model.matrix(~cancer,data=pheno)
mod0 = model.matrix(~1,data=pheno)
sva1 = sva(edata,mod,mod0,n.sv=2)#number of significant surrogate variables
modsv = cbind(mod,sva1$sv) #add surrogate variables to the model matrix
fit_sv = lm.fit(modsv,t(edata))
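A sketch comparing the per-gene cancer effect under the two adjustments above (fit used the known batch, fit_sv the surrogate variables; row 2 of each coefficient matrix is the cancer term):
plot(fit$coefficients[2,], fit_sv$coefficients[2,],
     xlab = "batch-adjusted effect", ylab = "SVA-adjusted effect", pch = 19)
abline(0, 1, col = 2) # points near the line: the two adjustments broadly agree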
3. Statistical modeling
- false positives: signals that "come up just because you are doing so much data sifting when you look at the data."
- logistic regression: log odds
-- The coefficient is the change in log odds per unit change in the number of copies of the minor allele.
snp1 = as.numeric(snpdata[,1]) #homozygous major allele is coded 1
snp1[snp1==0] = NA # 0 codes a missing genotype
glm1 = glm(status ~ snp1,family="binomial")
tidy(glm1)
snp1_dom = (snp1 == 1) # dominant model: homozygous major vs. at least one minor allele
glm_all = snp.rhs.tests(status ~ 1, snp.data = sub.10) # one score test per SNP
slotNames(glm_all)
qq.chisq(chi.squared(glm_all), df = 1) # QQ plot of the chi-squared statistics
library(DESeq2)
de = DESeqDataSetFromMatrix(edata,pdata,~strain)
glm_all_nb = DESeq(de)
result_nb = results(glm_all_nb)
- regression for counts: we want to model the relationship between the phenotype we care about and the counts for particular genes.
-- Poisson distribution (counts): a common assumption; mean and variance are equal.
-- Fit a regression on the log of the expected counts.
-- Negative binomial distribution: more flexible, modeling TWO parameters instead of one (under Poisson the variance is fixed once the mean is given); see the sketch below.
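A single-gene sketch of the difference, assuming edata and pdata$strain as above (glm.nb comes from MASS):
library(MASS)
y = as.numeric(edata[1, ]) # counts for one gene
glm_pois = glm(y ~ pdata$strain, family = "poisson") # assumes variance = mean
glm_nb = glm.nb(y ~ pdata$strain) # estimates an extra dispersion parameter
summary(glm_pois)$coefficients
summary(glm_nb)$coefficients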
- Inference
- Null and Alternative hypotheses
- Calculate statistics in R:
-- Standardization of distance is critical.
-- t-statistic
-- F-statistics quantify differences in model fit.
-- Adding more variables always makes the residuals smaller; the question is whether they shrink enough to say the bigger model truly fits better.
library(genefilter) # provides rowttests() and rowFtests()
tstats_obj = rowttests(edata, as.factor(pdata$strain))
hist(tstats_obj$statistic,col=2)
fstats_obj = rowFtests(edata,as.factor(pdata$lane.number))
library(limma)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_limma_adj = lmFit(edata,mod_adj)
ebayes_limma_adj = eBayes(fit_limma_adj)
head(ebayes_limma_adj$t)
-- calculate a nested comparison: todo
- Permutation: permute the labels to "break the relationship" between outcome and covariates; a sketch follows below.
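A minimal permutation sketch, reusing the rowttests objects from above:
set.seed(135)
strain_perm = sample(pdata$strain) # shuffle the labels to break any real association
tstats_perm = rowttests(edata, as.factor(strain_perm))
# the permuted statistics approximate the null distribution:
quantile(abs(tstats_perm$statistic))
quantile(abs(tstats_obj$statistic)) # heavier tails here suggest real signal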
- P-values:
-- The probability of observing a statistic that extreme if the null hypothesis is true.
-- In a permutation test: the fraction of permuted statistics more extreme than the statistic we observed!!
-- If the null hypothesis is true, the p-value is uniformly distributed between 0 and 1.
-- NOT the probability the null is true. NOT a measure of statistical evidence.
- Multiple testing
- P-values / hypothesis testing are designed for a single test; with many tests we need corrections:
-- expected number of false positives
-- Family wise error rate
-- False discovery rate: the noise among the discoveries you are willing to tolerate.
-- Bonferroni Correction
-- Benjamini-Hochberg Correction
-- Type I errors, family wise error rate, and false discovery rate DO NOT MEASURE THE SAME THING!!
library(limma)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_limma_adj = lmFit(edata,mod_adj)
ebayes_limma_adj = eBayes(fit_limma_adj)
limma_pvals = topTable(ebayes_limma_adj, number = dim(edata)[1])$P.Value
fp_bonf = p.adjust(fstats_obj$p.value, method = "bonferroni")
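Continuing from the Bonferroni line above, a sketch contrasting it with Benjamini-Hochberg:
fp_bh = p.adjust(fstats_obj$p.value, method = "BH") # controls the FDR instead of the FWER
sum(fp_bonf < 0.05) # discoveries with FWER control (conservative)
sum(fp_bh < 0.05) # usually more discoveries, tolerating some false ones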
4. Statistical summarization
1). gene set enrichment: look for groups of genes that share function or other characteristics.
- GO, MSigDB, BioGRID
library(goseq)
# perform a differential expression analysis
de = DESeqDataSetFromMatrix(expr,pdata,~grp)
de_fit = DESeq(de)
de_results = results(de_fit)
# get the differentially expressed genes after FDR correction
genes = as.integer(de_results$padj < 0.05)
not_na = !is.na(genes)
names(genes) = rownames(expr)
genes = genes[not_na]
# set up a weighting function for all the genes in the genome
pwf = nullp(genes,"hg19","ensGene")
#perform the enrichment analysis parametrically
Go.wall = goseq(pwf,"hg19","ensGene")
2). enrichment in the genome
- contingency table: cross-tabulate features by set membership and outcome, then test for over-representation (sketch below).
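A toy sketch of the contingency-table idea, with made-up counts (rows: in/out of a gene set; columns: differentially expressed or not):
tab = matrix(c(18, 82, 120, 2380), nrow = 2,
             dimnames = list(c("in set", "not in set"), c("DE", "not DE")))
fisher.test(tab) # tests whether DE genes are over-represented in the set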
3). Steps in an RNA-Seq analysis in R:
S1: Align: STAR, TopHat2, HISAT
S2a: Count: HTSeq
S2b (alternative): Assemble and quantify: StringTie, Cufflinks, Trinity
S3: Normalize: EDASeq, DESeq2/edgeR (UPenn workshop)
S4: Statistical tests: DESeq2/edgeR
S5: Gene set enrichment: goseq
4). Steps in a ChIP-Seq analysis in R:
- ChIP-Seq measures where proteins interact with DNA.
S1: Align: Bowtie2, BWA
S2: Peak detection: MACS, CisGenome
S3: Count: MACS, CisGenome, DiffBind
S4: Normalize: DiffBind
S5: Statistical test: CisGenome, MACS, DiffBind (multiple samples)
S6: Sequence motifs & annotation: CisGenome, MEME Suite, BioC Annotation Workflow
-- are the peaks commonly enriched for particular sequence motifs?
5). Steps in a DNA methylation analysis in R:
- methylation: methyl groups that bind to CpG sites in the genome.
- the idea is to identify the places where this binding occurs.
S1: Normalize: minfi
S2: Smooth: charm, bsseq
S3: Region finding: charm, bsseq
-- fit a statistical model to identify the differentially methylated regions
S4: Annotation: charm, bsseq
6). Steps in a whole-genome sequencing analysis in R:
S1: Variant identification: crlmm (SNP arrays); freeBayes, GATK (sequencing)
S2: Population stratification correction: snpStats
S3: Statistical tests: snpStats
S4: Examining local regions: PLINK, Annotating Genomic Variants Workflow
-- identify variants associated with disease
S5: Annotation: Annotating Genomic Variants Workflow
7). Expression Quantitative Trait Loci (eQTL)
- test for associations between ALL expression levels and ALL possible SNP genotypes; a toy sketch follows.
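A toy single-pair eQTL sketch with simulated data (real analyses scan all gene-SNP pairs, e.g. with the MatrixEQTL package):
set.seed(1)
snp_geno = sample(0:2, 100, replace = TRUE) # copies of the minor allele
expr_gene = rnorm(100, mean = 5 + 0.4 * snp_geno) # simulated expression with a real effect
lm_eqtl = lm(expr_gene ~ snp_geno)
summary(lm_eqtl)$coefficients # the slope is the per-allele change in expression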