Monday, January 11, 2016

NGS: simple pipeline for variant calling

The flow of data types: FASTQ -> SAM/BAM -> VCF/BCF

1. Align NGS short reads to the reference genome
1). Build the index for the reference genome:
mkdir ref
bowtie2-build /path/to/reference/genome.fa ref/ref

2). Align the reads to the reference genome:
bowtie2 -x ref/ref -U sample.fastq -S sample.bt2.sam

3). Convert the SAM alignment to BAM:
samtools view -b sample.bt2.sam > sample.bt2.bam

4). Sort and index the aligned BAM file:
samtools sort -o sample.bt2.sorted.bam sample.bt2.bam
samtools index sample.bt2.sorted.bam

5). Pile up the reads and compute genotype likelihoods with mpileup:
samtools mpileup -v -u -f /path/to/reference/genome.fa sample.bt2.sorted.bam > sample.vcf
The resulting VCF holds genotype likelihoods; variants can then be called from it (e.g., with bcftools call).

Saturday, January 9, 2016

Notes of Statistics for Genomic Data Science

These are my notes for the class Statistics for Genomic Data Science offered by Johns Hopkins on Coursera. The class covers four parts and provides a good starting point for exploring statistics and R programming in genomic science. The online resource is at http://jtleek.com/genstats_site/.

1. Exploratory analysis
1). central dogma of statistics: population --probability--> sample --inference--> population.
2). reproducible research!!
- data sharing plan: raw data, tidy data set, code book and explicit recipe.
3). three tables in genomics:
- samples
- feature data
- phenotype data
4). var(genomic measurement) = phenotypic variability + measurement error + natural biological variation.
5). power: the probability of discovering a real signal if it is there (a common target is 80%).
6). batch effects: the most common confounder.
e.g., a case where 78% of genes appeared differentially expressed due to batch effects.
7). randomization: without randomization, the confounding variable may differ among treatment groups.
8). exploratory analysis
- check for missing values in a variety of ways
- M-A plot: x-axis~sum, y-axis~difference.
9). Data transforms: on one hand, most statistical methods are designed to work better with non-skewed data; on the other hand, in reality most genomic data are skewed. One common way to address this skew is a transformation such as log2(x + 1).
10). More transforms, depending on the type of data you are using:
- variance stabilizing transform: remove a mean variance relationship among the data
- box-cox transforms: make the data approximately Normally Distributed.
- rlog transform: unique to genomics count data; this is a regularized version of the log transform seeking to minimize differences at low count levels.
11). Clustering:
library(dendextend)
(a). Hierarchical clustering
First, compute the distance matrix between samples: dist1 = dist(t(edata)) # dist() works on rows, so transpose to compare samples
Then we can use a heatmap to visualize the distance matrix.
Third, perform the hierarchical clustering and plot the dendrogram:
hclust1 = hclust(dist1)
plot(hclust1)
(b). K-means clustering
In R, kmeans() clusters the rows by default:
kmeans1 = kmeans(edata,centers=3)
We can then reorder the data by cluster and plot:
heatmap(as.matrix(edata)[order(kmeans1$cluster),],col=colramp,Colv=NA,Rowv=NA)
Note: kmeans() starts from random centers, so you may get a different answer each time you run it.
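A self-contained sketch of the same workflow; the course works on an expression matrix edata (and a color ramp colramp) loaded from Bioconductor, so the simulated matrix below is only a stand-in:

set.seed(1)
edata_sim = matrix(rpois(50 * 6, lambda = 20), nrow = 50)   # 50 "genes" x 6 "samples"
colnames(edata_sim) = paste0("sample", 1:6)

# hierarchical clustering of samples
dist1 = dist(t(edata_sim))            # dist() computes distances between rows, so transpose
hclust1 = hclust(dist1)
plot(hclust1)

# k-means clustering of genes (rows), then reorder the heatmap by cluster
kmeans1 = kmeans(edata_sim, centers = 3)
heatmap(edata_sim[order(kmeans1$cluster), ], Colv = NA, Rowv = NA)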

2. Normalization and preprocessing
1). Dimension reduction: PCA and SVD.
- SVD: find patterns that explain the variation (center the genes by row first):
edata_centered = edata - rowMeans(edata)
svd1 = svd(edata_centered)
plot(svd1$d^2/sum(svd1$d^2), ylab="percent variance explained", col=2)

A very common plot is to plot PC1 versus PC2 to see if we can see any "clusters" or "groups": 

plot(svd1$v[,1],svd1$v[,2],ylab="2ndPC",xlab="1stPC",col=as.numeric(pdata$study))

Another common plot is to make boxplots comparing the PC for different levels of known covariates.

boxplot(svd1$v[,1]~pdata$study, border=c(1,2))
points(svd1$v[,1]~jitter(as.numeric(pdata$study)),col=as.numeric(pdata$study))

- PCA: calculate variations across columns:

pc1 = prcomp(edata)
plot(pc1$rotation[,1],svd1$v[,1])

In order to get the actual PC1 and PC2, we need to center the data by column, instead of by row.

edata_centered2 = t( t(edata)-colMeans(edata) )
svd2 = svd(edata_centered2)
plot(pc1$rotation[,1],svd2$v[,1],col=2)

- We need to center the dataset, otherwise the first singular vector will mostly reflect the mean level.


- outliers, outlying genes.

- SVD/PCA are widely used to detect batch effects (refer to the lecture).
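A minimal sketch (on a made-up matrix) of the point above: prcomp() centers the columns internally, so its loadings agree (up to sign) with the right singular vectors of the column-centered data:

set.seed(1)
edata_sim = matrix(rnorm(50 * 6), nrow = 50)

# SVD of the column-centered matrix
edata_centered2 = t(t(edata_sim) - colMeans(edata_sim))
svd2 = svd(edata_centered2)

# prcomp() does the column centering for you
pc = prcomp(edata_sim)

plot(pc$rotation[, 1], svd2$v[, 1])               # points fall on a line through the origin
plot(svd2$d^2 / sum(svd2$d^2), ylab = "percent variance explained")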


2). Quantile Normalization

- preprocessing: try to remove technological artifacts.
- normalization: remove technological biases, make the samples comparable.
- quantile normalization: the most common technique; it forces the distributions to be the same across samples. It removes bulk technical differences, but should not be used when the global differences across groups are of biological interest.

library(preprocessCore)

norm_edata = normalize.quantiles(as.matrix(edata))

- Preprocessing and normalization are highly application specific: Affymetrix (affy), Illumina (lumi), gene count based models (Rsubread), peaks (DiffBind).
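A small sketch of quantile normalization with preprocessCore (the matrix below is simulated just to show the effect; the course applies it to a real expression set):

library(preprocessCore)

set.seed(1)
edata_sim = cbind(rpois(1000, 10), rpois(1000, 20), rpois(1000, 40))   # samples with different distributions

log_edata  = log2(edata_sim + 1)
norm_edata = normalize.quantiles(log_edata)

apply(norm_edata, 2, quantile)   # after normalization, the per-sample quantiles agree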


3).  Linear Models
- linear models with categorical covariates
-- additive model
-- dominant model
-- recessive model
-- two degrees of freedom model
- adjust for covariates
-- equivalent to fitting multiple lines
-- interpretation of interactions changes
- many regressions simultaneously: genomics measures many variables, and one of the goals is to find out which variables relate to the outcome.


library(limma)
library(edge)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_adj = lm.fit(mod_adj,t(edata))

fit_limma = lmFit(edata,mod_adj)

edge_study = build_study(data=edata,grp=pdata$strain,adj.var=as.factor(pdata$lane.number))
fit_edge = fit_models(edge_study)
summary(fit_edge)

-- linear models for microarray data: http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf
- batch effects and confounders
-- If each biological group is run in its own batch, we cannot distinguish the biological effect from the batch effect.
- In general we'd like the residuals to look symmetrically (e.g., approximately normally) distributed. Therefore, data transforms are often applied before regression.
- Also, we need to be careful when two variables in the regression model are highly correlated.

library(sva)
library(snpStats)
mod = model.matrix(~as.factor(cancer) + as.factor(batch), data=pheno)
fit = lm.fit(mod,t(edata))

- ComBat: adjust the data directly for known batches (a minimal sketch follows).
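A sketch of a typical ComBat call from the sva package, reusing the edata and pheno objects (with cancer and batch columns) from the code above; the variable names here are just for the sketch:

library(sva)
modcombat = model.matrix(~1, data=pheno)                 # adjustment variables only, no batch
combat_edata = ComBat(dat=edata, batch=pheno$batch, mod=modcombat,
                      par.prior=TRUE, prior.plots=FALSE) # returns a batch-adjusted matrix
fit_combat = lm.fit(model.matrix(~as.factor(pheno$cancer)), t(combat_edata))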

- surrogate variable analysis
-- estimate batch effects from data itself.
mod = model.matrix(~cancer,data=pheno)
mod0 = model.matrix(~1,data=pheno)
sva1 = sva(edata,mod,mod0,n.sv=2)#number of significant surrogate variables

modsv = cbind(mod,sva1$sv) #add surrogate variables to the model matrix
fit_sv = lm.fit(modsv,t(edata))


3. Statistical modeling
- false positives: signals that "come up just because you are doing so much data sifting when you look at the data."
- logistic regression: log odds
-- The coefficient is the change in log-odds for a one unit change in the number of copies of the minor allele.
library(broom)
snp1 = as.numeric(snpdata[,1]) # homozygous major allele is coded 1
snp1[snp1==0]=NA
glm1 = glm(status ~ snp1,family="binomial")
tidy(glm1)

snp1_dom = (snp1==1)

glm_all = snp.rhs.tests(status ~ 1, snp.data=sub.10)
slotNames(glm_all)
qq.chisq(chi.squared(glm_all),df=1)

library(DESeq2)
de = DESeqDataSetFromMatrix(edata,pdata,~strain)
glm_all_nb = DESeq(de)
result_nb = results(glm_all_nb)

- regression for counts: we want to model the relationship between the phenotype we care about and the counts for particular genes (see the sketch below).
-- Poisson distribution (counts): a common assumption; the mean and variance are the same.
-- Fit a regression on the log of the expected counts.
-- Negative binomial distribution: more flexible, modeling two parameters instead of one (for the Poisson, the variance is fixed once the mean is given).
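A small sketch of the two count models on made-up data for a single gene (glm for Poisson, glm.nb from MASS for negative binomial):

library(MASS)

set.seed(1)
group  = rep(c(0, 1), each = 10)
counts = rnbinom(20, mu = exp(3 + 0.5 * group), size = 2)   # overdispersed counts

glm_pois = glm(counts ~ group, family = "poisson")   # assumes variance = mean
glm_nb   = glm.nb(counts ~ group)                    # estimates an extra dispersion parameter

summary(glm_pois)$coefficients
summary(glm_nb)$coefficients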
- Inference
- Null and Alternative hypotheses
- Calculate statistics in R:
-- Standardization of distance is critical.
-- t-statistic
-- F-statistics quantifies differences in model fit.
-- Adding more variables always makes the residuals smaller; the question is whether the residuals go down enough to say the bigger model fits better.
library(genefilter)
tstats_obj = rowttests(edata,pdata$strain)
hist(tstats_obj$statistic,col=2)

fstats_obj = rowFtests(edata,as.factor(pdata$lane.number))

library(limma)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_limma_adj = lmFit(edata,mod_adj)
ebayes_limma_adj = eBayes(fit_limma_adj)
head(ebayes_limma_adj$t)

-- calculate a nested comparison: todo
- Permutation: permute labels to "break relationship"
- P-values:
-- The probability of observing a statistic that extreme if the null hypothesis is true.
-- Equivalently, the fraction of permuted statistics more extreme than the statistic we observed (see the permutation sketch below)!!
-- If the null hypothesis is true, the p-value is uniformly distributed between 0 and 1.
-- NOT: the probability the null is true. NOT: a measure of statistical evidence.
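A sketch of the permutation idea for a single made-up gene: permute the group labels to break any relationship, recompute the statistic, and compare with the observed one:

set.seed(1)
expr  = rnorm(20, mean = c(rep(0, 10), rep(1, 10)))   # one gene, two groups
group = factor(rep(c("A", "B"), each = 10))

obs_stat = t.test(expr ~ group)$statistic

perm_stats = replicate(1000, {
  permuted = sample(group)                  # break the relationship between label and expression
  t.test(expr ~ permuted)$statistic
})

mean(abs(perm_stats) >= abs(obs_stat))      # permutation p-value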
- Multiple testing
- P-values / hypothesis testing are designed for a single test.
-- expected number of false positives
-- Family wise error rate
-- False discovery rate: the amount of noise among the discoveries you are willing to tolerate.
-- Bonferroni Correction
-- Benjamini-Hochberg Correction
-- Type I errors, family wise error rate, and false discovery rate DO NOT MEASURE THE SAME THING!!

library(limma)
mod_adj = model.matrix(~ pdata$strain + as.factor(pdata$lane.number))
fit_limma_adj = lmFit(edata,mod_adj)
ebayes_limma_adj = eBayes(fit_limma_adj)
limma_pvals = topTable(ebayes_limma_adj, number=dim(edata)[1])$P.Value

fp_bonf = p.adjust(fstats_obj$p.value,method="bonferroni")
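For comparison, the Benjamini-Hochberg (FDR) adjustment is the same p.adjust() call with a different method; counting the significant tests under each criterion shows how different the two error rates are:

fp_bh = p.adjust(fstats_obj$p.value, method="BH")
sum(fp_bonf < 0.05)   # significant while controlling the family wise error rate at 0.05
sum(fp_bh < 0.05)     # significant while controlling the false discovery rate at 0.05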


4. Statistical summarization
1). gene set enrichment: look for groups of genes that share function or other characteristics.
- GO, MSigDB, BioGRID
library(goseq)
# perform a differential expression analysis
de = DESeqDataSetFromMatrix(expr,pdata,~grp)
de_fit = DESeq(de)
de_results = results(de_fit)
# get the differentially expressed genes after FDR correction
genes = as.integer(de_results$padj < 0.05)
not_na = !is.na(genes)
names(genes) = rownames(expr)
genes = genes[not_na]
# set up a weighting function for all the genes in the genome
pwf = nullp(genes,"hg19","ensGene")
#perform the enrichment analysis parametrically
Go.wall = goseq(pwf,"hg19","ensGene")
2). enrichment in the genome
- contingency table
3). Steps in an RNA-Seq analysis in R:
S1: Align: STAR, TopHat2, HISAT
S2: Count: HTSeq
S2 (alternative): Assemble and quantify: StringTie, Cufflinks, Trinity
S3: Normalize: EDASeq, DESeq2/edgeR (UPenn workshop)
S4: Statistical tests: DESeq2/edgeR
S5: Gene set enrichment: goseq
4). Steps in a ChIP-Seq analysis in R:
- measures the way in which proteins interact with DNA.
S1: Align: bowtie2, BWA
S2: Peak detection: MACS, CisGenome
S3: Count: MACS, CisGenome, DiffBind
S4: Normalize: DiffBind
S5: Statistical test: CisGenome, MACS, DiffBind (multiple samples)
S6: Sequence motifs & Annotation: CisGenome, meme-suite, BioC Annotation Workflow
-- are the peaks commonly enriched for particular sequence motifs?
5). Steps in a DNA methylation analysis in R:
- methylation: methyl groups bind to CpG sites of the genome.
- the idea is to identify those places where the binding occurs.
S1: Normalize: minfi
S2: Smooth: charm, bsseq
S3: Region finding: charm, bsseq
-- need to fit a statistical model to identify those regions
S4: Annotation: charm, bsseq
6). Steps in Whole Genome Sequencing in R:
S1: Variant identification: crlmm (SNP arrays); freeBayes, GATK (sequencing).
S2: Population stratification correction: snpStats
S3: Statistical tests: snpStats
S4: Examining local region: PLINK, Annotating Genomic Variants Workflow
-- identify variants associated with disease
S5: Annotation: Annotating Genomic Variants Workflow
7). Expression Quantitative Trait Loci (eQTL)
- test for association between ALL expression levels and ALL possible SNPs (a minimal sketch follows).
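At its simplest, each (gene, SNP) pair is tested with a linear model of expression on genotype; packages such as MatrixEQTL do this at scale. A toy sketch on simulated data:

set.seed(1)
n_samples = 50
snp  = sample(0:2, n_samples, replace = TRUE)        # genotype: copies of the minor allele
expr = matrix(rnorm(10 * n_samples), nrow = 10)      # 10 genes x 50 samples

# regress every gene's expression on the genotype and collect the p-values
eqtl_pvals = apply(expr, 1, function(gene)
  summary(lm(gene ~ snp))$coefficients["snp", "Pr(>|t|)"])
eqtl_pvals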

Friday, January 8, 2016

Notes of Bioinformatics V

Notes of Genomic Data Science and Clustering (Bioinformatics V)

Chapter 9: How Did Yeast Become a Wine Maker? (Clustering Algorithms)

In this chapter, we learned clustering algorithms used in bioinformatics.

Some concepts:
gene expression analysis.
Good clustering Problem.

k-Means Clustering Problem: optimization problem
- Farthest First Traversal Problem
- Squared error distortion
- Lloyd Algorithm for k-means clustering: alternates two steps, centers to clusters and clusters to centers (see the sketch after this list).
- limitation: it makes a "hard" assignment of each point to only one cluster.
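A minimal R sketch of the two Lloyd steps on simulated 2-D points (my Python implementations are on the GitHub repo linked at the end of these notes; this is only an illustration):

set.seed(1)
pts = rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 4), ncol = 2))
k = 2
centers = pts[sample(nrow(pts), k), ]     # arbitrary initial centers

for (iter in 1:20) {
  # centers to clusters: assign each point to its nearest center
  d = as.matrix(dist(rbind(centers, pts)))[-(1:k), 1:k]
  cluster = apply(d, 1, which.min)
  # clusters to centers: move each center to the center of gravity of its points
  # (no handling of empty clusters; this is just a sketch)
  centers = t(sapply(1:k, function(i) colMeans(pts[cluster == i, , drop = FALSE])))
}
centers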

Soft k-means clustering:
- Expectation Maximization Algorithm: starts with a random choice of Parameters. It then alternates between the E-step, in which we compute a responsibility matrix HiddenMatrix for Data given Parameters, and the M-step, in which we re-estimate Parameters using HiddenMatrix.
- Centers to Soft Clusters (E-step) AND Soft Clusters to Centers (M-step)
- SoftKMeans Problem

Hierarchical Clustering
- how to use distance matrix to partition genes into clusters
- UPGMA in disguise

The Python implementations of all the algorithms are available on my github: https://github.com/aprilchunyuzhao/BioinformaticsFromCoursera.

Notes of Bioinformatics IV



My NOTES of  Bioinformatics IV: Molecular Evolution

Chapter 7: Which Animal Gave Us SARS? (Evolutionary Tree Reconstruction)

In this chapter, we learned the algorithms of constructing evolutionary trees.

Distance-based Phylogeny Construction
1. distance matrix
- symmetric; non-negative; triangle inequality.
- non-additive v.s. additive.
2. Tree
- limb: an edge connecting a leaf to its parent.
- Distances Between Leaves Problem
- maximal non-branching tree
- Limb Length Problem
3. Algorithms
3.1 Additive Phylogeny Problem
3.2 Least Squares for Approximate Phylogenies
3.3 UPGMA: unweighted pair group method with arithmetic mean
- a clustering heuristic (equivalent to average-linkage hierarchical clustering; see the sketch after this list)
- yet the smallest element in the distance matrix does not necessarily correspond to a pair of neighboring leaves.
3.4 Neighbor-Joining 
- neighbor-joining matrix instead of distance matrix
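Because UPGMA is average-linkage hierarchical clustering on the distance matrix, hclust() gives a quick way to see its behavior (the 4-leaf distance matrix below is made up for illustration):

D = matrix(c( 0, 13, 21, 22,
             13,  0, 12, 13,
             21, 12,  0, 13,
             22, 13, 13,  0), nrow = 4,
           dimnames = list(c("i","j","k","l"), c("i","j","k","l")))

upgma_tree = hclust(as.dist(D), method = "average")   # UPGMA = average linkage
plot(upgma_tree)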


Character-based Phylogeny Construction
1. character table
2. Small Parsimony Problem
- parsimony score
- dynamic programming algorithm
- SmallParsimony does not help us if we do not know the evolutionary tree in advance.
3. Large Parsimony Problem
- nearest neighbor interchange



Chapter 8: Was T.rex Just a Big Chicken? (Computational Proteomics)
1. peptide vector
2. spectral vector
3. spectrum graph: a node for each element of Spectrum, and a directed edge between two nodes whenever their difference equals the mass of an amino acid, labeled by that amino acid.
4. Score peptide against spectrum: dot product.
5. Peptide Identification Problem
6. spectral dictionary
7. PSM dictionary
8. Spectral Alignment Problem: todo




    Wednesday, January 6, 2016

    Notes of Bioinformatics III

    My NOTES of  Bioinformatics III: Comparing Genes, Proteins, and Genomes

    Chapter 5: How Do We Compare Biological Sequences? (Dynamic Programming)


    For my PhD research, I developed a novel pairwise protein structure alignment algorithm (UniAlign: http://sacan.biomed.drexel.edu/unialign/). Part of it computes the sequence similarity between two protein sequences by glocal alignment, so I am quite familiar with dynamic programming.

    What is new to me in this chapter is that it applies dynamic programming to the problem of finding the longest common subsequence. Finding a longest common subsequence of two strings is equivalent to finding an alignment of these strings that maximizes the number of matches.

    The matches in an alignment of two strings define a common subsequence of the two strings, or a sequence of symbols appearing in the same order (yet not necessarily consecutively) in both strings.


    An introduction to Dynamic Programming: the Change problem
    1. The trick behind dynamic programming is to solve the smaller problems once rather than billions of times.
    2. To find a longest path in an arbitrary DAG, first we need to order the nodes of the DAG so that every node falls after all its predecessors



    Computational Problems:
    1. Manhattan Tourist Problem
    2. Output LCS Problem
    3. Longest Path in a DAG Problem
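    A minimal dynamic-programming sketch for the Output LCS Problem (length only), written in R just for illustration; my course solutions are in Python on GitHub:

    lcs_length = function(v, w) {
      a = strsplit(v, "")[[1]]; b = strsplit(w, "")[[1]]
      s = matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
      for (i in seq_along(a)) {
        for (j in seq_along(b)) {
          s[i + 1, j + 1] = max(s[i, j + 1],                  # skip a character of v
                                s[i + 1, j],                  # skip a character of w
                                s[i, j] + (a[i] == b[j]))     # +1 for a match
        }
      }
      s[length(a) + 1, length(b) + 1]
    }
    lcs_length("AACCTTGG", "ACACTGTGA")   # the longest common subsequence has length 6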

    Some Concepts:
    1. scoring matrices
    2. global alignment
    3. local alignment
    - when computing s(i,j), add zero-weight edges from (0,0) to every node, which means that the source node (0,0) is a predecessor of every node (i,j) (a free taxi ride).
    4. alignment with affine gap penalties problem



    The Changing Faces of Sequence Alignment:
    (1). Edit Distance: the minimum number of edit operations(insertion, deletion or substitution) needed to transform one string into another.
    Hints: do a global alignment with mismatch penalty = -1 and gaps = -1, match = 0.
    (2). Fitting Alignment: find a region within the longer protein sequence v that has high similarity with all of the shorter sequence w. In other words, "fitting" w to v requires finding a substring v' of v that maximizes the global alignment score between v' and w among all substrings of v.
    Hints: "free taxi rides" only for the second string.
    (3). Overlap Alignment: refers to the global alignment of a suffix of v with a prefix of w
    Hints: "free taxi rides" only for the first string.
    These three applications of sequence alignment using dynamic programming are very nice!!

    Affine Gap Penalties:
    - Build Manhattan on three levels: lower, middle and upper.

    Space-Efficient Sequence Alignment:
    1. The memory required to store the dynamic programming matrix is substantial O(n*m).
    2. divide-and-conquer algorithms: divide phase splits a problem instance into smaller instances and solves them; conquer phase stitches the smaller solutions into a solution to the original problem. 
    3. The idea of space reduction techniques is that we don't need to store any backtracking pointers if we are willing to spend a little more time.
    4. The Middle Node Problem!! 
    - In general, at each new step before the final step, we double the number of middle nodes found while halving the runtime required to find middle nodes.
    5. The Middle Edge Problem
    - middle edge: an edge in an optimal alignment path starting at the middle node (more than one middle edge may exist for a given middle node).


    Multiple Sequences Alignment
    - todo


    Chapter 6: Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)

    Biology:
    reversal: the most common form of genome rearrangement.
    Random breakage model of chromosome evolution: the breakage points of rearrangements are selected randomly.
    - it predicts an exponential distribution of synteny block lengths.

    Concept:
    reversal distance: the minimum number of reversals required to transform P into Q.
    identity permutation.
    Breakpoint Theorem.

    Rearrange in multi-chromosomal genomes
    - reversals and translocations, as well as fusions and fissions.
    - genome graph
    - breakpoint graph
    - Cycle Theorem: given genome P and Q, any 2-break applied to  P can increase Cycles(P,Q) by at most 1.
    - 2-Break Distance Theorem: the 2-break distance between P and Q is equal to Blocks(P,Q) - Cycles(P,Q).

    Fragile Breakage Model: every mammalian genome is a mosaic of long solid regions, which are rarely affected by rearrangements, as well as short fragile regions that serve as rearrangement hotspots and that account only for a small fraction of the genome. For humans and mice, these fragile regions make up approximately 3% of the genome.

    Synteny Block Construction:
    - construct synteny blocks from shared k-mers
    - construct synteny blocks as connected components in graph


    Computational Problem:
    1. Greedy Sorting by Reversal Problem
    2. 2-Break Distance Problem
    3. 2-Break Sorting Problem
    4. Finding Shared K-mers Problem

    All the python codes are available on my github: https://github.com/aprilchunyuzhao/BioinformaticsFromCoursera  

    Notes of Bioinformatics II

    My NOTES of  Bioinformatics II: Genome Sequencing offered by UCSD on Coursera.

    Chapter 3: How Do We Assemble Genomes? (Graph Algorithms)

    Difficulty: 
    1. double stranded DNA: no way of knowing which strand a given read derives from
    2. sequencing errors
    3. some regions of the genome may not be covered by any reads
    4. Repeats complicate genome assembly: approximately 50% of human genome is made of repeats, e.g., the 300 nucleotide-long Alu sequence is repeated over a million times.


    Some Concepts:
    reads: the sequenced, much shorter DNA fragments.
    genome assembly: put a genome back together from its reads using overlapping information!
    overlap graph: form a node for each k-mer in Patterns and connect k-mers Pattern and Pattern' by a directed edge if Suffix(Pattern) is equal to Prefix(Pattern').
    hamiltonian path: a path in a graph that visits every node exactly once.
    k-universal string: contains every binary k-mer exactly once.
    de Bruijn graph: assign each k-mer in Patterns to an edge, with the prefix and suffix of each edge as its nodes, then glue identical nodes together (see the sketch after this list).
    eulerian path: visits every edge exactly once.
    NP-complete: can be verified in polynomial time, yet no fast solution is known (non-deterministic polynomial time). All NP-complete problems are equivalent to each other.
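    A tiny sketch of building the de Bruijn graph from a set of k-mers as an adjacency list in R (identical (k-1)-mer nodes are glued simply by using them as names):

    debruijn_from_kmers = function(patterns) {
      prefixes = substr(patterns, 1, nchar(patterns) - 1)
      suffixes = substr(patterns, 2, nchar(patterns))
      split(suffixes, prefixes)    # each prefix node maps to the suffix nodes of its k-mers
    }
    debruijn_from_kmers(c("GAGG", "CAGG", "GGGG", "GGGA", "CAGG", "AGGG", "GGAG"))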


    Computational Problems:
    (1) String Composition Problem
    (2) String Spelled by a Genome Path Problem
    (3) Overlap Graph Problem
    (4) de Bruijn Graph Problem
    (5) de Bruijn Graph From k-mers Problem


    Concepts:
    Eulerian Cycle: a cycle that traverses each edge of a graph exactly once.
    Euler's Theorem: every balanced, strongly connected directed graph is Eulerian.
    read pairs: pairs of reads separated by a fixed distance d in the genome.
    Paired de Bruijn Graph:

    Genome Assembly Faces Real Sequencing Data:
    1. Breaking reads into shorter k-mers: yet a smaller k may result in a more tangled de Bruijn graph.
    2. Splitting the genome into contigs (long, contiguous fragments of the genome)
    - maximal non-branching path: a non-branching path that cannot be extended into a longer non-branching path.


    Computational Problems:
    (1) Eulerian Cycle Problem
    - STACK!!
    (2) Eulerian Path Problem
    (3) String Reconstruction Problem: reduced to finding an Eulerian Path in the de Bruijn graph generated from the reads!
    (4) k-Universal Circular String Problem: reconstruct a circular string given its k-mer composition.
    (5) String Reconstruction from Read-Pairs Problem
    (6) Contig Generation Problem



    Chapter 4: How Do We Sequence Antibiotics? (Brute Force Algorithms)

    A difficult problem in antibiotics research is that of sequencing newly discovered antibiotics, or determining the order of amino acids making up the antibiotic peptide.

    Necessary Biology:
    1. There are three different ways to divide a DNA string into codons for translation, one starting at each of the first three starting positions of the string (reading frames).
    2. Protein translation is carried out by a molecular machine called ribosome.

    3. non-ribosomal peptides (NRPs): synthesized not by the ribosome, but by a giant protein called NRP synthetase.
    NRPs have pharmaceutical applications because they have been optimized by eons of evolution as "molecular bullets" that bacteria and fungi use to kill their enemies. If these enemies happen to be pathogens, then researchers are eager to borrow these bullets as antibacterial drugs.

    4. mass spectrometer: an expensive molecular scale that shatters molecules into pieces and then weighs the resulting fragments, in daltons (Da); 1 Da is approximately equal to the mass of a single nuclear particle.
    The mass spectrometer can break each molecule of Tyrocidine B1 into two linear fragments, and it analyzes samples that may contain billions of identical copies of the peptide, with each copy breaking in its own way.
    Our goal is to use all of these different fragments to sequence the peptide.

    5. subpeptides: for a cyclic peptide with n amino acids, there are n * (n-1) subpeptides.
    6. Cyclospectrum(Peptide): theoretical spectrum; the collection of all masses of its subpeptides, as well as mass 0 and mass of the entire peptide.
    Assumption: for a given linear peptide Peptide, the mass of any subpeptide is equal to the difference between the masses of two prefixes of Peptide!!
    7. For a given cyclic peptide: its theoretical spectrum consists of the masses found by LinearSpectrum plus those corresponding to subpeptides wrapping around the end of Peptide (a small sketch follows this list).
    Again, our goal is to reconstruct unknown peptide from its experimental spectrum.
    8. A branch-and-bound algorithm for cyclopeptide sequencing: "grow" candidate linear peptides whose theoretical spectra are "consistent" with the experimental spectrum.
    - branching step: to increase the number of candidate solutions
    - bounding step: to remove hopeless candidates
    9. Real Spectra: mass spectrometers generate "noisy" spectra with false masses and missing masses. THUS, in our algorithm design, instead of requiring an exact match, we need a scoring function that measures how well a candidate peptide's theoretical spectrum matches the given experimental spectrum.
    10. Replace Peptides with Leaderboard: keep the N highest scoring candidates for further extension (including ties).
    11. Since there are more than 100 amino acids found in non-ribosomal peptides, we MUST determine the amino acid composition of a peptide from its spectrum so that we may run LeaderboardCyclopeptideSequencing on this smaller alphabet of amino acids.
    12. spectral convolution: take the positive differences of masses in the spectrum.
    13. Epilogue: from simulated to real spectra!
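    A small R sketch of the prefix-mass idea for the theoretical spectrum of a cyclic peptide, using integer masses for just the four amino acids of the example peptide NQEL:

    aa_mass = c(N = 114, Q = 128, E = 129, L = 113)   # integer masses for this example only

    cyclospectrum = function(peptide) {
      aas = strsplit(peptide, "")[[1]]
      n = length(aas)
      prefix_mass = c(0, cumsum(aa_mass[aas]))        # masses of all prefixes
      total = prefix_mass[n + 1]
      spectrum = 0
      for (i in 1:n) {
        for (j in i:n) {
          sub_mass = prefix_mass[j + 1] - prefix_mass[i]   # subpeptide mass = difference of two prefix masses
          spectrum = c(spectrum, sub_mass)
          if (i > 1 && j < n)                              # subpeptides wrapping around the end
            spectrum = c(spectrum, total - sub_mass)
        }
      }
      sort(unname(spectrum))
    }
    cyclospectrum("NQEL")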


    Computational Problems:
    (1) Protein Translation Problem:
    (2) Generating Theoretical Spectrum Problem: LinearSpectrum and CyclicSpectrum
    (3) Counting Peptides with Given Mass Problem: dynamic programming algorithm
    (4) Cyclopeptide Sequencing Problem: still not efficient enough

    Tuesday, January 5, 2016

    Notes of Bioinformatics I

    My NOTES of  Bioinformatics I: Find Hidden Messages in DNA offered by UCSD on Coursera.

    Chapter 1: Where in the Genome Does DNA Replication Begin?

    In this chapter, we learned how to find the replication region (oriC) in bacterial genomes (mostly single circular chromosome).

    Biology:
    1. Feature of oriC of bacterial genome: typically a few hundred nucleotides long.
    2. The initiation of DNA replication is mediated by DnaA, a protein that binds to a short segment within the oriC region called the DnaA box. The DnaA box is the 'hidden message' that we are trying to find in the DNA sequence.
    The computational translation of this problem is: find the most frequent k-mer in the given oriC sequence (the one that maximizes Count(text, pattern) among all k-mers).

    Related Computational tasks:
    1. PatternCount(text, pattern): count the occurrence of pattern in text
    2. FrequentWord(text, k): find all the most frequent k-mers in text

    Charging Station:
    The running time of the brute force implementation of FrequentWord problem is O(|Text|^2). To improve the algorithm, the data structure frequency array was introduced.
    1. Frequency Array: given an integer k, the length of the frequency array is 4^k (there are 4^k different k-mers, ordered lexicographically), and the i-th element of the array holds the number of occurrences of the i-th k-mer (in lexicographic order) in the text. In other words, for each k-mer we need to compute a unique integer to use as its index into the frequency array.
    Considering the DNA sequence, we need a function SymbolToNumber(symbol) to transform symbol A,C,G,T to 0,1,2,3.
    • PatternToNumber(pattern): transform a k-mer pattern into an integer (recursion; sketched in R after this list)


    PatternToNumber(pattern) = 4 * PatternToNumber(pattern[:-1]) + SymbolToNumber(pattern[-1])
    • NumberToPattern(index,k): transform an integer between 0 and 4^k-1 into a k-mer
    • FasterFrequentWords(text, k): running time O(4^k + |text|*k + 4^k) => still impractical when k is large.
    2. FindingFrequentWordsBySorting(text,k): sorting can bring the same things together!
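    A direct R transcription of the PatternToNumber recursion above (only an illustration; my course solutions are in Python):

    SymbolToNumber = function(symbol) c(A = 0, C = 1, G = 2, T = 3)[[symbol]]

    PatternToNumber = function(pattern) {
      if (nchar(pattern) == 0) return(0)
      prefix = substr(pattern, 1, nchar(pattern) - 1)
      last   = substr(pattern, nchar(pattern), nchar(pattern))
      4 * PatternToNumber(prefix) + SymbolToNumber(last)
    }
    PatternToNumber("AGT")   # 4*(4*0 + 2) + 3 = 11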



    After learning how to find the 'hidden messages' in DNA sequences, we continue our journey based on the (statistically interesting) observation that some hidden messages are more surprising than others.
    First some biological concepts: 
    • each DNA strand has a direction: read in the 5' -> 3' direction
    • DnaA protein does not care which of the two strands it binds to when the DnaA protein binds to DnaA boxes and initiates the replication.
    Followed by some statistical concepts:
    • overlapping words paradox: different k-mers have different probabilities of occurring multiple times as a substring of a random string.
    • Pr(N,A,Pattern,t) is a complex problem because this probability depends heavily on the particular choice of Pattern.
    Before we conclude that we have found the DnaA box, we need to check whether there are other short regions in the bacterial genome exhibiting frequent occurrence of that k-mer.
    clumps: appear close to each other in a small region of the genome; we defined a k-mer as a "clump" if it appears many times within a short interval of the genome; (L, t) - clump
    Related Computational tasks:
    3. ClumpFinding(Genome, k, t, L)


    Furthermore, we continued to learn a new algorithm to find the oriC based on some features of DNA replication biology.

    Important Biology:
    1. DNA polymerase does not wait for the two parent strands to completely separate before initiating replication; instead, it starts copying while the strands are unraveling.
    2. To start replications, a DNA polymerase needs a primer, a short complementary segment that binds to the parent strand and jump starts the DNA polymerase.
    3. DNA polymerase is unidirectional: only traverse a template strand of DNA in the 3' -> 5' direction, which is opposite from the 5' -> 3' direction of DNA.
    4. Asymmetry of Replication: replication on reverse half-strand progresses continuously; yet a DNA polymerase on a forward half-strand has no choice but to wait again until the replication fork has opened another 2000 nucleotides or so. -> Okazaki fragments from multiple primers
    5. Deamination: cytosine (C) has a tendency to mutate into thymine (T) through a process called deamination. Deamination rates increase 100-fold when DNA is single-stranded, which leads to a decrease in cytosine (C) on the forward half-strand.

    THEREFORE, we can use the skew diagram #G-#C to find the oriC <- if this difference starts increasing, then we guess that we are on the forward half-strand; if this difference starts decreasing, then we guess we are on the reverse half-strand.
    The skew should achieve a minimum at the position where the reverse half-strand ends and the forward half-strand begins, which is exactly the location of oriC!
    Related Computational tasks:
    4. Minimum Skew Problem
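    A minimal R sketch of the skew diagram (#G - #C over prefixes) and the Minimum Skew Problem on a toy sequence:

    genome = "CATGGGCATCGGCCATACGCC"
    bases  = strsplit(genome, "")[[1]]
    skew   = c(0, cumsum((bases == "G") - (bases == "C")))   # skew_i = #G - #C in the first i nucleotides
    plot(skew, type = "l")
    which(skew == min(skew)) - 1    # 0-based positions where the skew is minimized (candidate oriC)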


    The LAST computational task of this chapter is Approximate Pattern Matching Problem! And the following are some thoughts:
    First, we define a d-neighborhood Neighbors(Pattern, d) of a k-mer Pattern, as the set of all k-mers that are close to Pattern. (Recursive)
    Then, for each neighbor only, we update the frequency array.
    The approximate pattern matching by sorting and the pattern matching by sorting employ the same idea: sorting brings identical things together, so we can count appearances by comparing neighboring elements. Remember this!



    Chapter 2: Which DNA Patterns Play the Role of Molecular Clocks? (Randomized Algorithms)
    In Chapter 1, we learned about one type of hidden message in genomes; however, there are other types of hidden messages that have nothing to do with genome replication, for example regulatory DNA motifs responsible for gene expression, which we will learn about in Chapter 2.
    BIOLOGY:
    1. A transcription factor regulates a gene by binding to a specific short DNA interval called a regulatory motif, or transcription factor binding site, in the gene's upstream region, a 600-1000 nucleotide-long region preceding the start of the gene.
    2. Regulatory motifs are not completely conserved, meaning they can vary at some positions for different genes.
    THUS, comes our new COMPUTATIONAL task: develop algorithms for motif finding: discover the "hidden message" shared by a collection of strings.

    The DIFFERENCE between the Frequent Words Problem and the Motif Finding Problem is that a DnaA box is a pattern that clumps, or appears frequently, within a relatively short interval of the genome, WHILE a regulatory motif is a pattern that appears at least once (with variation) in each of many different regions that are dispersed throughout the genome.

    (k,d) motif: a k-mer appears in every string from Dna with at most d mismatches.
    Observation: any (k, d) motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna.
    (1) A brute force algorithm for motif finding.
    Limitation: if even a single sequence does not contain the transcription factor binding site, a (k,d) motif does not exist!

    Another model: select a k-mer from each string and score how similar these k-mers are to each other (motif matrix).
    - define the Score(Motifs)
    - minimize this score
    Motifs -> Score -> Count -> Profile -> Consensus String
    - Entropy: the entropy of a completely conserved column is 0, and that of a column with equally likely nucleotides is 2.
    - Motif logo
    d(Pattern, Dna): the sum of distances between Pattern and all strings in Dna
    (2) Median String Problem
    Limitation: since the median string problem has to consider all 4^k k-mers, it is too slow for cases like k = 15.
    (3) Greedy Motif Search
    Observation: a k-mer has a higher probability when it is more similar to the consensus string of a profile (a small sketch of this profile step appears at the end of these notes).
    Improvement: pseudocounts
    (4) Randomized Motif Search
    - Monte Carlo Algorithm: not guaranteed to return exact solutions, but they do quickly find approximate solutions.
    - Las Vegas Algorithm
    - continue to iterate for as long as the constructed motifs keep improving
    Advantage: Randomized Motif Search can find longer motifs.
    (5) Gibbs Sampling
    Limitation: local optimum
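    A small R sketch of the profile step shared by the greedy and randomized searches: build a profile matrix with pseudocounts from a set of motifs, then find the profile-most-probable k-mer in a string (toy data; my course solutions are in Python):

    make_profile = function(motifs) {
      m = do.call(rbind, strsplit(motifs, ""))
      profile = sapply(1:ncol(m), function(j) {
        counts = table(factor(m[, j], levels = c("A", "C", "G", "T")))
        (counts + 1) / (nrow(m) + 4)               # Laplace pseudocounts
      })
      rownames(profile) = c("A", "C", "G", "T")
      profile
    }

    kmer_prob = function(kmer, profile) {
      nts = strsplit(kmer, "")[[1]]
      prod(sapply(seq_along(nts), function(i) profile[nts[i], i]))
    }

    most_probable_kmer = function(text, k, profile) {
      kmers = sapply(1:(nchar(text) - k + 1), function(i) substr(text, i, i + k - 1))
      kmers[which.max(sapply(kmers, kmer_prob, profile = profile))]
    }

    profile = make_profile(c("ACGT", "ACGA", "ACGT"))
    most_probable_kmer("TTACGTAGGA", 4, profile)   # recovers "ACGT"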