Tuesday, January 5, 2016

Notes of Bioinformatics I

My NOTES of  Bioinformatics I: Find Hidden Messages in DNA offered by UCSD on Coursera.

Chapter 1: Where in the Genome Does DNA Replication Begin?

In this chapter, we learned how to find the replication region (oriC) in bacterial genomes (mostly single circular chromosome).

Biology:
1. Feature of oriC of bacterial genome: typically a few hundred nucleotides long.
2. The initiation of DNA replication is mediated by DnaA, which is a protein binds to a short segment within the oriC region called DnaA box. And the DnaA box is the 'hidden messages' that we are trying to look for from the DNA sequence
The computational translation for this problem is: find the most frequent k-mer in the given oriC sequence ( if it maximizes Count(text, pattern) among all k-mers).

Related Computational tasks:
1. PatternCount(text, pattern): count the occurrence of pattern in text
2. FrequentWord(text, k): find the all most frequent k-mers in text

Charging Station:
The running time of the brute force implementation of FrequentWord problem is O(|Text|^2). To improve the algorithm, the data structure frequency array was introduced.
1. Frequency Array: given a integer k, the length of the frequency array is 4^k (since 4^k different integers for all 4^k k-mers lexicographically), and the i-th element of the array hold the number of occurrences of i-th k-mer in lexicographic order in textIn other words, for each k-mer, we need to calculate a unique integer for that k-mer used as the index in the frequency array. 
Considering the DNA sequence, we need a function SymbolToNumber(symbol) to transform symbol A,C,G,T to 0,1,2,3.
  • PatternToNumber(pattern): transform a k-mer pattern into an integer (recursion)


PatternToNumber(pattern) = 4 * PatternToNumber(pattern[:-1]) + SymbolToNumber(pattern[-1])
  • NumberToPattern(index,k): transform an integer between 0 and 4^k-1 into a k-mer
  • FasterFrequentWords(text, k): running time O(4^k + |text|^k + 4^k) => still impractical when k is large.
2. FindingFrequentWorksBySorting(text,k): sorting can bring same things together! 



After learning how to find the 'hidden messages' from the DNA sequences, we continue our journal based on the observation (statistically interesting) that some hidden messages are more surprising than the others.
First some biological concepts: 
  • each DNA strand has a direction: read in the 5' -> 3' direction
  • DnaA protein does not care which of the two strands it binds to when the DnaA protein binds to DnaA boxes and initiates the replication.
Followed by some statistical concepts:
  • overlapping words paradox: different k-mers have different probabilities of occurring multiple times as a substring of a random string.
  • Pr(N,A,Pattern,t) is a complex problem because this probability depends heavily on the particular choice of Pattern.
Before we conclude that we have found the DnaA box, we need to check whether there are other short regions in the bacterial genome exhibiting frequent occurrence of that k-mer.
clumps: appear close to each other in a small region of the genome; we defined a k-mer as a "clump" if it appears many times within a short interval of the genome; (L, t) - clump
Related Computational tasks:
3. ClumpFinding(Genome, k, t, L)


Furthermore, we continued to learn a new algorithm to find the oriC based on some features of DNA replication biology.

Important Biology:
1. DNA polymerase does not wait for the two parent strands to completely separate before initiating replication; instead, it starts copying while the strands are unraveling.
2. To start replications, a DNA polymerase needs a primer, a short complementary segment that binds to the parent strand and jump starts the DNA polymerase.
3. DNA polymerase is unidirectional: only traverse a template strand of DNA in the 3' -> 5' direction, which is opposite from the 5' -> 3' direction of DNA.
4. Asymmetry of Replication: replication on reverse half-strand progresses continuously; yet a DNA polymerase on a forward half-strand has no choice but to wait again until the replication fork has opened another 2000 nucleotides or so. -> Okazaki fragments from multiple primers
5. Deamination: cytosine (C) has a tendency to mutate into thymine (T) through a process called deamination. Deamination rates increase 100-fold when DNA is single-stranded, which leads to a decrease in cytosine (C) on the forward half-strand.

THEREFORE, we can use the skew diagram #G-#C to find the oriC <- if this difference starts increasing, then we guess that we are on the forward half-strand; if this difference starts decreasing, then we guess we are on the reverse half-strand.
The skew should achieve a minimum at the position where the reverse half-strand ends and the forward half-strand begins, which is exactly the location of oriC!
Related Computational tasks:
4. Minimum Skew Problem


The LAST computational task of this chapter is Approximate Pattern Matching Problem! And the following are some thoughts:
First, we define a d-neighborhood Neighbors(Pattern, d) of a k-mer Pattern, as the set of all k-mers that are close to Pattern. (Recursive)
Then, for each neighbor only, we update the frequency array.
The approximate pattern matching by sorting and pattern matching by sorting employ the same idea that sorting bring same things together so we can count their appearance by comparing with the neighbor element....Remember this!



Chapter 2: Which DNA Patterns Play the Role of Molecular Clocks? (Randomized Algorithms)
In Chapter 1, we learned one type of hidden message in genomes, however, there are other types of hidden messages that have nothing to do with genome replication, for example, regulatory DNA motifs responsible for gene expression, that we will learn in Chapter 2. 
BIOLOGY:
1. A transcription factor regulates a gene by binding to a specific short DNA interval called a regulatory motif, or transcription factor binding site, in the gene's upstream region, a 600-1000 nucleotide-long region preceding the start of the gene.
2. Regulatory motif are not completely conserved, meaning they can vary at some positions for different genes.
THUS, comes our new COMPUTATIONAL task: develop algorithms for motif finding: discover the "hidden message" shared by a collection of strings.

The DIFFERENCE between Frequent Words Problem and Motif Finding Problem is that: a DnaA box is a pattern that clumps, or appears frequently, within a relatively short interval of the genome; WHILE a regulatory motif is a pattern that appears at least once (with variation) in each of my different regions that are dispersed throughout the genome.

(k,d) motif: a k-mer appears in every string from Dna with at most d mismatches.
Observation: any (k, d) motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna.
(1) A brute force algorithm for motif finding.
Limitation: as long as a single sequence does not contain the transcription factor binding sites, a (k,d) motif does not exist!

Another model: select a k-mer from each string and score these k-mers of how similar they are to each other (motif matrix).
- define the Score(Motifs)
- minimize this score
Motifs -> Score -> Count -> Profile -> Consensus String
- Entropy: the entropy of completely conserved column is 0, and 2 for equally-likely nucleotides column.
- Motif logo
d(Pattern, Dna): the sum of distances between Pattern and all strings in Dna
(2) Median String Problem
Limitation: since median string problem has to consider 4^k k-mers, so toooo slow for cases like k = 15.
(3) Greedy Motif Search
Observation: a k-mer has a higher probability when it is more similar to the consensus string of a profile.
Improvement: pesudocounts
(4) Randomized Motif Search
- Monte Carlo Algorithm: not guaranteed to return exact solutions, but they do quickly find approximate solutions.
- Las Vegas Algorithm
- continue to iterate for as long as the constructed motifs keeps improving
Advantage: Randomized Motif Search can find longer motifs.
(5) Gibbs Sampling
Limitation: local optimum

No comments:

Post a Comment