Wednesday, January 6, 2016

Notes of Bioinformatics III

My NOTES of  Bioinformatics III: Comparing Genes, Proteins, and Genomes

Chapter 5: How Do We Compare Biological Sequences? (Dynamic Programming)


For my PhD research, I developed a novel pairwise protein structure alignment algorithm (UniAlign: http://sacan.biomed.drexel.edu/unialign/). And part of it is to calculate the sequence similarity between two protein sequences by glocal alignment. So, I am quite familiar with dynamic programming.

What is new to me in this Chapter is that, it applies dynamic programming to the problem of finding the longest common subsequence. Finding a longest common subsequence of two strings is equivalent to finding an alignment of these strings maximizing the number of matches.

The matches in an alignment of two strings define a common subsequence of the two strings, or a sequence of symbols appearing in the same order (yet not necessarily consecutively) in both strings.


An introduction to Dynamic Programming: the Change problem
1. The trick behind dynamic programming is to solve the smaller problems once rather than billions of times.
2. To find a longest path in an arbitrary DAG, first we need to order the nodes of the DAG so that every node falls after all its predecessors



Computational Problems:1. Manhattan Tourist Problem2. Output LCS Problem3. Longest Path in a DAG problem

Some Concepts:
1. scoring matrices
2. global alignment
3. local alignment
- when compute the s(i,j), add zero-weight edges from (0,0) to every node, which means that the source node (0, 0) is a predecessor of every node (i,j) (free taxi ride).
4. alignment with affine gap penalties problem



The Changing Faces of Sequence Alignment:
(1). Edit Distance: the minimum number of edit operations(insertion, deletion or substitution) needed to transform one string into another.
Hints: do a global alignment with mismatch penalty = -1 and gaps = -1, match = 0.
(2). Fitting Alignment: find a region within the longer protein sequence v that has high similarity with all of the shorter sequence w. In other words, "fitting" w to v required finding a substring v' of v that maximize the global alignment score between v' and w among all substrings of v.
Hints: "free taxi rides" only for the second string.
(3). Overlap Alignment: refers to the global alignment of a suffix of v with a prefix of w
Hints: "free taxi rides" only for the first string.
These three applications of sequence alignment using dynamic programming are very nice!!

Affine Gap Penalties:
- Build Manhattan on three levels: lower, middle and upper.

Space-Efficient Sequence Alignment:
1. The memory required to store the dynamic programming matrix is substantial O(n*m).
2. divide-and-conquer algorithms: divide phase splits a problem instance into smaller instances and solves them; conquer phase stitches the smaller solutions into a solution to the original problem. 
3. The idea of space reduction techniques is that we don't need to store any backtracking pointers if we are willing to spend a little more time.
4. The Middle Node Problem!! 
- In general, at each new step before the final step, we double the number of middle nodes found while halving the runtime required to find middle nodes.
5. The Middle Edge Problem
- middle edge: an edge in an optimal alignment path starting at the middle node (more than one middle edge may exist for a given middle node).


Multiple Sequences Alignment
- todo


Chapter 6: Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)

Biology:
reversal: the most common form of genome rearrangement.
Random breakage model of chromosome evolution: the breakage points of rearrangements are selected randomly.
- exponential distribution: of synteny block lengths

Concept:
reversal distance: the minimum number of reversals required to transform P into Q.
identity permutation.
Breakpoint Theorem.

Rearrange in multi-chromosomal genomes
- reversals and translocations, as well as fusions and fissions.
- genome graph
- breakpoint graph
- Cycle Theorem: given genome P and Q, any 2-break applied to  P can increase Cycles(P,Q) by at most 1.
- 2-Break Distance Theorem: the 2-break distance between P and Q is equal to Blocks(P)-Cycles(Q).

Fragile Breakage Model: every mammalian genome is a mosaic of long solid regions, which are rarely affected by rearrangements, as well as short fragile regions that serve as rearrangement hotspots and that account only for a small fraction of the genome. For humans and mice, these fragile regions make up approximately 3% of the genome.

Synteny Block Construction:
- construct synteny blocks from shared k-mers
- construct synteny blocks as connected components in graph


Computational Problem:
1. Greedy Sorting by Reversal Problem
2. 2-Break Distance Problem
3. 2-Break Sorting Problem
4. Finding Shared K-mers Problem

All the python codes are available on my github: https://github.com/aprilchunyuzhao/BioinformaticsFromCoursera  

No comments:

Post a Comment