Progress in Bioinformatics World: Notes of Bioinformatics III

My NOTES of Bioinformatics III: Comparing Genes, Proteins, and Genomes

Chapter 5: How Do We Compare Biological Sequences? (Dynamic Programming)

For my PhD research, I developed a novel pairwise protein structure alignment algorithm (UniAlign: http://sacan.biomed.drexel.edu/unialign/). And part of it is to calculate the sequence similarity between two protein sequences by glocal alignment. So, I am quite familiar with dynamic programming.

What is new to me in this Chapter is that, it applies dynamic programming to the problem of finding the longest common subsequence. Finding a longest common subsequence of two strings is equivalent to finding an alignment of these strings maximizing the number of matches.

The matches in an alignment of two strings define a common subsequence of the two strings, or a sequence of symbols appearing in the same order (yet not necessarily consecutively) in both strings.

An introduction to Dynamic Programming: the Change problem

1. The trick behind dynamic programming is to solve the smaller problems once rather than billions of times.

2. To find a longest path in an arbitrary DAG, first we need to order the nodes of the DAG so that every node falls after all its predecessors.

Computational Problems:1. Manhattan Tourist Problem2. Output LCS Problem3. Longest Path in a DAG problem

Some Concepts:

1. scoring matrices
2. global alignment

3. local alignment

- when compute the s(i,j), add zero-weight edges from (0,0) to every node, which means that the source node (0, 0) is a predecessor of every node (i,j) (free taxi ride).

4. alignment with affine gap penalties problem

The Changing Faces of Sequence Alignment:

(1). Edit Distance: the minimum number of edit operations(insertion, deletion or substitution) needed to transform one string into another.

Hints: do a global alignment with mismatch penalty = -1 and gaps = -1, match = 0.

(2). Fitting Alignment: find a region within the longer protein sequence v that has high similarity with all of the shorter sequence w. In other words, "fitting" w to v required finding a substring v' of v that maximize the global alignment score between v' and w among all substrings of v.

Hints: "free taxi rides" only for the second string.

(3). Overlap Alignment: refers to the global alignment of a suffix of v with a prefix of w.

Hints: "free taxi rides" only for the first string.

These three applications of sequence alignment using dynamic programming are very nice!!

Affine Gap Penalties:

- Build Manhattan on three levels: lower, middle and upper.

Space-Efficient Sequence Alignment:

1. The memory required to store the dynamic programming matrix is substantial O(n*m).

2. divide-and-conquer algorithms: divide phase splits a problem instance into smaller instances and solves them; conquer phase stitches the smaller solutions into a solution to the original problem.

3. The idea of space reduction techniques is that we don't need to store any backtracking pointers if we are willing to spend a little more time.

4. The Middle Node Problem!!

- In general, at each new step before the final step, we double the number of middle nodes found while halving the runtime required to find middle nodes.

5. The Middle Edge Problem

- middle edge: an edge in an optimal alignment path starting at the middle node (more than one middle edge may exist for a given middle node).

Multiple Sequences Alignment

- todo

Chapter 6: Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)

Biology:

reversal: the most common form of genome rearrangement.

Random breakage model of chromosome evolution: the breakage points of rearrangements are selected randomly.

- exponential distribution: of synteny block lengths

Concept:

reversal distance: the minimum number of reversals required to transform P into Q.

identity permutation.

Breakpoint Theorem.

Rearrange in multi-chromosomal genomes:

- reversals and translocations, as well as fusions and fissions.

- genome graph

- breakpoint graph

- Cycle Theorem: given genome P and Q, any 2-break applied to P can increase Cycles(P,Q) by at most 1.

- 2-Break Distance Theorem: the 2-break distance between P and Q is equal to Blocks(P)-Cycles(Q).

Fragile Breakage Model: every mammalian genome is a mosaic of long solid regions, which are rarely affected by rearrangements, as well as short fragile regions that serve as rearrangement hotspots and that account only for a small fraction of the genome. For humans and mice, these fragile regions make up approximately 3% of the genome.

Synteny Block Construction:

- construct synteny blocks from shared k-mers

- construct synteny blocks as connected components in graph

Computational Problem:

1. Greedy Sorting by Reversal Problem

2. 2-Break Distance Problem

3. 2-Break Sorting Problem

4. Finding Shared K-mers Problem

All the python codes are available on my github: https://github.com/aprilchunyuzhao/BioinformaticsFromCoursera

Progress in Bioinformatics World

Wednesday, January 6, 2016

Notes of Bioinformatics III

No comments:

Post a Comment