Progress in Bioinformatics World: Notes of Bioinformatics II

My NOTES of Bioinformatics II: Genome Sequencing offered by UCSD on Coursera.

Chapter 3: How Do We Assemble Genomes? (Graph Algorithms)

Difficulty:

1. double stranded DNA: no way of knowing which strand a given read derives from

2. sequencing errors

3. some regions of the genome may not be covered by any reads

4. Repeats complicate genome assembly: approximately 50% of human genome is made of repeats, e.g., the 300 nucleotide-long Alu sequence is repeated over a million times.

Some Concepts:

reads: sequence much shorter DNA fragments.

genome assembly: put a genome back together from its reads using overlapping information!

overlap graph: form a node for each k-mer in Patterns and connect k-mers Pattern and Pattern' by a directed edge if Suffix(Pattern) is equal to Prefix(Pattern').

hamiltonian path: a path in a graph visit every node once.

k-universal string: contains every binary k-mer exactly once.

de Bruijn graph:assign each k-mer in Patterns as edges, prefix and suffix of each edge as the node, and then glue identical nodes together.

eulerian path: visit every edge exactly once.

NP-complete: can be verified in polynomial time, yet no fast solution. (non-deterministic polynomial time). All NP complete problems are equivalent to each other.

Computational Problems:

(1) String Composition Problem

(2) String Spelled by a Genome Path Problem

(3) Overlap Graph Problem

(4) de Bruijn Graph Problem

(5) de Bruijn Graph From k-mers Problem

Concepts:

Eulerian Cycle: a cycle that traverse each edge of a graph exactly once.

Euler's Theorem: every balanced, strongly connected direct graph is Eulerian.

read pairs: pairs of reads separated by a fixed distance d in the genome.

Paired de Bruijn Graph:

Genome Assembly Faces Real Sequencing Data:

1. Breaking reads into shorted k-mers: yet smaller k may result in a more tangled de Bruijn graph

2. Splitting the genome into contigs (long, continuous fragment of the genome)

- maximal non-branching path: a non-branching path that cannot be extended into a longer non-branching path.

Computational Problems:

(1) Eulerian Cycle Problem

- STACK!!

(2) Eulerian Path Problem

(3) String Reconstruction Problem: reduced to finding an Eulerian Path in the de Bruijn graph generated from read!

(4) k-Universal Circular String Problem: reconstruct a circular string given its k-mer composition.

(5) String Reconstruction from Read-Pairs Problem

(6) Contig Generation Problem

Chapter 4: How Do We Sequence Antibiotics? (Brute Force Algorithms)

A difficult problem in antibiotics research is that of sequencing newly discovered antibiotics, or determining the order of amino acids making up the antibiotic peptide.

Necessary Biology:

1. There are three different ways to divide a DNA string into codons for translation, one starting at each of the first three starting positions of the string (reading frames).

2. Protein translation is carried out by a molecular machine called ribosome.

3. non-ribosomal peptides (NRPs): synthesized not by the ribosome, but by a giant protein called NRP synthetase.

NRPs have pharmaceutical applications because they have been optimized by eons of evolution as "molecular bullets" that bacteria and fungi use to kill their enemies. If these enemies happen to be pathogens, then researchers are eager to borrow these bullets as antibacterial drugs.

4. mass spectrometer: an expensive molecular scale that shatters molecules into pieces and then weighs the resulting fragments, in daltons (Da); 1 Da is approximately equal to the mass of a single nuclear particle.

The mass spectrometer can break each molecule of Tyrocidine B1 into two linear fragments, and it analyzes samples that may contain billions of identical copies of the peptide, with each copy breaking in its own way.

Our goal is to use all of these different fragments to sequence the peptide.

5. subpeptides: for a cyclic peptide with n amino acids, there are n * (n-1) subpeptides.

6. Cyclospectrum(Peptide): theoretical spectrum; the collection of all masses of its subpeptides, as well as mass 0 and mass of the entire peptide.

Assumption: for a given linear peptide Peptide, the mass of any subpeptide is equal to the difference between the masses of two prefixes of Peptide!!

7. For a given cyclic peptide: its theoretical spectrum are those found by LinearSpectrum and those corresponding to subpeptides wrapping around the end of Peptide.

Again, our goal is to reconstruct unknown peptide from its experimental spectrum.

8. A branch-and-bound algorithm for cyclopeptide sequencing: "grow" candidate linear peptides whose theoretical spectra are "consistent" with the experimental spectrum.

- branching step: to increase the number of candidate solutions

- bounding step: to remove hopeless candidates

9. Real Spectra: mass spectrometers generate "noisy" spectra: false masses and missing masses. THUS, in our algorithm design, instead of finding exact match, we need a scoring function that can score how match between the given theoretical spectrum and the given experimental spectrum.

10. Replace Peptides with Leaderboard: hold the N highest scoring candidate for further extension (including ties).

11. Since there are more than 100 NRPs, we MUST determine the amino acid composition of a peptide from its spectrum so that we may run LeaderboardCyclopeptideSequencing on this smaller alphabet of amino acids.

12. spectral convolution: take the positive differences of masses in the spectrum.

13. Epilogue: from simulated to real spectra!

Computational Problems:

(1) Protein Translation Problem:

(2) Generating Theoretical Spectrum Problem: LinearSpectrum and CyclicSpectrum

(3) Counting Peptides with Given Mass Problem: dynamic programming algorithm

(4) Cyclopeptide Sequencing Problem: still not efficient enough

Progress in Bioinformatics World

Wednesday, January 6, 2016

Notes of Bioinformatics II

No comments:

Post a Comment