BME205, Fall 2011, Section 01: Python Assignment 4

Dynamic Programming

Due Friday 11 Nov 2011


There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html.

If you want a textbook presentation, one of the best is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis. The class text: An Introduction to Bioinformatics Algorithms by Neil C. Jones and Pavel A. Pevzner may be more tutorial. Chapter 6 is the relevant chapter there.

Use the recurrence relationships presented in Durbin and Eddy - also described in class and are in the lecture notes.

Be careful of the boundary conditions. For global alignment, you should have "start"-"start" aligned with 0 cost in the MATCH state and neg. infinity in Ix and Iy, so that a gap opening penalty is incurred if you start with an insertion or deletion. Before start, all rows and columns are negative infinity, and the rest of the start row and column can be inferred from the recurrence relation (see lecture notes). Make sure that your initial conditions are consistent with the recurrence relations people sometimes get the indices of their matrices reversed, or confuse the Ix and Iy matrices.

For local alignment, the recurrence relations and the boundary conditions are different.  Again, use the recurrences from Durbin and Eddy, or those presented in class.

Name your program "test_align", so that we can test everyone's program with a script that does test_align --align=local < test.fasta and test_align -align=global < test.fasta . Turn in both your program and the output for running the program on the sequence pairs in the the following fasta files:Since local alignment is allowed to start at any match state (with the 0 choice for previous alignment cost), all rows and columns before the first letters of the sequences can be minus infinity.

You'll need a substitution matrix for this assignment. I've provided BLOSUM50.

Q1 (hand computation):

Compute the score for the following global alignment using the BLOSUM50 scoring matrix with a gap-opening cost of 12 and a gap-extension cost of 1 (all are in 1/2 bit units to be compatible with the BLOSUM50 matrix). Show what you added up to get the score, since the final number alone could easily be incorrect due to a transcription or addition error.  Also note that this is a different table from last year.

    
---CTNIDDSADKN-R--
VLDAME-EIGEGDDIKMV

Note: this is neither the global nor the local optimal alignment.

Q2 (align module):

Create a module (align.py) that contains two classes: local_aligner and global_aligner. Both classes should have the same methods:

aligner= local_aligner(subst,gaps)
aligner= global_aligner(subst,gaps)

records the substitution matrix and gap costs for the alignments to follow. 

aligner.score_fasta(self,s1,s2)

Given two aligned strings that represent an alignment in fasta format, score the alignment with the parameters used when the aligner was created. Use this method to check that the alignments you output score what the dynamic programming algorithm thinks that they should have.

scored.score=aligner.align(xseq,yseq)

Creates and fills in alignment matrices and stores xseq and yseq. If you implement with matrices, consider using dict objects.  You might also consider a mesh - a 2 or 3 dimensional linked list.

seqList=aligner.traceback()

Returns a list object that contains the aligned xseq and yseq sequences.

Write a program test_align

Use your aligner module to do a master-slave alignment program.

Required Options:

  • --subst_matrix BLOSUM50  (Specifies the file to read the substitution matrix from (integer values in BLOSUM format))
  • --align local (Specify either local or global alignment.)
  • --open 10  (Gap opening penalty)
  • --extend 2  (Gap extension penalty.)

The program should read a fasta file from stdin. The fasta files may have alignment characters in them, like spaces, periods, and hyphens. These are not part of the sequences and should be removed or ignored. The alphabet for the fasta input is taken to be the characters used as indices in the substitution matrix.

The first sequence of the fasta file is taken to be the "master" sequence and output to stdout. All subsequent sequences are aligned to the first sequence and then both are output to stdout as a pair. The score for each alignment should be output to stderr.

The gap penalties should (by default) be open=12, extend=1.

To help you with your debugging, I provide possible output for local alignment of python5-3, with g=12, e=1:

>fat    
GARFIELD-------ISAFATCAT
>cat
GARFIELDWASTHELASTFATCAT

>fat
GARFIELDISAFATCAT
>cat2
GARFIELDISAFATCAT

>fat    
GARFIELDISA--FATCAT
>cat3
GARFIELDISTHEFATCAT

>fat    
GARFIELDISAFA-TCAT
>cat4
GARFIELDISAFASTCAT

>fat
GARFIELDISA--FA-TCAT
>cat4
GARFIELDISTHEFASTCAT

>fat
GARFIELDISAFAT
>cat5
GARFIELDIS-FAT

>fat    
SAYGA
>say
SAYFA

with scores 79, 108, 90, 96, 78, 68 and 19. Possible global alignments are

>fat
ISAYGARFIELD-------ISAFATCAT
>cat
----GARFIELDWASTHELASTFATCAT

>fat           
ISAYGARFIELDISAFATCAT
>cat2
----GARFIELDISAFATCAT

>fat
ISAYGARFIELDISA--FATCAT
>cat3
----GARFIELDISTHEFATCAT

>fat           
ISAYGARFIELDISAFA-TCAT
>cat4
----GARFIELDISAFASTCAT

>fat          
ISAYGARFIELDISA--FA-TCAT
>cat4
----GARFIELDISTHEFASTCAT

>fat
ISAYGARFIELDISAFATCAT
>cat5
----GARFIELDIS----FAT

>fat          
ISAYGARFIELDISAFATCAT
>say
-SAY--------------FAT

with scores 64, 93, 75, 81, 63, 40, and -11.

Bonus points

  • linear gap costs

Implement linear gap costs and compare running times for affine and linear algorithms. You may need to use longer test sequences to get repeatable timing results, as the expected 3-fold difference in the inner loop may be buried in overhead.

  • double_gap costs

Add an optional parameter that allows transitions between the 2 insert states (Ix <-> Iy).  By default, this should be neg infinity as discussed in class.

  • doctest

As always, correct (and effective) use of doctest gets you an extra point. 

Attachments

  • BLOSUM50, application/octet-stream, 1,823 bytes