BME205, Fall 2011, Section 01: Python Assignment 2

Simple Markov chains
Due Friday 14 Oct 2011 (beginning of class).

This assignment is intended to deepen your understanding of first-order Markov chains and of relative entropy. I will provide you with a data set consisting of many protein sequences in a FASTA file (see Python assignment 1). You will build stochastic models of these sequences as zero-order and first-order Markov chains, and measure the information gain of the first-order model over the zero-order model.

I have provided two FASTA files: Python2-seqs-1.fa and Python2-seqs-2.fa for training and testing the models. You will do 2-fold cross-validation (training on each of the sets and testing on the other). 

The combined set was derived from Dunbrack's culledpdb list of X-ray structures with resolution 1.6 Angstroms or better, by Professor Karplus.

We will be modeling the sequences including a special "stop" character at the end of the sequences. This means that the Markov chains will have probabilities that add to one over all strings that end with the stop character, rather than over strings of a fixed length. For simplicity, we'll be using the same non-sequence character to indicate the start of a sequence, by default "=". It might be a good idea in your code to have separate parameters for start and stop characters, and to allow start=None to mean that k-mers that start before the first character of the sequence are not counted (similarly for stop=None.

The simplest model we discussed in class is the i.i.d. (independent, identically distributed) model, or zero-order Markov model. It assumes that each character of the string comes from the same background distribution P0, and the probability of a string is just the product of the probabilities of the letters of the string (including the final "="). To estimate the background distribution, we need counts of all the letters in the training set.

The next simplest model was a first-order Markov chain, in which the probability of each character depends only on the preceding character. To estimate the conditional probabilities, we'll need the counts for all pairs of letters in the training set, including initial pairs (like "=M") and ending pairs (like "K=").

In general, for estimating a k-th order Markov chain, we need counts of all (k+1)-mers in the training data.

First program: count-kmers

Write a program called "count-kmers" that reads a FASTA file from stdin and outputs a table of counts to stdout. The output should have two tokens on each line: the k-mer followed by the count of that k-mer. Initial and final k-mers should be symmetrically included, that is, if the sequence is ABDEF and you are counting 3-mers, the counts should be

==A 1
=AB 1
EF= 1
F== 1

Note: case should be ignored in the input sequence, so that "abDE", "ABDE", and "abde" are all counted as the same 4-mer.

The "count-kmers" program should accept two options:

--order(or -o) that gives the order of the model (hence --order 0 should count 1-mers) and 
--alphabet(or -a) which specifies which letters of the sequence are to be used (for example, to count only standard amino acids, use --alphabet ACDEFGHIKLMNPQRSTVWY). All characters in sequences that are not in the alphabet should be ignored. Note: you probably want to use the "set" type to represent the alphabet internally.

The actual counting of k-mers should be done by a function you define in a separate module ( or Make sure that your function definitions all have docstrings. The counts are most easily kept and returned in a "dict" object, which provides a very convenient way to do all sorts of mappings ”in this case k-mers to counts. I found it convenient to use count=collections.defaultdict(int), so that I could use count[kmer]+=1, without having to worry about whether the kmer had previously been seen. 

If the function is passed sys.stdin as an argument, then you can easily reuse it on other files. Furthermore, if you use:

 for line in input_stream:

to read the input, then you can do testing by passing in a list of lines:

>>> dummy_in= '>seq1\\nABCDEF\\n>seq2\\nAC\\nde\\nFG\\n>seq3\\n'
>>> count=get_counts(dummy_in.splitlines(True), 2, "ABCDEFGHIJK", '=')

Here is an example of what you should get when your run your program:

count-kmers -o 0 --alphabet=ACDEFGHIKLMNPQRSTVWY < /cse/classes/bme205/Fall10/p2_files/Python2_f10_1.seqs 
= 1886
A 34864
C 5894
D 24520
E 26427
F 16494
G 32867
H 11313
I 22395
K 23590
L 35618
M 9400
N 18435
P 19286
Q 15431
R 19873
S 26139
T 23318
V 28693
W 6248
Y 14851

Optional bonus points:

  • Use the doctest module to do unit testing for the functions in your markov module.
  • Add an option --nonsequence(or -n) to change the stop character from "=" to a user-defined symbol.
  • Sort the k-mers in alphabetical order for the output file (this is what I did in the example above).
  • Sort the k-mers in reverse order of counts for the output file.
  • Check that the alphabet has only printable, non-whitespace characters in it. Raise an exception if there are problems.
  • Check that the non-sequence character is not part of the alphabet, and raise an exception if it is.

Second program: coding-cost

Write a program that reads in a table of k-mer counts (in the same format as you output them) from stdin and FASTA-formatted sequences from a file (passed as a required argument to the program), and computes the total encoding cost in bits for the sequences. The encoding cost of a character x using model M is,-log2(PM(x)), and the encoding cost of a sequence is the sum of the costs for the characters (note: this is equivalent to taking -log2 of the product of the probabilities).

Report not only the total encoding cost for the file, but the average cost per sequence and per character. So that everyone gets the same numbers, for the per-character costs, divide the total cost by the number of sequence characters, excluding the non-sequence symbols (though encoding them is included in the total cost).  Said another way, the total cost includes encoding one stop character at the end (for any order), but do not count the stop character in the count of characters when dividing. 

Note that to do the computation, you need to convert the counts into probabilities. The simplest approach is to normalize the counts to sum to 1 over each "condition". You might think that because the FASTA training file will be large, you will not need to worry about pseudocounts, but even by order 2, there are 3-mers that have zero counts, so the maximum-likelihood estimate of their probability is 0 and the encoding cost is infinite (resulting in a "ValueError: math domain error" from python if you actually try to compute it).

The next simplest approach is to add a pseudocount to every count (including the zero counts for a context, so if you have seen "WWA", "WWC", ... but not "WWH" or "WWI", you have the context "WW" and will need to add a pseudocount for every "WWx" where x ranges over your alphabet. I recommend adding a function to your markov module that produces a conditional probability dict from a count dict (without changing count), so that you can easily modify the way you handle the probability estimation. The default pseudocount should be 1.

Hint: for this assignment, since we are not looking at the encoding cost individually for each sequence, but only the total encoding cost, you can summarize the fasta file with the same sort of dictionary of counts that was used for count-kmers. You can determine the size of k-mer you need to count by looking at the size of the keys you read in for the model.

Run coding-cost on both sets of sequences using both order 0 and order 1 models. Get the coding costs both for the training set and for the test set. 

Optional bonus points:Note that a simple uniform model over 20 characters would take 4.32193 bits/char, and over a 21-letter alphabet (including stop) would take 4.39232 bits/char.

  • The simple approach of making sure that pseudocounts have been used for letters in every observed context will work fine up to about order 2, but for higher order Markov chains, you might run into k-mers for which even the k-1-mer that is the context has never been seen in the training set. (For example, Python2-seqs-2.fa has a sequence starting with CYC, producing 4-mer =CYC , but the context =CY never occurs in Python2-seqs-1.fa.)

    You can fix this problem by making sure that the log-prob table has keys for all k-mers over the alphabet.

  • Test the "reversibility" of an alphabet, by determining the coding cost for reversed sequences using the Markov chain. Note that since we counted off-the-end characters symmetrically at the beginning and end, we can simply create a new reverse_count object which has the same values but with each key reversed.
  • Have the program automatically run tests of different values of the pseudocount, to find out what value works best. This can be as simple as
        for pseudocount in [1e-10, 0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10, 100]:  
    or as sophisticated as you want to make it.
  • Try a more sophisticated regularizer than pseudocounts. For example, you could use the probabilities from a smaller context. One method I think is cool, but have not yet implemented, is to have the pseudocounts for a order k+1 model be alpha*probabilities from an order k model, recursively down to pseudocounts of alpha/num_letters for an order 0 model.
  • To get even fancier with alphabets, allow the user to specify the alphabet by name, and look it up in the /projects/compbio/lib/alphabet/files we have defined for our various other programs. The ExtAA alphabet gives the definition of the amino acids. Note the specification of wildcards. Another alphabet we use a lot is the EHL2 alphabet, in DSSP.alphabet---note its use of aliases to collapse several different letters in a sequence into the same equivalence class.
  • Think about adding options to handle letters that are not the 20 standard amino acids (wildcards B,Z, and X, and spaces "." and "-"). The basic assignment calls for you to ignore any letters or punctuation not in the alphabet you specify.
    • X is a code for any amino acid
    • B is a code for aspartic acid or asparagine (D or N)
    • Z is a code for glutamic acid or glutamine (E or Q)
      The B and Z codes occur mainly in protein sequences that have been determined by old-fashioned chemical methods, not from translating DNA sequences.

    There are several ways to handle wildcards and non-standard amino acid letters:

    1. treat them the same as spaces, tabs, ".", and "-". That is ignore them completely on input. This will make one erroneous transition: ABF would look like AF. This is the default behavior that your programs are expected to provide.
    2. Arbitrarily pick one of the possible amino acids (maybe replace B by D and Z by E).
    3. Treat the character as a gap in the sequence---handle the subsequences on each side as different sequences. There are two variants of this---one which starts a new sequence after the B like any other sequence (so ABF would have a start->F transition) and one that doesn't add the start->F transition. Note that adding a start->F transition may throw off probabilities for what the first residue of a sequence really is.

    There are undoubtedly other methods also.

To turn in

I want copies of your programs on paper, but I don't want all the output from every run of the programs! Instead, write a short summary describing the results of the tests, giving the average encoding costs for each set of sequences for each model. Try to write your summary as if you were describing the method and the results in a journal paper that has a page limit: you want to describe the data sets and methods completely but concisely. If you do it well, the presentation should take less than a page.

Turn in your summary page with the program stapled after it. (Avoid paper clips---they have a tendency to either come off or to pick up other people's papers.)

Submit your program and writeup files using eCommons.


  • Uppercase your lines from the fasta input, since you want to ignore case.
  • Because real sequences may have length 0, be sure your program can handle the empty sequence properly.
  • The reversal of a string is standard python function, but it doesn't return a string object, but a "reversed" object, which appears to be derived from lists rather than strings. To get a reversed string for kmer, use
  • An observation from previous years: A lot of students managed to make their programs more complex to write and harder to use by insisting on file I/O instead of using stdin and stdout. The assignment requires that you use stdin and stdout. Also, error messages should be printed to stderr (to avoid corrupting the main output in stdout).
  • Remember that Markov chains are conditional probabilities P(xi=s | xi+1=t), not joint probabilities
  • Break code into paragraphs, separated by blank lines, and start each paragraph with a comment as a topic sentence for the paragraph.
  • Don't read all the data before processing any. If you want to make a fasta-reading module, make sure its interface returns one sequence at a time, using the yield statement in Python to make a generator.
  • A sequence may span many lines. Make sure that you don't start a new sequence with each new line. Sequences only start after an ID line, which starts with ">".
  • Don't have a logarithm computation in the inner loop. Convert your counts into costs before running through the test set.