Assignment 3

Due 11:59 pm on Nov 5 (Monday)

 

Given a set of proteins that belong to a family (positive set) and a set of proteins that do not belong to the family (negative set), implement a method based on Markov chain models that predicts whether each protein in a test set belongs to the family (100 points).

 

 

1. The program should implement both 1st and 2nd order Markov chains and allow users to choose.
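
The two orders differ only in how many preceding letters form the context for the next letter: one letter for a 1st order chain, two letters for a 2nd order chain. A minimal sketch of the counting step is given below. The class name TransitionCounter and its method are hypothetical, and skipping non-standard letters is an assumption, not a requirement of the assignment. How the first one or two letters of each sequence are handled is left open here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts order-k transitions (k = 1 or 2) in training sequences. */
final class TransitionCounter {
    static final String ALPHABET = "ARNDCQEGHILKMFPSTWYV";

    /**
     * Returns a map from each k-letter context to an array of 20 counts,
     * one per amino acid observed immediately after that context.
     */
    static Map<String, int[]> count(List<String> sequences, int order) {
        if (order != 1 && order != 2) {
            throw new IllegalArgumentException("order must be 1 or 2");
        }
        Map<String, int[]> counts = new HashMap<>();
        for (String seq : sequences) {
            for (int i = order; i < seq.length(); i++) {
                String context = seq.substring(i - order, i);
                int next = ALPHABET.indexOf(seq.charAt(i));
                // Assumption: transitions involving non-standard letters are skipped.
                if (next < 0 || !allStandard(context)) {
                    continue;
                }
                counts.computeIfAbsent(context, c -> new int[ALPHABET.length()])[next]++;
            }
        }
        return counts;
    }

    private static boolean allStandard(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (ALPHABET.indexOf(s.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

With this layout, the same training code serves both orders; the user's choice only changes the order argument.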

 

2. The program should ask for three files: the positive set, the negative set, and the test set.

 

3. The program uses the positive set and the negative set to train Markov chain models and makes a prediction for every protein in the test set. The output should show the following information for each protein in the test set:

(1) Protein name

(2) Log likelihood ratio, i.e., log [ P(x | positive model) / P(x | negative model) ], where x is the protein sequence

(3) Prediction (1 or -1), with 1 denoting that the protein is predicted to be positive (belongs to the family) and -1 denoting that the protein is predicted to be negative (does not belong to the family)

 

The output should be in the following format (one line per protein):

 

[Protein name]              [log likelihood ratio]                  [Prediction]
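
As a rough illustration of the output step, the sketch below turns the two log probabilities of a test protein into a ratio, a prediction, and one output line. The class LlrOutput is hypothetical. Several details are assumptions rather than part of the assignment: the columns are separated by tabs, the ratio is printed with four decimals, a ratio of exactly 0 is predicted as -1, and the log probabilities are assumed to have been computed elsewhere with the same log base.

```java
import java.util.Locale;

/** Hypothetical helper: scores one test protein and formats its output line. */
final class LlrOutput {

    /** log [P(x | positive) / P(x | negative)] computed as a difference of logs. */
    static double logLikelihoodRatio(double logProbPositive, double logProbNegative) {
        return logProbPositive - logProbNegative;
    }

    /** Assumption: a strictly positive ratio means "belongs to the family". */
    static int predict(double llr) {
        return llr > 0 ? 1 : -1;
    }

    /** One line per protein: name, ratio, prediction (tab-separated by assumption). */
    static String formatLine(String name, double llr) {
        return String.format(Locale.ROOT, "%s\t%.4f\t%d", name, llr, predict(llr));
    }
}
```

Working with sums of log probabilities rather than products of probabilities avoids numerical underflow on long sequences.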

 

4. Implement your program using Java, C++ or C#.

 

5. Provide a README file showing

 

6. Turn in

 

DO NOT turn in any junk files!

 

7. The 20 letters for amino acids:

  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
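
One convenient way to use this alphabet in code is to map each letter to an index from 0 to 19, so that counts and probabilities can be stored in arrays. The sketch below is only a suggestion. The class AminoAcids is hypothetical, and treating lowercase or any other character as invalid (index -1) is an assumption.

```java
import java.util.Arrays;

/** Hypothetical helper: maps the 20 amino acid letters to indices 0..19. */
final class AminoAcids {
    static final String LETTERS = "ARNDCQEGHILKMFPSTWYV";

    private static final int[] INDEX = new int[128];

    static {
        Arrays.fill(INDEX, -1);
        for (int i = 0; i < LETTERS.length(); i++) {
            INDEX[LETTERS.charAt(i)] = i;
        }
    }

    /** Returns the index of the letter, or -1 if it is not one of the 20 letters. */
    static int indexOf(char c) {
        return c < 128 ? INDEX[c] : -1;
    }
}
```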
 
8. Format of the input files. All input files use the following format:
 

>Name of protein 1

Sequence of protein 1

>Name of protein 2

Sequence of protein 2

 
 
For example,
 

>Q24682_DUGTI/54-110

RKERTAFSKGQILELEKEFAVHNYLTRLRRYELAVALNLNERQIKVWFQNRRMKCKR

>WOQ2_ARATH/11-72

SSSRWNPTKDQITLLENLYKEGIRTPSADQIQQITGRLRAYGHIEGKNVFYWFQNHKARQRQ
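
A file in this format can be read with a small loop over its lines, as sketched below. The class ProteinFileReader is hypothetical. The sketch assumes that blank lines can be ignored, that a sequence may span several lines, and that a repeated protein name overwrites the earlier entry; none of these is stated in the assignment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reads ">name" / sequence records into an ordered map. */
final class ProteinFileReader {

    static Map<String, String> read(BufferedReader in) throws IOException {
        Map<String, String> proteins = new LinkedHashMap<>(); // keeps file order
        String name = null;
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no information
            }
            if (line.startsWith(">")) {
                if (name != null) {
                    proteins.put(name, sequence.toString());
                }
                name = line.substring(1).trim();
                sequence.setLength(0);
            } else if (name != null) {
                sequence.append(line); // assumption: sequences may wrap over lines
            }
        }
        if (name != null) {
            proteins.put(name, sequence.toString());
        }
        return proteins;
    }

    /** Convenience wrapper for text that is already in memory. */
    static Map<String, String> parse(String text) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file on disk, the same read method could be given a reader obtained from java.nio.file.Files.newBufferedReader.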

 
 
9. To avoid zero transition probabilities, we add a pseudocount of 10^-6 to each estimate as follows