Background:
Markov models and hidden Markov models (HMMs) are often used for modeling written text. A common way of modeling text is to treat each word as a single observation x_t (perhaps with some preprocessing to handle punctuation, etc.). In an HMM, the hidden states z_t represent abstract concepts or word types, and they may or may not have a clear interpretation. More complex models are also frequently used to model text, especially higher-order Markov chains and probabilistic context-free grammars (PCFGs). In this exercise, you will apply a hidden Markov model to a short sample of text resembling stories from a children's book, and you will use the Baum-Welch algorithm to estimate the parameters. Using a short sample of text involving a limited vocabulary and relatively simple grammar makes the problem easier. (With a larger sample of text, a larger vocabulary, and more complex grammar, a more sophisticated approach would probably be needed.)

Data:
The file "x-words.txt" contains a sequence of integers x_1,...,x_n, where x_t represents the t'th word/symbol in the sample of text. The file "code-words.txt" contains the mapping from integers to words/symbols. (You will not need code-words.txt until part 3 of this assignment.)

Setup:
The hidden values z_t belong to the set {1,...,m}, for some value of m that we will choose below. Each observation x_t is an integer in {1,...,k}, where k = 59, since there are 59 distinct words/symbols occurring in the text. There are n = 292 observations x_1,...,x_n (the text is 292 words/symbols long). For each state i = 1,...,m, we model the emission distribution p(x_t | z_t = i) as an arbitrary probability vector phi_i = (phi_i1,...,phi_ik) summing to 1.

Exercises:

1. Implement the forward algorithm and the backward algorithm. As a sanity check, compute the value of log(p(x_1,...,x_n)) using the forward algorithm, assuming that the parameters are:
      m = 10
      pi_i = 1/m for all i = 1,...,m
      T_ij = 1/m for all i = 1,...,m and j = 1,...,m
      phi_iw = 1/k for all i = 1,...,m and w = 1,...,k
   Report the value you get. These parameter values are simple enough that you can analytically calculate log(p(x_1,...,x_n)) to check your answer. (A sketch of the scaled forward/backward recursions appears at the end of this assignment.)

2. Implement the Baum-Welch algorithm. For the convergence criterion, stop when the change in log(p(x_1,...,x_n)) from one iteration to the next is less than 0.01. Run the algorithm on the data in x-words.txt with the following settings:
      a) m = 10
      b) m = 30
      c) m = 100
   For each m, run the algorithm 3 times with different randomly chosen initializations of the parameters (so you will be doing 9 runs altogether). For each run, report the number of iterations until "convergence" and the value of log(p(x_1,...,x_n)) at the last iteration. (A Baum-Welch sketch appears at the end of this assignment.)

3. Implement a function that takes the HMM parameters, generates a random sequence of hidden states z_1,...,z_N and observations x_1,...,x_N, and (using code-words.txt) prints out the sequence of words/symbols corresponding to x_1,...,x_N in a nicely formatted way. For each of m = 10, 30, and 100, run your implementation of the Baum-Welch algorithm to estimate the parameters, generate a random sequence x_1,...,x_N with N = 250, and print out the corresponding sequence of words/symbols. Discuss the differences you observe between the generated text for m = 10, 30, and 100. (A sampling sketch appears at the end of this assignment.)

Notes:
- In the M-step for phi_i, you will need to use Lagrange multipliers to analytically compute the maximum. (A sketch of this derivation appears at the end of this assignment.)
- For reference, my implementation takes about one second when m = 10, three or four seconds when m = 30, and roughly 30-40 seconds when m = 100.
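Sketch for exercise 1 (forward/backward with scaling): this is a minimal sketch in Python/NumPy, not a required implementation. It assumes pi is a length-m vector, T is m x m with T[i, j] = p(z_{t+1} = j | z_t = i), phi is m x k with phi[i, w] = p(x_t = w | z_t = i), and that x-words.txt contains whitespace-separated integers indexed from 1 (the file layout is an assumption; adjust the loading code to match the actual format). With all-uniform parameters the transition/initial terms sum out, so log p(x_1,...,x_n) should equal n * log(1/k), which the snippet uses as the analytic check.

    import numpy as np

    def forward_backward(x, pi, T, phi):
        """Scaled forward/backward pass; returns alpha, beta, scalings, and log p(x)."""
        n, m = len(x), len(pi)
        alpha = np.zeros((n, m))
        beta = np.zeros((n, m))
        c = np.zeros(n)                      # per-step normalizers (scaling factors)

        # Forward pass: after scaling, alpha[t, i] = p(z_t = i | x_1,...,x_t).
        alpha[0] = pi * phi[:, x[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, n):
            alpha[t] = (alpha[t - 1] @ T) * phi[:, x[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]

        # Backward pass, scaled by the same factors so that alpha*beta gives posteriors.
        beta[n - 1] = 1.0
        for t in range(n - 2, -1, -1):
            beta[t] = T @ (phi[:, x[t + 1]] * beta[t + 1]) / c[t + 1]

        log_lik = np.log(c).sum()            # log p(x_1,...,x_n)
        return alpha, beta, c, log_lik

    # Sanity check with the uniform parameters from exercise 1.
    m, k = 10, 59
    x = np.loadtxt("x-words.txt", dtype=int) - 1   # shift to 0-indexing (assumed file format)
    pi = np.full(m, 1.0 / m)
    T = np.full((m, m), 1.0 / m)
    phi = np.full((m, k), 1.0 / k)
    *_, log_lik = forward_backward(x, pi, T, phi)
    print(log_lik, len(x) * np.log(1.0 / k))       # the two values should agree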
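Sketch for exercise 2 (Baum-Welch): a minimal sketch built on the forward_backward function above, with the same assumed parameter layout. Rows are initialized from Dirichlet draws so each distribution sums to 1, and the loop stops when log p(x) improves by less than 0.01, as the exercise specifies. Function and variable names here (baum_welch, xi_sum, etc.) are illustrative, not required.

    def baum_welch(x, m, k, rng, tol=0.01, max_iter=1000):
        # Random initialization of pi, T, phi.
        pi = rng.dirichlet(np.ones(m))
        T = rng.dirichlet(np.ones(m), size=m)
        phi = rng.dirichlet(np.ones(k), size=m)

        prev_ll = -np.inf
        for it in range(1, max_iter + 1):
            alpha, beta, c, ll = forward_backward(x, pi, T, phi)

            # E-step: posterior state marginals and expected transition counts.
            gamma = alpha * beta                              # gamma[t, i] = p(z_t = i | x)
            xi_sum = np.zeros((m, m))
            for t in range(len(x) - 1):
                xi_sum += (alpha[t][:, None] * T
                           * (phi[:, x[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]

            # M-step: re-estimate each distribution from expected counts and normalize.
            pi = gamma[0] / gamma[0].sum()
            T = xi_sum / xi_sum.sum(axis=1, keepdims=True)
            phi = np.zeros((m, k))
            for w in range(k):
                phi[:, w] = gamma[x == w].sum(axis=0)
            phi /= phi.sum(axis=1, keepdims=True)

            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return pi, T, phi, ll, it

    # The 9 runs requested in exercise 2: 3 random restarts for each m.
    x = np.loadtxt("x-words.txt", dtype=int) - 1
    rng = np.random.default_rng(0)
    for m in (10, 30, 100):
        for run in range(3):
            *_, ll, iters = baum_welch(x, m, 59, rng)
            print(m, run, iters, ll)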
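Sketch for exercise 3 (sampling and printing text): a minimal sketch that samples (z_1,...,z_N) and (x_1,...,x_N) from fitted parameters and decodes the integers back to words. It assumes code-words.txt contains one "integer word" pair per line; that layout is an assumption, so adjust the parsing to match the actual file.

    def sample_hmm(pi, T, phi, N, rng):
        """Draw a length-N sequence of hidden states and observations from the HMM."""
        m, k = phi.shape
        z = np.zeros(N, dtype=int)
        x = np.zeros(N, dtype=int)
        z[0] = rng.choice(m, p=pi)
        x[0] = rng.choice(k, p=phi[z[0]])
        for t in range(1, N):
            z[t] = rng.choice(m, p=T[z[t - 1]])
            x[t] = rng.choice(k, p=phi[z[t]])
        return z, x

    def print_words(x, code_file="code-words.txt", width=12):
        """Map integer observations back to words and print them, wrapped for readability."""
        words = {}
        with open(code_file) as f:
            for line in f:
                idx, word = line.split()          # assumed "integer word" per line
                words[int(idx) - 1] = word        # shift to 0-indexing to match x
        tokens = [words[i] for i in x]
        for start in range(0, len(tokens), width):
            print(" ".join(tokens[start:start + width]))

    # pi, T, phi: parameters returned by a baum_welch run above.
    rng = np.random.default_rng(0)
    z, x_gen = sample_hmm(pi, T, phi, N=250, rng=rng)
    print_words(x_gen)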
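Sketch of the M-step derivation for phi_i (the first note above): the term of the expected complete-data log-likelihood involving phi_i is sum_t gamma_t(i) log phi_{i,x_t}, where gamma_t(i) = p(z_t = i | x_1,...,x_n). Maximizing it subject to sum_w phi_iw = 1 with a Lagrange multiplier gives the normalized expected counts used in the Baum-Welch sketch above. In LaTeX notation:

    \mathcal{L}(\phi_i, \lambda_i)
      = \sum_{t=1}^{n} \gamma_t(i) \log \phi_{i x_t}
        + \lambda_i \Bigl( 1 - \sum_{w=1}^{k} \phi_{iw} \Bigr),
    \qquad
    \frac{\partial \mathcal{L}}{\partial \phi_{iw}}
      = \frac{\sum_{t:\, x_t = w} \gamma_t(i)}{\phi_{iw}} - \lambda_i = 0
    \;\Longrightarrow\;
    \hat{\phi}_{iw}
      = \frac{\sum_{t=1}^{n} \gamma_t(i)\, \mathbf{1}\{x_t = w\}}
             {\sum_{t=1}^{n} \gamma_t(i)}.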