# Visualizing the Effects of Multicollinearity on LS Regression

In this post I would like to share with you two interesting visual insights into the effects of multicollinearity among the predictor variables on the coefficients of least squares regression (LSR). This post is very non-technical and I skip over most mathematical details. One of these insights is borrowed from Using Multivariate Statistics by Barbara Tabachnick and Linda Fidell (2013, 6th edition). When I first received it by post, after purchasing it on amazon, I immediately wanted to return it back. The paperback edition is the size of two large bricks and weighs 2.2 kg! I remember paging through it and thinking why on earth did this book receive so many good reviews on amazon. Only when I started reading it I understood why. This book is indispensable in providing clear and intuitive insight into topics like LSR, logistic regression, PCA and DA, canonical correlation, data cleaning and pre-processing, and many others. The source of the second insight is an online stats course offered by the Pennsylvania State University, which I find to be very good. Now that I am done with the credits, let’s see what the insights are!

## I Thought you Said White with No Sugar?

Ok, so you probably know that multi/collinearity and singularity among predictor variables in regression is bad. But why is it bad and what does it affect? Multicollinearity is defined as a high (>90%) correlation among two or more predictor variables or their combinations (e.g. structural multicollinearity). Singularity can be simply defined as predictor redundancy. The presence of collinearity affects stability and interpretability of the regression model. By stability I mean what happens to the regression coefficients when new data points or new predictor variables are added in. By predictability I mean the usefulness of the magnitude and sign of the regression coefficients in telling the story. For example, economist are often interested in using regression models to explain the relationships of one set of economic parameters on another. Take for instance using inflation index and consumer price index to forecast borrow rates. The economists would be interested in building a model that can tell them that and x-amount increase in inflation will result in w*x-amount increase/decrease in the rate. Using predictor variables that are highly correlated with each other will most likely result in a model that “keeps changing” its story.

Let’s continue with our example of forecasting the borrow rate. Please note that in reality there may be no relationship between the borrow rate and inflation index or consumer price index. I am simply using these to illustrate the regression point. So, let’s imagine that inflation index (II) is correlated with the consumer price index (CPI) at 92%. Let’s also assume that we want to use the import tax rate (ITR) in the regression, as we believe that it is a factor in the inflation dynamics. Interestingly, ITR has no correlation with II and it correlates with CPI at 5%. We believe that the regression equation therefore is:

$br=\beta + w_{1}ii + w_{2}cpi + w_{3}itr + \epsilon$

The LSR will estimate $w_{1}, w_{2}$ and $w_{3}$. Both the magnitude and the sign of the coefficients will depend on how strongly the predictor variables correlated with each other. In [1] we can find a great visual explanation for this using Venn diagrams. Take a look at figure below. A Venn diagram is a good way to show to what extend CPI and II can predict the rate. There is a substantial overlap between CPI and the rate and the II and the rate (~ 40%?). However, because CPI and II overlap themselves, the only credit each predictor gets assigned is the unique non-overlapping contribution. The unique contribution of CPI will be the size of area c. The unique contribution of II will be the area b. The area a will be “lost” to the standard error (see Note a below).

How does this make our regression solution unstable? Imagine we decide not to use II. The Venn diagram now looks like Figure 2. The regression coefficient of CPI will blow-up to the full credit it deserves. This will change our previous story about how CPI explains the borrow rate. If I were to remove the ITR instead, the coefficients of II and CPI would not have changed significantly. Thus in presence of multicollinearity among the predictor variables, neither do we get a stable solution, nor can we use it to properly interpret the underlying relationships.

## Are you Sitting Comfortably?

Which chair would you rather sit on? Me too. In [2] we can find another great visual insight into how multicollinearity impacts the stability of LSR.

A scatter plot of two uncorrelated variables looks like a sea of data points. A scatter plot of two highly correlated variables looks more like a straight line. A regression solution will fit a hyperplane through the data points and the more spread-out the data points are, the more stable is the shape and direction of the best fit hyperplane. If, however, the data is align in a straight line, the best fit plane can be at any angle, depending on which target point it has to go through.

Figure 3 above shows that the uncorrelated data points act like anchors for the left hyperplane. In case of strongly correlated II and CPI where the data points are in a line, there are no anchors for the sides and corners of the right plane, and it will “tilt” into the direction of whichever BR point it has to go through.

## Summary and Conclusion

I hope you have found this visual insight into the impact of multi/collinearity revealing. I definitely did. To conclude, the presence of multicollinearity among the predictor variables in the least squares standard multiple regression does not invalidate the predictions generated by the fitted model. However, the stability of the model and its usefulness in telling the story about the model parameters is compromised.

References:
[1]. Using Multivariate Statistics. Barbara Tabachnick and Linda Fidell. 2013, 6th edition. Pearson.
[2]. Online Statistics course STATS501. https://onlinecourses.science.psu.edu/stat501/

Notes:

(a) In practice, the total coefficients weight assigned to CPI and II will approximately reflect the entire a+b+c area. If CPI and II both have the same real contributions, the individual weights assigned by the regression may depend on the order of data columns (i.e. does CPI appear before II) that is used to fit regression model, with the first predictor receiving the greatest weight. By “lost to standard error” I mean the dependence on these random factors like order, as well as an overall increase in the standard error of CPI and II regression coefficients.

# Test-Driven Data Analysis (and its possible application to the LS Regression)

I have recently attended a PyData meetup in London where Nicolas Radcliffe gave a nice talk on the concept of Test-Driven Data Analysis (TDDA). Here is a link to the slides that he presented.

Essentially, the idea behind TDDA is born from a well-known idea of Test Driven Software Development (TDSD or simply TDD). In TDSD, the test cases are written before the core code. This is often done to avoid developing solutions for specific scenarios and missing the edge cases. Take for example division. Everyone (hopefully) provides for division by zero, but what about integer division when a floating number is expected? How many times have you been caught out by this in a dynamic programming language like Python, staring at the code and wondering why every number comes out as zero? Developers who use TDSD specify at the start what the code should behave like and the constraints placed on the results.

Right now the practice of data science and data analysis very much lacks the structure of TDSD. I liked Nicolas’s 5th slide where he schematically shows the pillars of the data scientist’s workflow: trying, eyeballing and trying again until it makes sense. However, each stage of data analysis is prone to error. Starting from choosing the modelling approach down to interpretation of the results. Errors can be made in data pre-processing, model implementation, output gathering and even when plotting graphs. One of the main points of the talk was reproducible research. If every analysis is wrapped in a verification procedure that ensures that the same outputs are produced for the same inputs, then analysis bugs can be greatly reduced.

Another strong point of the talk was about using constraints. Much of the test-driven software development relies on constraints placed on the allowed inputs&outputs. In the data analysis, such constraints can be placed on data types, data value ranges, legality of duplicates, or even statistical and distributional properties. For example, can a value of an insurance claim be negative? What about a negative correlation between the insurance costs and the values of the claim?

I am hugely in favor of everything that makes software development more disciplined and the final result more robust. Thus you can see why I liked Nicolas’s ideas so much. It made me think about how and where I can incorporate TDDA in my work. Let’s take the work horse of data science – the least squares (standard multiple) regression. Can TDDA safeguard from common mistakes of applying regression and interpreting its results? I think it can. The quality of regression analysis and the ability of reproducing and reusing the regression model results will depend on the following parameters, all of which can be the constraints of TDDA:

• Inputs: number of input cases vs. the number of independent variables: too many or too few input cases can invalidate regression analysis. When the number of input variables is small compared to the number of independent variables (i.e. the attributes of regression), the regression converges on perfect but meaningless solution. Another possibility is that the variance-covariance matrix can be non-invertable. Alternatively, when one tries to fit regression to too much data, almost any amount of correlation between the dependent and the independent variable can become statistically significant. A large number of false positives is an unfortunate side-effect of big data used in the wrong way. A possible constraint can be a simple check if the number of cases is ≥ 50 + 8 * ‘# of independent variables’.
• Inputs: presence of singularity and multicollinearity: two regression attributes are singular if one of them is redundant (e.g. age and year of birth). Multicolliniear attributes are highly correlated, and in the case of standard multiple regression (as opposed to the step-wise regression, for example), can significantly impact/reduce the regression coefficients.
• Inputs: presence of outliers: everyone knows what outliers can do to the statistics like the mean and standard deviation. Likewise, outliers can significantly impact the coefficients of regressions. A constraint on outliers absence can be placed on the data.
• Output: normality and linearity of residuals: if regression was performed on non-transformed data, the residuals should always be examined. Skewed residuals may point to absence of normality in the input data, and non-normal input data may invalidate your prediction intervals if regression model is used to make predictions. Non-linear residuals may indicate that the originally suggested linear relationship between the dependent and the independent variables is inaccurate. Absence of normality is often dealt with data transformations (e.g. taking square roots or logarithms). Non-linearity will require re-visiting the original model (e.g. squaring or cubing the attributes).
• Output: homoscedasticity of residuals: the absence of homoscedasticity between the error and the predicted variable may occur if the response is related to some transformation of the dependent variable instead of its original form. In other words, the relationship of the two variables varies over time. For example, the relationship between age and income is very likely to have this non-linear shape, since income of young people varies less than income of people who are over forty. I should note that the presence of heteroscedasticity does not invalidate regression results, but it may weaken the strength of its interpretation. Again, data transformation can come to the rescue (e.g. taking the logarithm).

Here you go – we have come up with five constraints that can be placed on regression input and output to ensure robust data analysis. Luckily most analytical platforms like SAS or SPSS have built-in checks that flag when these constraints are not satisfied. But in case if you are playing with a regression tool from another package – watch out. There is no need to increase that figure of 90% (see slide # 10)…

# samplepy – a new Python Sampling Package

This post is to introduce a new Python package samplepy. This package was written to simplify sampling tasks that so often creep-up in machine learning. The package implements Importance, Rejection and Metropolis-Hastings sampling algorithms.

samplepy has a very simple API. The package can be installed with pip by simply running pip install samplepy. Once installed, you can use it to sample from any univariate distribution as following (showing rejection sampling use):

from samplepy import Rejection
import matplotlib.pyplot as plt
import numpy as np

# define a unimodal function to sample under
f = lambda x: 2.0*np.exp(-2.0*x)
# instantiate Rejection sampling with f and required interval
rej = Rejection(f, [0.01, 3.0])
# create a sample of 10K points
sample = rej.sample(10000, 1)

# plot the original function and the created sample set
x = np.arange(0.01, 3.0, (3.0-0.01)/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="f(x)=2.0*exp(-2*x)")
plt.legend(loc=1)
plt.show()


The three sampling method (i.e. Rejection, Importance and MH) are quite different and will achieve slightly different results for the same function. Performance is another important difference factor, with Metropolis-Hastings probably being the slowest. Let’s compare how the three sampling algorithm deliver on a bi-modal univariate function:

$f(x)=exp(-x^{2})*(2+\sin(5x)+\sin(2x))$

from samplepy import Rejection, Importance, MH
import matplotlib.pyplot as plt
import numpy as np

f = lambda x: np.exp(-1.0*x**2)*(2.0+np.sin(5.0*x)+np.sin(2.0*x))
interval = [-3.0, 3.0]
rej = Rejection(f, interval)  # instantiate Rejection sampling with f and interval
sample = rej.sample(10000, 1)    # create a sample of 10K points

x = np.arange(interval[0], interval[1], (interval[1]-interval[0])/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Rejection")
plt.legend(loc=1)
plt.show()

mh = MH(f,interval)
sample = mh.sample(20000, 100, 1)  # Make sure we have enough points in the sample!

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="MH")
plt.legend(loc=1)
plt.show()

imp = Importance(f, interval)
sample = imp.sample(10000, 0.0001, 0.0010) # create a sample where essentially no extra importance is given to any quantile

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Importance")
plt.legend(loc=1)
plt.show()



Hopefully this gives you enough examples to get you started using samplepy!

# A Tutorial on Hidden Markov Model with a Stock Price Example – Part 2

This is the 2nd part of the tutorial on Hidden Markov models. In this post we will look at a possible implementation of the described algorithms and estimate model performance on Yahoo stock price time-series.

#### Implementation of HMM in Python

I am providing an example implementation on my GitHub space. Please note that all code is provided with a disclaimer that you are free to use it at your own risk. HMM.py contains the main implementation. There are a few things to point out in this file:

1. As is pointed out in the referenced document in part 1, $\alpha$ and $\beta$ require to be scaled for longer observations since the product of probabilities quickly tends to zero, resulting in underflow. The code has a flag for scaling, and defaults to positive.
2. While testing out this model I noticed that initial assignment of $\hat{a_{ij}}$ and $\hat{b_{j}}(k)$ makes a big difference to the final solution. I am providing two possible assignments. One, as described in part 1, setting approximately to the same values, adding to 1. And two, using Dirichlet distribution to create non-uniform assignment that adds-up to 1. I’ve noticed that the former results in more meaningful model parameters.
3. You will notice a small hack in HMMBaumWelch method where I am setting $M$ to the maximum observation in $O$. This is needed because not all observation sequences will contain all values (i.e. 0, 1, 2). And the transition and emission matrices are accessed as if all the values exist.
4. I am using yahoo_finance Python module to source the stock prices for Yahoo. It seems to work and is pretty easy to use. You should be able to download this module here.

#### HMM Model performance to predict Yahoo stock price move

On my github space, HMM_test.py contains a possible test example code. I am testing the model as following: train the model on a specified window of daily historical moves (e.g. 10 days) and using the model parameters determine the predicted current model state. Then, using the predicted state, determine the next likely state from $\hat{a}$:

prediction_state = np.argmax(a[path[-1],:])


The most likely emitted value from that state can be found as follows:

prediction = np.argmax(b[prediction_state,:])


I then compare the predicted move to the historical one. So, how did HMM do? Well, not so well. The model parameters $\lambda=(A,B, \pi)$ are very sensitive to the convergence tolerance, initial assignment and, obviously the training time-window. Calculating accuracy ratio as the number of correctly predicted directions is pretty much around 50%-56% (tests on more recent data produce the higher accuracy in this range). Thus, we might as well be throwing a coin to make buy or sell predictions. It is possible that EOD price moves are not granular enough to provide a coherent market dynamic model.

# A Tutorial on Hidden Markov Model with a Stock Price Example – Part 1

This tutorial is on a Hidden Markov Model. What is a Hidden Markov Model and why is it hiding?

I have split the tutorial in two parts. Part 1 will provide the background to the discrete HMMs. I will motivate the three main algorithms with an example of modeling stock price time-series. In part 2 I will demonstrate one way to implement the HMM and we will test the model by using it to predict the Yahoo stock price!

A Hidden Markov Model (HMM) is a statistical signal model. This short sentence is actually loaded with insight! A statistical model estimates parameters like mean and variance and class probability ratios from the data and uses these parameters to mimic what is going on in the data. A signal model is a model that attempts to describe some process that emits signals. Putting these two together we get a model that mimics a process by cooking-up some parametric form. Then we add “Markov”, which pretty much tells us to forget the distant past. And finally we add ‘hidden’, meaning that the source of the signal is never revealed. BTW, the later applies to many parametric models.

#### Setting up the Scene

HMM is trained on data that contains an observed sequence of signals (and optionally the corresponding states the signal generator was in when the signals were emitted). Once the HMM is trained, we can give it an unobserved signal sequence and ask:

1. How probable is that this sequence was emitted by this HMM? This would be useful for a problem like credit card fraud detection.
2. What is the most probable set of states the model was in when generating the sequence? This would be useful for a problem of time-series categorization and clustering.

It is remarkable that the model that can do so much was originally designed in the 1960-ies! Here we will discuss the 1-st order HMM, where only the current and the previous model states matter. Compare this, for example, with the nth-order HMM where the current and the previous n states are used. Here, by “matter” or “used” we will mean used in conditioning of states’ probabilities. For example, we will be asking about the probability of the HMM being in some state $q_{t}=s_{i}$ given that the previous state was $q_{t-1}=s_{j}$. Such probabilities can be expressed in 2 dimensions as a state transition probability matrix. Let’s pick a concrete example. Let’s image that on the 4th of January 2016 we bought one share of Yahoo Inc. stock. Let’s say we paid $32.4 for the share. From then on we are monitoring the close-of-day price and calculating the profit and loss (PnL) that we could have realized if we sold the share on the day. The states of our PnL can be described qualitatively as being up, down or unchanged. Here being “up” means we would have generated a gain, while being down means losing money. The PnL states are observable and depend only on the stock price at the end of each new day. What generates this stock price? The stock price is generated by the market. The states of the market influence whether the price will go down or up. The states of the market can be inferred from the stock price, but are not directly observable. In other words they are hidden. So far we have described the observed states of the stock price and the hidden states of the market. Let’s imagine for now that we have an oracle that tells us the probabilities of market state transitions. Generally the market can be described as being in bull or bear state. We will call these “buy” and “sell” states respectively. Table 1 shows that if the market is selling Yahoo stock, then there is a 70% chance that the market will continue to sell in the next time frame. We also see that if the market is in the buy state for Yahoo, there is a 42% chance that it will transition to selling next. Table 1 Sell Buy Sell 0.70 0.30 Buy 0.42 0.58 The oracle has also provided us with the stock price changes probabilities per market state. Table 2 shows that if the market is selling Yahoo, there is an 80% chance that the stock price will drop below our purchase price of$32.4 and will result in negative PnL. We also see that if the market is buying Yahoo, then there is a 10% chance that the resulting stock price will not be different from our purchase price and the PnL is zero. As I said, let’s not worry about where these probabilities come from. It will become clear later on. Note that row probabilities add to 1.0

Table 2
Down Up Unchanged
Sell 0.80 0.15 0.05

It is February 10th 2016 and the Yahoo stock price closes at $27.1. If we were to sell the stock now we would have lost$5.3. Before becoming desperate we would like to know how probable it is that we are going to keep losing money for the next three days.

To put this in the HMM terminology, we would like to know the probability that the next three time-step sequence realised by the model will be {down, down, down} for t=1, 2, 3. This sequence of PnL states can be given a name $O=\{O_{1}, O_{2},O_{3}\}$. So far the HMM model includes the market states transition probability matrix (Table 1) and the PnL observations probability matrix for each state (Table 2). Let’s call the former A and the latter B. We need one more thing to complete our HMM specification – the probability of stock market starting in either sell or buy state. Intuitively, it should be clear that the initial market state probabilities can be inferred from what is happening in Yahoo stock market on the day. But, for the sake of keeping this example more general we are going to assign the initial state probabilities as $\pi=\{0.5, 0.5\}$. Thus we are treating each initial state as being equally likely. So, the market is selling and we are interested to find out $P(O|\lambda(A,B,\pi))$. This is the probability to observe sequence $O$ given the current HMM parameterization. It is clear that sequence $O$ can occur under 2^3=8 different market state sequences. To solve the posed problem we need to take into account each state and all combinations of state transitions. The probability of this sequence being emitted by our HMM model is the sum over all possible state transitions and observing sequence values in each state.

This gives rise to very long sum!

$P(O|\lambda)=P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell) \\ P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\ + P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\ P(move \, to \, buy\, from \, sell)P(emit \, down \, from \, buy) \\ + ... \\ + P(\pi = buy)P(emit \, down \, from \, buy)P(move \, to \, sell \, from \, buy)P(emit \, down \, from \, sell) \\ P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\ + ... \\ + P(\pi = buy)P(emit \, down \, from \, buy)P(move \, to \, buy \, from \, buy)P(emit \, down \, from \, buy) \\ P(move \, to \, buy \, from \, buy)P(emit \, down \, from \, buy)$

In total we need to consider 2*3*8=48 multiplications (there are 6 in each sum component and there are 8 sums). That is a lot and it grows very quickly. Please note that emission probability is tied to a state and can be re-written as a conditional probability of emitting an observation while in the state. If we perform this long calculation we will get $P(O|\lambda=(A,B,\pi))=0.192$. There is an almost 20% chance that the next three observations will be a PnL loss for us!

#### Oracle is No More

In the previous section we have gained some intuition about HMM parameters and some of the things the model can do for us. However, the model is hidden, so there is no access to oracle! The state and emission transition matrices we used to make projections must be learned from the data. The hidden nature of the model is inevitable, since in life we do not have access to the oracle. In life we have access to historical data/observations and a magic methods of “maximum likelihood estimation” (MLE) and Bayesian inference. The MLE essentially produces distributional parameters that maximize the probability of observing the data at hand (i.e. it gives you the parameters of the model that is most likely have had generated the data). The HMM has three parameters $\lambda=(A,B,\pi)$. A fully specified HMM should be able to do the following:

1. Given a sequence of observed values, provide us with a probability that this sequence was generated by the specified HMM. This can be re-phrased as the probability of the sequence occurring given the model. This is what we have calculated in the previous section.
2. Given a sequence of observed values, provide us with the sequence of states the HMM most likely has been in to generate such values sequence.
3. Given a sequence of observed values we should be able to adjust/correct our model parameters $A$, $B$ and $\pi$.

When looking at the three ‘should’, we can see that there is a degree of circular dependency. That is, we need the model to do steps 1 and 2, and we need the parameters to form the model in step 3. Where do we begin? There are three main algorithms that are part of the HMM to perform the above tasks. These are: the forward&backward algorithm that helps with the 1st problem, the Viterbi algorithm that helps to solve the 2nd problem, and the Baum-Welch algorithm that puts it all together and helps to train the HMM model. Let’s discuss them next.

#### Going Back and Forth

That long sum we performed to calculate $P(O|\lambda=(A,B,\pi))$ grows exponentially in the number of states and observed values. The Forward and Backward algorithm is an optimization on the long sum. If you look back at the long sum, you should see that there are sum components that have the same sub-components in the product. For example,

$P(O|\lambda)=P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)...$

appears twice. The HMM Forward and Backward (HMM FB) algorithm does not re-compute these, but stores the partial sums as a cache. There are $N$ states, $M$ discrete values that can be emitted from each of $s_{i,...,N}$ states and $i=1,...,N$. There are $T$ observations in the considered sequence $O$. The state transition matrix is $A$, where $a_{i,j}$ is an individual entry and $j=1,...,N$. The emission matrix is $B$, where $b_{j}(O_{t})$ is an individual entry $b_{j}(T_{t})=P(emitting \, O_{t}|q_{t}=j)$, and $t=1,...,T$, $q_{t}$ is state at time t. For initial states we have $\pi_{i}=P(q_{1}=s_{i})$. The problem of finding the probability of sequence $O$ given an HMM is $P(O|\lambda=(A,B,\pi))$. Formally this probability can be expressed as a sum:

$P(O|\lambda)=\sum_{q_{1},q_{2},...,q_{T}}\pi_{q_{1}}b_{q_{1}}(O_{1})a_{q_{1}q_{2}}b_{q_{2}}(O_{2})a_{q_{2}q_{3}}b_{q_{3}}(O_{3})...a_{q_{T-1}q_{T}}b_{q_{T}}(O_{T})$

HMM FB calculates this sum efficiently by storing the partial sum calculated up to time $t$. HMM FB is defined as follows:

1. Start by initializing the “cache” $\alpha_{1}(i)=\pi_{i}b_{i}(O_{1})$, where $1 \le i \le N$.
2. Calculate over all remaining observation sequences and states the partial sums:
$\alpha_{t+1}(j)=\big[\sum_{j=1}^{N} \alpha_{ij}\big] b_{j}(O_{t+1})$, where $1 \le t \le T-1$ and $1 \le j \le N$. Since at a single iteration we hold $i$ constant, this results in a trellis-like calculation along the columns of the transition matrix.
3. The probability is then given by $P(O|\lambda)=\sum_{i=1}^{N}\alpha_{T}(i)$. This is where we gather the individual states probabilities together.

The above is the Forward algorithm which requires only $TN^{2}$ calculations. The algorithm moves forward. We can imagine an algorithm that performs similar calculation, but backwards, starting from the last observation in $O$. Why do we need this? Is the Forward algorithm not enough? It is enough to solve the 1st poised problem. But it is not enough to solve the 3rd problem, as we will see later. So, let’s define the Backward algorithm now.

1. Start by initializing the end “cache” $\beta_{T}(i)=1$, for $1 \le i \le N$. This is an arbitrary assignment.
2. Calculate over all remaining observation sequences and states the partial sums (moving back to the start of the observation sequence):
$\beta_{t}(i)=\sum_{j=1}^{N} \alpha_{ij}b_{j}(O_{t+1})\beta_{t+1}(j)$, where $t=T-1,T-2,...,1$ and $1 \le j \le N$. Since at a single iteration we hold $i$ constant, this results in a trellis-like calculation along the columns of the transition matrix. However, it is performed backwards in time.

The $\beta$ matrix also contains probabilities of observing sequence $O$. However, the actual values in $\beta$ are different from those in $\alpha$ because of the arbitrary assignment of $\beta_{T}$ to 1.

#### The State of the Affairs

We are now ready to solve the 2nd problem of the HMM – given the model and a sequence of observations, provide the sequence of states the model likely was in to generate such a sequence. Strictly speaking, we are after the optimal state sequence for the given $O$. Optimal often means maximum of something. At each state and emission transitions there is a node that maximizes the probability of observing a value in a state. Let’s take a closer look at the $\alpha$ and $\beta$ matrices we calculated for the $\{down, \, down, \, down\}$ example.

I have circled the values that are maximum. All of these correspond to the Sell market state. Our HMM would have told us that the most likely market state sequence that produced $\{down, \, down, \, down\}$ was $\{Sell, \, Sell, \, Sell\}$. Which makes sense. The HMM algorithm that solves the 2nd problem is called Viterbi Algorithm, named after its inventor Andrew Viterbi. The essence of Viterbi algorithm is what we have just done – find the path in the trellis that maximizes each node probability. It is a little bit more complex than just looking for the max, since we have to ensure that the path is valid (i.e. it is reachable in the specified HMM).

We are after the best state sequence $Q=\{q_{1},q_{2},...,q_{T}\}$ for the given $O$. The best state sequence maximizes the probability of the path. We will be taking the maximum over probabilities and storing the indices of states that result in the max.

1. Start by initializing the end “cache” $\delta_{1}(j)=\pi_{i}b_{i}(O_{1})$, for $1 \le i \le N$. We also need to initialize to zero a variable used to track the index of the max node: $\phi_{1}(i)=0$.
2. Calculate over all remaining observation sequences and states the partial max and store away the index that delivers it:
$\delta_{t}(j)=\max_{1\le i \le N} \small[ \delta_{t-1}(i)a_{ij}\small]b_{j}(O_{t})$, where $2 \le t \le T$ and $1 \le j \le N$.
Update the sequence of indices of the max nodes as:
$\phi_{t}=argmax_{1 \le i \le N} \small[ \delta_{t-1}(i) a_{ij}\small]$ for j and t as above.
3. Termination (probability and state):
$P*=\max_{1 \le i \le N} \small[\delta_{T}(i)\small]$
$q_{T}*=argmax_{1 \le i \le N} \small[\delta_{T}(i)\small]$
4. The optimal state sequence $Q$ is given by the path:
$q_{t}^{*}=\phi_{t+1}(q_{t+1}^{*})$, for $t=T-1,T-2,...,1$

#### Expect the Unexpected

And now what is left is the most interesting part of the HMM – how do we estimate the model parameters from the data? We will now describe the Baum-Welch Algorithm to solve this 3rd poised problem.

The reason we introduced the Backward Algorithm is to be able to express a probability of being in some state i at time t and moving to a state j at time t+1. Imagine again the probabilities trellis. Pick a model state node at time t, use the partial sums for the probability of reaching this node, trace to some next node j at time t+1, and use all the possible state and observation paths after that until T. This gives the probability of being in state $s_{t}(i)$ and move to $s_{t+1}(j)$. The figure below graphically illustrates this point.

To make this transition into a proper probability, we need to scale it by all possible transitions in $\lambda$ and $O$. It can now be defined as follows:

$\gamma_{t}(i,j)=\frac{\alpha_{t}(i)a_{ij}b_{j}(O_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}$

So, $\gamma_{t}(i,j)$ is a probability of being in state i at time t and moving to state j at time t+1. It makes perfect sense as long as we have true estimates for $\alpha$, $A$, $B$ and $\pi$. We will come to these in a moment…

Summing $\gamma_{t}(i,j)$ across $j$ gives the probability of being in state i at time t under $O$ and $\lambda$: $\gamma_{t}(i)=\sum_{j=1}^{N}\gamma_{t}(i,j)$. Let’s look at an example to make things clear. Let’s consider $O=\{0,1,1\}=\{down, up, up\}$. We will use the same $A$ and $B$ from Table 1 and Table 2. The $alpha$ learned for this observation sequence is shown below:

The (un-scaled) $\beta$ is shown below:

So, what is $\gamma_{1}(0,1)$? It is (0.7619*0.30*0.65*0.176)/0.05336=49%, where the denominator is calculated for $t=1$. The denominator is calculated across all i and j at $t=1$, thus it is a normalizing factor. If we calculate $\gamma_{2}(0,1)$ and sum the two estimates together, we will get the expected number of transitions from state $s(i)$ to $s(j)$. Note, we do transition between two time-steps, but not from the final time-step as it is absorbing. Similarly, the sum over $\gamma_{t}(i)$, where $t=1,2$ gives the expected number of transitions from $s(i)$. It should now be easy to recognize that $\frac{\sum_{t=1}^{T-1}\gamma_{t}(i,j)}{\sum_{t=1}^{T-1}\gamma_{t}(i)}$ is the transition probability $a_{ij}$, and this is how we estimate it:

$\hat{a_{ij}}=\frac{\sum_{t=1}^{T-1}\gamma_{t}(i,j)}{\sum_{t=1}^{T-1}\gamma_{t}(i)}$

We can derive the update to $b_{j}(O_{t})$ in a similar fashion. The matrix $B$ stores probabilities of observing a value from $O$ in some state. For $k=1,2,...,M$ $b_{j}(k)=\frac{\sum_{t\in\{1,...,T\}O_{t}=k}\gamma_{t}(j)}{\sum_{t=1}^{T}\gamma_{t}(j)}$ is the probability of observing symbol $k$ in state j. The update rule becomes:

$\hat{b_{j}}(k)=\frac{\sum_{t\in\{1,...,T\}O_{t}=k}\gamma_{t}(j)}{\sum_{t=1}^{T}\gamma_{t}(j)}$

$\pi$ stores the initial probabilities for each state. This parameter can be updated from the data as:

$\pi_{i}=\gamma_{1}(i)$

We now have the estimation/update rule for all parameters in $\lambda$. The Baum-Welch algorithm is the following:

1. Repeat until convergence:

2. Initialize $A, B, \pi$ to random values such that $\pi_{i}\approx \frac{1}{N}$, $a_{ij}\approx \frac{1}{N}$ and $b_{j}(k)\approx \frac{1}{M}$. It is important to ensure that row probabilities add up to 1 and probabilities are not uniform as this may result in convergence to a local maximum.
3. Compute $\alpha_{t}(i)$, $\beta_{t}(i)$, and $\gamma_{t}(j)$ and $\gamma_{t}(i,j)$.
4. Re-estimate $A$, $B$, and $\pi$.

The convergence can be assessed as the maximum change achieved in values of $A$ and $B$ between two iterations. The authors of this algorithm have proved that either the initial model defines the optimal point of the likelihood function and $\lambda = \hat{\lambda}$, or the converged solution provides model parameters that are more likely for a given sequence of observations $O$. The described algorithm is often called the expectation maximization algorithm and the observation sequence works like a pdf and is the “pooling” factor in the update of $\lambda$. I have described the discrete version of HMM, however there are continuous models that estimate a density from which the observation come from, rather than a discrete time-series.

I hope some of you may find this tutorial revealing and insightful. I will share the implementation of this HMM with you next time.

Reference: L.R.Rabiner. An introduction to Hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-268, 1989.