A Tutorial on Hidden Markov Model with a Stock Price Example – Part 1

This tutorial is on a Hidden Markov Model. What is a Hidden Markov Model and why is it hiding?

Can you see me?
Can you see me?

I have split the tutorial in two parts. Part 1 will provide the background to the discrete HMMs. I will motivate the three main algorithms with an example of modeling stock price time-series. In part 2 I will demonstrate one way to implement the HMM and we will test the model by using it to predict the Yahoo stock price!

A Hidden Markov Model (HMM) is a statistical signal model. This short sentence is actually loaded with insight! A statistical model estimates parameters like mean and variance and class probability ratios from the data and uses these parameters to mimic what is going on in the data. A signal model is a model that attempts to describe some process that emits signals. Putting these two together we get a model that mimics a process by cooking-up some parametric form. Then we add “Markov”, which pretty much tells us to forget the distant past. And finally we add ‘hidden’, meaning that the source of the signal is never revealed. BTW, the later applies to many parametric models.

Setting up the Scene

HMM is trained on data that contains an observed sequence of signals (and optionally the corresponding states the signal generator was in when the signals were emitted). Once the HMM is trained, we can give it an unobserved signal sequence and ask:

  1. How probable is that this sequence was emitted by this HMM? This would be useful for a problem like credit card fraud detection.
  2. What is the most probable set of states the model was in when generating the sequence? This would be useful for a problem of time-series categorization and clustering.

It is remarkable that the model that can do so much was originally designed in the 1960-ies! Here we will discuss the 1-st order HMM, where only the current and the previous model states matter. Compare this, for example, with the nth-order HMM where the current and the previous n states are used. Here, by “matter” or “used” we will mean used in conditioning of states’ probabilities. For example, we will be asking about the probability of the HMM being in some state q_{t}=s_{i} given that the previous state was q_{t-1}=s_{j}. Such probabilities can be expressed in 2 dimensions as a state transition probability matrix. Let’s pick a concrete example. Let’s image that on the 4th of January 2016 we bought one share of Yahoo Inc. stock. Let’s say we paid $32.4 for the share. From then on we are monitoring the close-of-day price and calculating the profit and loss (PnL) that we could have realized if we sold the share on the day. The states of our PnL can be described qualitatively as being up, down or unchanged. Here being “up” means we would have generated a gain, while being down means losing money. The PnL states are observable and depend only on the stock price at the end of each new day. What generates this stock price? The stock price is generated by the market. The states of the market influence whether the price will go down or up. The states of the market can be inferred from the stock price, but are not directly observable. In other words they are hidden.

So far we have described the observed states of the stock price and the hidden states of the market. Let’s imagine for now that we have an oracle that tells us the probabilities of market state transitions. Generally the market can be described as being in bull or bear state. We will call these “buy” and “sell” states respectively. Table 1 shows that if the market is selling Yahoo stock, then there is a 70% chance that the market will continue to sell in the next time frame. We also see that if the market is in the buy state for Yahoo, there is a 42% chance that it will transition to selling next.

Table 1
Sell Buy
Sell 0.70 0.30
Buy 0.42 0.58

The oracle has also provided us with the stock price changes probabilities per market state. Table 2 shows that if the market is selling Yahoo, there is an 80% chance that the stock price will drop below our purchase price of $32.4 and will result in negative PnL. We also see that if the market is buying Yahoo, then there is a 10% chance that the resulting stock price will not be different from our purchase price and the PnL is zero. As I said, let’s not worry about where these probabilities come from. It will become clear later on. Note that row probabilities add to 1.0

Table 2
Down Up Unchanged
Sell 0.80 0.15 0.05
Buy 0.25 0.65 0.10

It is February 10th 2016 and the Yahoo stock price closes at $27.1. If we were to sell the stock now we would have lost $5.3. Before becoming desperate we would like to know how probable it is that we are going to keep losing money for the next three days.

Losing Money

To put this in the HMM terminology, we would like to know the probability that the next three time-step sequence realised by the model will be {down, down, down} for t=1, 2, 3. This sequence of PnL states can be given a name O=\{O_{1}, O_{2},O_{3}\}. So far the HMM model includes the market states transition probability matrix (Table 1) and the PnL observations probability matrix for each state (Table 2). Let’s call the former A and the latter B. We need one more thing to complete our HMM specification – the probability of stock market starting in either sell or buy state. Intuitively, it should be clear that the initial market state probabilities can be inferred from what is happening in Yahoo stock market on the day. But, for the sake of keeping this example more general we are going to assign the initial state probabilities as \pi=\{0.5, 0.5\}. Thus we are treating each initial state as being equally likely. So, the market is selling and we are interested to find out P(O|\lambda(A,B,\pi)). This is the probability to observe sequence O given the current HMM parameterization. It is clear that sequence O can occur under 2^3=8 different market state sequences. To solve the posed problem we need to take into account each state and all combinations of state transitions. The probability of this sequence being emitted by our HMM model is the sum over all possible state transitions and observing sequence values in each state.

Table 3
Sell, Sell, Sell
Sell, Sell, Buy
Sell, Buy, Sell
Sell, Buy, Buy
Buy, Sell, Sell
Buy, Buy, Sell
Buy, Sell, Buy
Buy, Buy, Buy

This gives rise to very long sum!

P(O|\lambda)=P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell) \\  P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\  + P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\  P(move \, to \, buy\, from \, sell)P(emit \, down \, from \, buy) \\  + ... \\  + P(\pi = buy)P(emit \, down \, from \, buy)P(move \, to \, sell \, from \, buy)P(emit \, down \, from \, sell) \\  P(move \, to \, sell \, from \, sell)P(emit \, down \, from \, sell) \\  + ... \\  + P(\pi = buy)P(emit \, down \, from \, buy)P(move \, to \, buy \, from \, buy)P(emit \, down \, from \, buy) \\  P(move \, to \, buy \, from \, buy)P(emit \, down \, from \, buy)

In total we need to consider 2*3*8=48 multiplications (there are 6 in each sum component and there are 8 sums). That is a lot and it grows very quickly. Please note that emission probability is tied to a state and can be re-written as a conditional probability of emitting an observation while in the state. If we perform this long calculation we will get P(O|\lambda=(A,B,\pi))=0.192. There is an almost 20% chance that the next three observations will be a PnL loss for us!

Oracle is No More

Neo Absorbs Fail
Neo Absorbs Fail

In the previous section we have gained some intuition about HMM parameters and some of the things the model can do for us. However, the model is hidden, so there is no access to oracle! The state and emission transition matrices we used to make projections must be learned from the data. The hidden nature of the model is inevitable, since in life we do not have access to the oracle. In life we have access to historical data/observations and a magic methods of “maximum likelihood estimation” (MLE) and Bayesian inference. The MLE essentially produces distributional parameters that maximize the probability of observing the data at hand (i.e. it gives you the parameters of the model that is most likely have had generated the data). The HMM has three parameters \lambda=(A,B,\pi). A fully specified HMM should be able to do the following:

  1. Given a sequence of observed values, provide us with a probability that this sequence was generated by the specified HMM. This can be re-phrased as the probability of the sequence occurring given the model. This is what we have calculated in the previous section.
  2. Given a sequence of observed values, provide us with the sequence of states the HMM most likely has been in to generate such values sequence.
  3. Given a sequence of observed values we should be able to adjust/correct our model parameters A, B and \pi.

When looking at the three ‘should’, we can see that there is a degree of circular dependency. That is, we need the model to do steps 1 and 2, and we need the parameters to form the model in step 3. Where do we begin? There are three main algorithms that are part of the HMM to perform the above tasks. These are: the forward&backward algorithm that helps with the 1st problem, the Viterbi algorithm that helps to solve the 2nd problem, and the Baum-Welch algorithm that puts it all together and helps to train the HMM model. Let’s discuss them next.

Going Back and Forth

That long sum we performed to calculate P(O|\lambda=(A,B,\pi)) grows exponentially in the number of states and observed values. The Forward and Backward algorithm is an optimization on the long sum. If you look back at the long sum, you should see that there are sum components that have the same sub-components in the product. For example,

P(O|\lambda)=P(\pi = sell)P(emit \, down \, from \, sell)P(move \, to \, sell \, from \, sell)...

appears twice. The HMM Forward and Backward (HMM FB) algorithm does not re-compute these, but stores the partial sums as a cache. There are N states, M discrete values that can be emitted from each of s_{i,...,N} states and i=1,...,N. There are T observations in the considered sequence O. The state transition matrix is A, where a_{i,j} is an individual entry and j=1,...,N. The emission matrix is B, where b_{j}(O_{t}) is an individual entry b_{j}(T_{t})=P(emitting \, O_{t}|q_{t}=j), and t=1,...,T, q_{t} is state at time t. For initial states we have \pi_{i}=P(q_{1}=s_{i}). The problem of finding the probability of sequence O given an HMM is P(O|\lambda=(A,B,\pi)). Formally this probability can be expressed as a sum:

P(O|\lambda)=\sum_{q_{1},q_{2},...,q_{T}}\pi_{q_{1}}b_{q_{1}}(O_{1})a_{q_{1}q_{2}}b_{q_{2}}(O_{2})a_{q_{2}q_{3}}b_{q_{3}}(O_{3})...a_{q_{T-1}q_{T}}b_{q_{T}}(O_{T})

HMM FB calculates this sum efficiently by storing the partial sum calculated up to time t. HMM FB is defined as follows:

  1. Start by initializing the “cache” \alpha_{1}(i)=\pi_{i}b_{i}(O_{1}), where 1 \le i \le N.
  2. Calculate over all remaining observation sequences and states the partial sums:
    \alpha_{t+1}(j)=\big[\sum_{j=1}^{N} \alpha_{ij}\big] b_{j}(O_{t+1}), where 1 \le t \le T-1 and 1 \le j \le N. Since at a single iteration we hold i constant, this results in a trellis-like calculation along the columns of the transition matrix.
  3. The probability is then given by P(O|\lambda)=\sum_{i=1}^{N}\alpha_{T}(i). This is where we gather the individual states probabilities together.

The above is the Forward algorithm which requires only TN^{2} calculations. The algorithm moves forward. We can imagine an algorithm that performs similar calculation, but backwards, starting from the last observation in O. Why do we need this? Is the Forward algorithm not enough? It is enough to solve the 1st poised problem. But it is not enough to solve the 3rd problem, as we will see later. So, let’s define the Backward algorithm now.

  1. Start by initializing the end “cache” \beta_{T}(i)=1, for 1 \le i \le N. This is an arbitrary assignment.
  2. Calculate over all remaining observation sequences and states the partial sums (moving back to the start of the observation sequence):
    \beta_{t}(i)=\sum_{j=1}^{N} \alpha_{ij}b_{j}(O_{t+1})\beta_{t+1}(j), where t=T-1,T-2,...,1 and 1 \le j \le N. Since at a single iteration we hold i constant, this results in a trellis-like calculation along the columns of the transition matrix. However, it is performed backwards in time.

The \beta matrix also contains probabilities of observing sequence O. However, the actual values in \beta are different from those in \alpha because of the arbitrary assignment of \beta_{T} to 1.

The State of the Affairs

We are now ready to solve the 2nd problem of the HMM – given the model and a sequence of observations, provide the sequence of states the model likely was in to generate such a sequence. Strictly speaking, we are after the optimal state sequence for the given O. Optimal often means maximum of something. At each state and emission transitions there is a node that maximizes the probability of observing a value in a state. Let’s take a closer look at the \alpha and \beta matrices we calculated for the \{down, \, down, \, down\} example.

Alpha and Beta
Example Alpha and Beta

I have circled the values that are maximum. All of these correspond to the Sell market state. Our HMM would have told us that the most likely market state sequence that produced \{down, \, down, \, down\} was \{Sell, \, Sell, \, Sell\}. Which makes sense. The HMM algorithm that solves the 2nd problem is called Viterbi Algorithm, named after its inventor Andrew Viterbi. The essence of Viterbi algorithm is what we have just done – find the path in the trellis that maximizes each node probability. It is a little bit more complex than just looking for the max, since we have to ensure that the path is valid (i.e. it is reachable in the specified HMM).

We are after the best state sequence Q=\{q_{1},q_{2},...,q_{T}\} for the given O. The best state sequence maximizes the probability of the path. We will be taking the maximum over probabilities and storing the indices of states that result in the max.

  1. Start by initializing the end “cache” \delta_{1}(j)=\pi_{i}b_{i}(O_{1}), for 1 \le i \le N. We also need to initialize to zero a variable used to track the index of the max node: \phi_{1}(i)=0.
  2. Calculate over all remaining observation sequences and states the partial max and store away the index that delivers it:
    \delta_{t}(j)=\max_{1\le i \le N} \small[ \delta_{t-1}(i)a_{ij}\small]b_{j}(O_{t}), where 2 \le t \le T and 1 \le j \le N.
    Update the sequence of indices of the max nodes as:
    \phi_{t}=argmax_{1 \le i \le N} \small[ \delta_{t-1}(i) a_{ij}\small] for j and t as above.
  3. Termination (probability and state):
    P*=\max_{1 \le i \le N} \small[\delta_{T}(i)\small]
    q_{T}*=argmax_{1 \le i \le N} \small[\delta_{T}(i)\small]
  4. The optimal state sequence Q is given by the path:
    q_{t}^{*}=\phi_{t+1}(q_{t+1}^{*}), for t=T-1,T-2,...,1

Expect the Unexpected

And now what is left is the most interesting part of the HMM – how do we estimate the model parameters from the data? We will now describe the Baum-Welch Algorithm to solve this 3rd poised problem.

The reason we introduced the Backward Algorithm is to be able to express a probability of being in some state i at time t and moving to a state j at time t+1. Imagine again the probabilities trellis. Pick a model state node at time t, use the partial sums for the probability of reaching this node, trace to some next node j at time t+1, and use all the possible state and observation paths after that until T. This gives the probability of being in state s_{t}(i) and move to s_{t+1}(j). The figure below graphically illustrates this point.

Starting in i and moving to j.
Starting in i and moving to j.

To make this transition into a proper probability, we need to scale it by all possible transitions in \lambda and O. It can now be defined as follows:

\gamma_{t}(i,j)=\frac{\alpha_{t}(i)a_{ij}b_{j}(O_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}

So, \gamma_{t}(i,j) is a probability of being in state i at time t and moving to state j at time t+1. It makes perfect sense as long as we have true estimates for \alpha, A, B and \pi. We will come to these in a moment…

Summing \gamma_{t}(i,j) across j gives the probability of being in state i at time t under O and \lambda: \gamma_{t}(i)=\sum_{j=1}^{N}\gamma_{t}(i,j). Let’s look at an example to make things clear. Let’s consider O=\{0,1,1\}=\{down, up, up\}. We will use the same A and B from Table 1 and Table 2. The alpha learned for this observation sequence is shown below:

Alpha for O={0,1,1}
Alpha for O={0,1,1}

The (un-scaled) \beta is shown below:

Beta for O={0,1,1}
Beta for O={0,1,1}

So, what is \gamma_{1}(0,1)? It is (0.7619*0.30*0.65*0.176)/0.05336=49%, where the denominator is calculated for t=1. The denominator is calculated across all i and j at t=1, thus it is a normalizing factor. If we calculate \gamma_{2}(0,1) and sum the two estimates together, we will get the expected number of transitions from state s(i) to s(j). Note, we do transition between two time-steps, but not from the final time-step as it is absorbing. Similarly, the sum over \gamma_{t}(i), where t=1,2 gives the expected number of transitions from s(i). It should now be easy to recognize that \frac{\sum_{t=1}^{T-1}\gamma_{t}(i,j)}{\sum_{t=1}^{T-1}\gamma_{t}(i)} is the transition probability a_{ij}, and this is how we estimate it:

\hat{a_{ij}}=\frac{\sum_{t=1}^{T-1}\gamma_{t}(i,j)}{\sum_{t=1}^{T-1}\gamma_{t}(i)}

We can derive the update to b_{j}(O_{t}) in a similar fashion. The matrix B stores probabilities of observing a value from O in some state. For k=1,2,...,M b_{j}(k)=\frac{\sum_{t\in\{1,...,T\}O_{t}=k}\gamma_{t}(j)}{\sum_{t=1}^{T}\gamma_{t}(j)} is the probability of observing symbol k in state j. The update rule becomes:

\hat{b_{j}}(k)=\frac{\sum_{t\in\{1,...,T\}O_{t}=k}\gamma_{t}(j)}{\sum_{t=1}^{T}\gamma_{t}(j)}

\pi stores the initial probabilities for each state. This parameter can be updated from the data as:

\pi_{i}=\gamma_{1}(i)

We now have the estimation/update rule for all parameters in \lambda. The Baum-Welch algorithm is the following:

  1. Repeat until convergence:

  2. Initialize A, B, \pi to random values such that \pi_{i}\approx \frac{1}{N}, a_{ij}\approx \frac{1}{N} and b_{j}(k)\approx \frac{1}{M}. It is important to ensure that row probabilities add up to 1 and probabilities are not uniform as this may result in convergence to a local maximum.
  3. Compute \alpha_{t}(i), \beta_{t}(i), and \gamma_{t}(j) and \gamma_{t}(i,j).
  4. Re-estimate A, B, and \pi.

The convergence can be assessed as the maximum change achieved in values of A and B between two iterations. The authors of this algorithm have proved that either the initial model defines the optimal point of the likelihood function and \lambda = \hat{\lambda}, or the converged solution provides model parameters that are more likely for a given sequence of observations O. The described algorithm is often called the expectation maximization algorithm and the observation sequence works like a pdf and is the “pooling” factor in the update of \lambda. I have described the discrete version of HMM, however there are continuous models that estimate a density from which the observation come from, rather than a discrete time-series.

I hope some of you may find this tutorial revealing and insightful. I will share the implementation of this HMM with you next time.

Reference: L.R.Rabiner. An introduction to Hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-268, 1989.

Advertisements

Gaussian Mixture Model with Application to Anomaly Detection

There are many flavors of clustering algorithms available to data scientists today. To name just a few would be to list k-means, KNN, LDA, parametric mixture models (e.g. Gaussian Mixture), hidden Markov for time-series and SOMs. Each mentioned model has been thoroughly studied and several variations have been proposed and adopted. I like clustering tasks since looking for and finding patterns in data is one of my favorite challenges.

Gaussian Mixture (GM) model is usually an unsupervised clustering model that is as easy to grasp as the k-means but has more flexibility than k-means. Fundamentally, GM is a parametric model (i.e. we assume a specific distribution for the data) that uses the Expectation Maximization (EM) algorithm to learn the parameters of the distribution. The earlier mentioned flexibility comes in a form of a co-variance matrix that gives the freedom to specify the shape and direction of each cluster. Although GM is not a mixed-membership clustering model, it does allow for overlap in which cluster a data point belongs to. This overlap is the probabilistic assignment that GM converges onto. So, how does GM generates the probabilistic assignments? GM uses EM. Take a look a this tutorial on EM I have written for kdnuggets. The tutorial has enough details and formulas for you, so I won’t repeat it here. In a nutshell, we begin with these ingredients and EM does the rest: the number of component/clusters we want to map the data to, the choice for the co-variance matrix (e.g. diagonal or full), some idea for the initial assignment of means and co-variances, as well as the cluster responsibilities and a stopping criteria. If the term ‘cluster responsibilities’ does not make sense to you, think of these as proportions of data that should be assigned to each cluster.

GM can be used for anomaly detection, and there is an abundance of academic work to support this. If the non-anomalous data is Gaussian with some mean and variance, the points that receive low probability assignments under the chosen prior may be flagged as anomalous. At its heart, anomaly detection is a different beast to classification. Most importantly because the cost function is heavily weighted towards recall rather than precision. Let’s see how GM performs on the shuttle dataset provided for outlier detection studies by Ludwig Maximilian University of Munich. In this dataset, class “2” is the outlier class. Class “1” is the dominant class and a good classifier should achieve an accuracy above 95% given the class imbalance. However, since we are testing GM on correctly predicting all instances of class “2” only, let’s glance over the false positives for now and concentrate on the pure recall. I used the scikit-learn GMM module and reduced the dataset to just the 2nd and the 5th feature, which on this dataset have the greatest explanatory powers. Also, I separately trained six models, each class against the class “2”. Note that scikit-learn GMM does not require initial assignments for mean and co-variance as it randomly performs this initialization. Also note that because of this random initialization you are not guaranteed the same outcome on each model run. You can see the code for this work on my github space.

So, how did GM do? Remarkably well. Below I am showing the results for each trained model. I also plot the data spread for class 1 and class 2 to show you that the anomaly class is not purely an outlier in this dataset. Since in anomaly detection task the cost of false negatives is more expensive than the cost of false positives, we can see that GM performed well and made a single miss-classification in a model trained on classes 7 and 2.

Data spread for class 1 and 2 (top)  and model results (bottom).
Data spread for class 1 and 2 (top) and model results (bottom).

High Performance Python

Greetings All,

I have recently come across a nice and short youtube talk given by Ben Lerner on how to improve performance of Python code. Here is a quick rundown of the techniques he mentions which I found particularly useful:

Avoid accidentally quadratic code – take advantage of the build-in data types that give O(n) for free:

The running complexity of the code below is O(n^2). The square bit comes from searching the second list for each item in the first list. Nice.
A typical job interview question by a smug interviewer will be – can we do better than that? “Yes, we can”, you can reply smiling even “smug-lier” – we will just use python sets.

import numpy as np
import time

np.random.seed(1)

list1=range(60000)
random=np.random.randint(0,59999,60000)
list2 = random.tolist() # WORST PERFORMANCE with a list
intersect=[]

start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print "List one has %s elements in common with list two" %(len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))

list2=random # BETTER PERFORMANCE with NumPy array
intersect=[]
start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print "List one has %s elements in common with list two" %(len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))

list2=set(random) # BEST PERFORMANCE WITH set
intersect=[]
start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print "List one has %s elements in common with list two" %(len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))


List in a set
List in a set

Note that the only thing we have changed is to put list2 in a set. This example is trivial and can be done with two sets and an intersection(). But sometimes you need to perform a more complex task than finding an intersection, and keeping the “set” trick in mind will be helpful. Just remember that putting a list in a set will remove duplicates.

Function calls and global variables are expensive

In general, the advice is to avoid unnecessary function calls and avoid using global variables. Function calls require stack pointer saves, and global variables stick around.

Use map/filter/reduce instead of for-loops

map applies a function to every item on an iterable (i.e. map(function,iterable))
filter is similar, and will return those items for which the function is true on an iterable.
reduce will cumulatively apply a function on an iterable to reduce it to a single item output.

Use cProfile to analyse slow code

cProfile can be used to generate code statistics to pin down that loop or that function call that slows down your code. It is very easy to use.

In-line C with Cython

The main reason NumPy is fast is that about 60% of its code is C. In-lining C with python is not hard with Cython, which adds a small number of keywords to the Python language to give access to static data types of C or C++. Finally, Cython can be used to simply compile Python into C.

Hope this gives you enough useful pointers. Happy flying!

Two Ways to Perform Linear Regression in Python with Numpy and Scikit-Learn

Greetings,

This is a short post to share two ways (there are many more) to perform pain-free linear regression in python. I will use numpy.poly1d and sklearn.linearmodel.

Using numpy.polyfit we can fit any data to a specified degree polynomial by minimizing the least square error method (LSE). LSE is the most common cost function for fitting linear models. It is calculated as a the sum of squared differences between the predicted and the actual values. So, let’s say we have some target variable y that is generated according to some function of x. In real life situations we almost never deal with a perfectly predictable function that generates the target variable, but have a lot of noise. However, let’s assume for simplicity that our function is perfectly predictable. Our task is to come up with a model (i.e. learn the polynomial coefficients and select the best polynomial degree) to explain the target variable y. The function for the data is:

f(x)=32x^4-x^2+6x-\sqrt{x}

I will start by generating the train and the test data from 1,000 points:

 

import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# implement liner regression with numpy and sklearn

# 1. Generate some random data
np.random.seed(seed=0)
size=1000
x_data = np.random.random(size)
y_data=6*x_data-32*x_data**4-x_data**2-np.sqrt(x_data) # -32x^4-x^2+6x-sqrt(x)

# 2. Split equally between test and train
x_train=x_data[:size/2]
x_test = x_data[size/2:]

y_train=y_data[:size/2]
y_test = y_data[size/2:]

 

We will test polynomials with degrees 1, 2 and 3. To use numpy.polyfit, we only need 2 lines of code. To use the sklearn, we need to transform x to a polynomial form with the degrees we are testing at. For that, we need to use sklearn.preprocessing.PolynomialFeatures. So, we proceed to generate 3 models and calculate the final LSE as follows:

 

plot_config=['bs', 'b*', 'g^']
plt.plot(x_test, y_test, 'ro', label="Actual")
# 3. Set the polynomial degree to be fitted betwee 1 and 3
top_degree=3
d_degree = np.arange(1,top_degree+1)
for degree in d_degree:
    
    poly_fit = np.poly1d(np.polyfit(x_train,y_train, degree))
    # print poly_fit.coeffs
    
    # 4. Fit the numpy polynomial
    predict=poly_fit(x_test)
    error=np.mean((predict-y_test)**2)
    
    print "Error with poly1d at degree %s is %s" %(degree, error)
    
    # 5. Create a fit a polynomial with sk-learn LinearRegression
    model = Pipeline([('poly', PolynomialFeatures(degree=degree)),('linear', linear_model.LinearRegression())])
    model=model.fit(x_train[:,np.newaxis], y_train[:,np.newaxis])
    
    predict_sk=model.predict(x_test[:,np.newaxis])
    error_sk=np.mean((predict_sk.flatten()-y_test)**2)
    
    print "Error with sk-learn.LinearRegression at degree %s is %s" %(degree, error_sk)
    
    #plt.plot(x_test, y_test, '-', label="Actual")
    plt.plot(x_test, predict, plot_config[degree-1], label="Fit "+str(degree)+ " degree poly")
    plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
               ncol=2, mode="expand", borderaxespad=0.)

plt.show()

 

Showing the final results (from numpy.polyfit only) are very good at degree 3. We could have produced an almost perfect fit at degree 4. The two method (numpy and sklearn) produce identical accuracy. Under the hood, both, sklearn and numpy.polyfit use linalg.lstsq to solve for coefficients. 

Linear Regression with numpy
Linear Regression with numpy
Compare LSE from numpy and sklearn.
Compare LSE from numpy and sklearn

K-Means Clustering and How it Relates to the Gaussian Mixture Model

Greetings,

I would like to write a post on the Gaussian Mixture models. The post would be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with KM…

K-Means Clustering

K-means clustering is the most fundamental ‘vanilla’ type clustering algorithm. The name says it all: the ‘k’ is the k-number of clusters, and the ‘mean’ is our target assignment. So, for a given data set, we come up with k estimations of means that form the cluster centers. The learning of these k centers is iterative, where the optimization criteria is to minimize the total Euclidean distance between the data points and k centers. Once we have k cluster centers, we can assign new observations to a cluster which mean/center is closest to the unobserved point.

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, the validation set can be created to adjust KM parameters. For example, the choice for k, the convergence criteria, the initial assignment of cluster centroids – all of these can be tuned with a validation set. In an unsupervised setting, the KM may need to be repeated several times to help finding the most optimal solution, since KM is very prone to the local optimum.

There are many positive aspects of KM. But, one of the not-so-good points of this clustering model is the ‘hard assignments’ to clusters. In particular, KM partitions the data space into a Voronoi tessellation. Any point that falls into some partition is assigned to that partition alone. There is no ‘maybe’ with KM…

To illustrate the KM algorithm, I have written a short example in python and posted it on my github. In the ‘k_means.py’ file you can see that I generate two sets of 2D random data using numpy multivariate_normal module. I proceed to discover the two clusters with KMeans from sklearn.cluster. The data and the discovered cluster centers are:

Data with two centroids.
Data with two centroids.

KMeans.predict module is used to generate predictions. Below I am showing side-by-side the actual clusters and the predicted ones. Any point that is closer to the left centroid will be assigned to the left cluster, and the same applies for the right cluster.

Actual vs. Predicted
Actual vs. Predicted

In reality, where the two clusters meet we would like some probabilistic assignments. That is, we would like to be able to say that there is an x% chance that the point belongs to the left cluster, but there is also a y% chance that it belongs to the right cluster. KM cannot do that. Also, in terms of cluster spread and direction, if you think of clusters as spheres, KM outputs clusters with identical shapes. Our blue and red clusters have small positive co-variance and KM cannot detect it. But the Mixture Gaussian model can. Let’s look at GM next.