Two Ways to Perform Linear Regression in Python with Numpy and Scikit-Learn


This is a short post to share two ways (there are many more) to perform pain-free linear regression in python. I will use numpy.poly1d and sklearn.linearmodel.

Using numpy.polyfit we can fit any data to a specified degree polynomial by minimizing the least square error method (LSE). LSE is the most common cost function for fitting linear models. It is calculated as a the sum of squared differences between the predicted and the actual values. So, let’s say we have some target variable y that is generated according to some function of x. In real life situations we almost never deal with a perfectly predictable function that generates the target variable, but have a lot of noise. However, let’s assume for simplicity that our function is perfectly predictable. Our task is to come up with a model (i.e. learn the polynomial coefficients and select the best polynomial degree) to explain the target variable y. The function for the data is:


I will start by generating the train and the test data from 1,000 points:


import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# implement liner regression with numpy and sklearn

# 1. Generate some random data
x_data = np.random.random(size)
y_data=6*x_data-32*x_data**4-x_data**2-np.sqrt(x_data) # -32x^4-x^2+6x-sqrt(x)

# 2. Split equally between test and train
x_test = x_data[size/2:]

y_test = y_data[size/2:]


We will test polynomials with degrees 1, 2 and 3. To use numpy.polyfit, we only need 2 lines of code. To use the sklearn, we need to transform x to a polynomial form with the degrees we are testing at. For that, we need to use sklearn.preprocessing.PolynomialFeatures. So, we proceed to generate 3 models and calculate the final LSE as follows:


plot_config=['bs', 'b*', 'g^']
plt.plot(x_test, y_test, 'ro', label="Actual")
# 3. Set the polynomial degree to be fitted betwee 1 and 3
d_degree = np.arange(1,top_degree+1)
for degree in d_degree:
    poly_fit = np.poly1d(np.polyfit(x_train,y_train, degree))
    # print poly_fit.coeffs
    # 4. Fit the numpy polynomial
    print "Error with poly1d at degree %s is %s" %(degree, error)
    # 5. Create a fit a polynomial with sk-learn LinearRegression
    model = Pipeline([('poly', PolynomialFeatures(degree=degree)),('linear', linear_model.LinearRegression())])[:,np.newaxis], y_train[:,np.newaxis])
    print "Error with sk-learn.LinearRegression at degree %s is %s" %(degree, error_sk)
    #plt.plot(x_test, y_test, '-', label="Actual")
    plt.plot(x_test, predict, plot_config[degree-1], label="Fit "+str(degree)+ " degree poly")
    plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
               ncol=2, mode="expand", borderaxespad=0.)


Showing the final results (from numpy.polyfit only) are very good at degree 3. We could have produced an almost perfect fit at degree 4. The two method (numpy and sklearn) produce identical accuracy. Under the hood, both, sklearn and numpy.polyfit use linalg.lstsq to solve for coefficients. 

Linear Regression with numpy
Linear Regression with numpy
Compare LSE from numpy and sklearn.
Compare LSE from numpy and sklearn

K-Means Clustering and How it Relates to the Gaussian Mixture Model


I would like to write a post on the Gaussian Mixture models. The post would be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with KM…

K-Means Clustering

K-means clustering is the most fundamental ‘vanilla’ type clustering algorithm. The name says it all: the ‘k’ is the k-number of clusters, and the ‘mean’ is our target assignment. So, for a given data set, we come up with k estimations of means that form the cluster centers. The learning of these k centers is iterative, where the optimization criteria is to minimize the total Euclidean distance between the data points and k centers. Once we have k cluster centers, we can assign new observations to a cluster which mean/center is closest to the unobserved point.

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, the validation set can be created to adjust KM parameters. For example, the choice for k, the convergence criteria, the initial assignment of cluster centroids – all of these can be tuned with a validation set. In an unsupervised setting, the KM may need to be repeated several times to help finding the most optimal solution, since KM is very prone to the local optimum.

There are many positive aspects of KM. But, one of the not-so-good points of this clustering model is the ‘hard assignments’ to clusters. In particular, KM partitions the data space into a Voronoi tessellation. Any point that falls into some partition is assigned to that partition alone. There is no ‘maybe’ with KM…

To illustrate the KM algorithm, I have written a short example in python and posted it on my github. In the ‘’ file you can see that I generate two sets of 2D random data using numpy multivariate_normal module. I proceed to discover the two clusters with KMeans from sklearn.cluster. The data and the discovered cluster centers are:

Data with two centroids.
Data with two centroids.

KMeans.predict module is used to generate predictions. Below I am showing side-by-side the actual clusters and the predicted ones. Any point that is closer to the left centroid will be assigned to the left cluster, and the same applies for the right cluster.

Actual vs. Predicted
Actual vs. Predicted

In reality, where the two clusters meet we would like some probabilistic assignments. That is, we would like to be able to say that there is an x% chance that the point belongs to the left cluster, but there is also a y% chance that it belongs to the right cluster. KM cannot do that. Also, in terms of cluster spread and direction, if you think of clusters as spheres, KM outputs clusters with identical shapes. Our blue and red clusters have small positive co-variance and KM cannot detect it. But the Mixture Gaussian model can. Let’s look at GM next.

Learning Predictive Rules on the Poker Hand Data Set

Hello again,

In my last post I have shared a rather simple python code to build a decision tree classifier to recognize a hand in a poker game. The simplicity of my solution stemmed from the fact that I added features that guided the tree model to learn accurate class representations. In particular, I added binary features to signal ‘four-of-a-kind’, ‘three-of-a-kind’, etc. I was able to add these features because I had a full domain knowledge. But what could I do if I did not have any idea of what these features should be?

Well, the authors of the data set have published several papers on this. One of them is Evolutionary Data Mining with Automatic Rule Generalization from 2002 where they introduced improvements made to the RAGA system capable of extracting predictive rules. They used evolutionary search algorithm (genetic algorithm and genetic programming) to synthesize new rules, select rules with best support and coverage, and apply a generalization step to the best rules. Essentially the recipe includes the following functionality:

  1. Synthesis of population of rules and pruning invalid rules.
  2. Fitness evaluation function.
  3. Rules cross-over and mutation.
  4. Rules generalization function.

The rules synthesis can be achieved with a combination of features on “=” and “!=”. The fitness evaluation can be a standard rules support and confidence metrics. The cross-over and mutation can be a step where single features are “AND”ed together to create longer rules. The rule-generalization step is the most complex. It would involve grouping commonly occurring item-sets into new symbols. For example, if we have several rules that include three ranks being the same, we could combine it into a single item-set:

Sets combination
Sets combination

After several generations, such symbols should evolve to more general rules (i.e. no reference to ‘7’, just R1=R2=R3). Here, the introduction of new symbols into the item set is the feature engineering step that I performed manually in my decision tree code.

Learning Binary Decision Trees from the Poker Hand Data Set

Greetings, my blog readers!

What a year this has been so far! I am finally finding some time and energy to write new posts on my blog. I am really looking forward to blogging more about machine learning and data science!

In this post I will share with you my approach to building a model to learn to recognize a poker hand. This problem and the training and testing data are available from the UCI machine learning repository here. My approach to this task is based on recognizing the following about the data:

  1. The training set of all positive examples is incomplete.
  2. It is possible to express all positive examples as a set of if/else/then rules.
  3. Given the training data and a classifier capable of learning the high-order rules, we should be able to achieve a 100% accuracy on test data. Thus the fact that the set of positive examples is incomplete does not matter.
  4. To evaluate our classifier we can stick with a simple accuracy ratio since our target accuracy is 100%.

The poker hand data set

There are 10 classes in the poker data set. Class 0 represents “nothing in hand” and is the largest class. The set of positive examples can be extended if we generate an exhaustive set of combinations representing a class. For example, we can fully specify the Flush by listing all possibilities of five cards of the same suit. We would need to list \frac{13!}{5!(13-5)!}\cdot 4 -40=5108 combinations, where minus 40 comes from subtracting the 40 possibilities of a straight flush.

I am not going to list all possibilities. Instead, I will train several binary decision tree models. Each tree will recognize one class vs. the rest. What do you think will happen if we present the raw training data to a decision tree model? Take, for example, scikit-learn DecisionTreeClassifier. Because our data is numeric, the tree will contain branches and nodes that do not represent meaningful rules that can be translated to categorical data. In other words, the decision tree model will try to locally optimize the splitting criteria and fail to see the big picture. If we want to use the decision tree model, we need to add more features. Specifically, we need to add features that will let the tree to spot the higher order separations that exist in this data.

Engineering more features

So, what features do we need to add? We need to add binary features indicating the Flush, 4 of a kind, 3 of a kind, a pair, full house, etc. Classes 6 and 8 can be learned by combining the features we are going to introduce. One can say that features engineering for this problem is cheating as it defeats the purpose of the task. I agree with this completely, and the main lesson of this post is that when presented with such data as the poker hand data set, the best choice of the classifier must take into account the higher-order abstracts. If adding new features is not possible, then using an un/supervised approach to first discover such constructs may be required.

Python code

Let’s take a look at a possible solution in Python (2.7). You can download it from my github space here. The “” contains the solution. I have provided enough comments to make it clear what I am doing at each step. Please let me know if you can think of a more elegant way to concatenate_all_ranks and suits. The code has an ability to graph the trees using graphiz. Here is an example of a tree to pick-out the Royal Flush:

Royal Flush tree
Royal Flush tree

The code achieves 100% accuracy. No wonder! We told the tree what to learn…