Hidden Technical Debt of Machine Learning – Play Now Pay Later

Last week I was lucky enough to attend the Strata Conference London 2017 for one day. The venue and the event are impressive in scale, attendance and content. The quality of the tutorials and talks was, in general, very good, and I walked away with a few new ideas I wanted to share on my blog.

One of the most important lessons from the conference for me came from a reference to the NIPS’15 paper titled Hidden Technical Debt in Machine Learning Systems, written by Google researchers. The paper is about the long-term maintenance costs introduced by building machine learning (ML) models and systems. The argument is that such cost is hidden, as it is not immediately apparent at the point of putting an ML model into production. It is important for data scientists to be aware of the complexity of the models they develop, the impact these models will have on their organisation, and how much they will cost to maintain.

According to the authors, there are three levels of technical complexity that contribute to technical debt in ML: the model itself can be complex and respond non-linearly to changes in its parameters; the model can take input from otherwise disparate systems; and the model’s output or its behavior can be complex and difficult to predict before it is released.

ML Model Complexity

ML models entangle input signals from different systems, making it difficult to avoid the CACE principle: Changing Anything Changes Everything. This principle applies to all aspects of ML, from parameters (think xgboost!) to input data, convergence thresholds and sampling methods. Isolating modelling components and serving them separately is one of the proposed solutions.

The Cost of Data Dependencies

Large ML systems have large and complex data dependencies, where data quality and any data assumptions can significantly affect the ML system output. ML system input data can be unstable, meaning it changes qualitatively and quantitatively over time. In some cases, the degree of dependency on one set of data vs. another may change. ML systems are also unusual in that their data dependencies tend to be fine-grained (e.g. an input should not just be an integer, but an integer in a certain range). A lot of thinking, and possibly investment, should go into understanding and controlling such dependencies. Check out kensu.io – a start-up company I came across at the conference and the creators of Adalog – a product designed for exactly this task.
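To make the fine-grained dependency point concrete, here is a minimal sketch of the kind of input check such a product might automate. The DataFrame and the age column with its [0, 120] range are made-up examples of mine, not something from the paper or from Adalog:

import pandas as pd

def validate_age(df: pd.DataFrame) -> pd.DataFrame:
    # The dependency is not just "age is an integer" but
    # "age is an integer within a plausible range".
    if not pd.api.types.is_integer_dtype(df['age']):
        raise TypeError("age must be an integer column")
    out_of_range = ~df['age'].between(0, 120)
    if out_of_range.any():
        raise ValueError("%d age values fall outside [0, 120]" % out_of_range.sum())
    return df

validate_age(pd.DataFrame({'age': [25, 40, 67]}))  # passes silently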

The Feedback Loop and Dealing with Changes

Live ML systems learn in real time and influence their own behavior. Sometimes it is necessary to choose static parameters, like prediction thresholds, for a model that is trained or parameterised on data that is dynamic in nature. This can leave the previous set of thresholds invalid on updated data. The authors highlight that comprehensive monitoring of ML system behavior is critical for long-term system reliability.
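As a minimal sketch of what such monitoring could look like, the snippet below compares live prediction scores against summary statistics stored at training time. The three-standard-deviation alert threshold is an arbitrary choice of mine, not a recommendation from the paper:

import numpy as np

def check_score_drift(live_scores, train_mean, train_std, tol=3.0):
    # Alert if the mean live score drifts more than `tol` training
    # standard deviations away from the training-time mean.
    live_mean = float(np.mean(live_scores))
    drift = abs(live_mean - train_mean) / max(train_std, 1e-12)
    if drift > tol:
        print("ALERT: mean score drifted by %.1f training SDs" % drift)
    return drift

check_score_drift(np.random.rand(1000), train_mean=0.5, train_std=0.05)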

In summary, maintainable ML systems are costly and require an even higher level of technical competence and foresight among their developers. ML model testing, validation and monitoring should be considered an absolute must in organisations that are eager to reap their full benefits.

(Jan-17) Did You Know That?

A brand new idea for my blog in 2017 is a monthly Did You Know That digest, where I am going to share with you m things (where m<=3) that I recently learnt and found useful. I am going to keep these digests short and simple, so as not to overwhelm you with verbiage and unnecessary detail. This month’s top three Did you know that? entries are:

  • scikit-learn SGDClassifier – one learner, many tricks up its sleeve;
  • GraphViz is integrated in scikit-learn – no need to import it separately!
  • Zeppelin notebook from Apache – worth a look if you are into Python notebooks;

scikit-learn SGDClassifier

This classifier implements regularized linear models trained with stochastic gradient descent (SGD). The loss parameter controls which model is fitted: for example, loss='hinge' gives a linear SVM, and loss='log' gives logistic regression. When should you use it? When your training data set does not fit into memory. Note that SGDClassifier also allows mini-batch (out-of-core) learning via its partial_fit method.
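Here is a minimal sketch of out-of-core learning with SGDClassifier. The random chunks stand in for batches you would read from disk, and note that newer scikit-learn releases have renamed loss='log' to loss='log_loss':

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
clf = SGDClassifier(loss='log', random_state=42)  # linear SVM would be loss='hinge'

classes = np.array([0, 1])  # all classes must be declared on the first call
for _ in range(10):  # pretend each iteration reads a chunk from disk
    X_chunk = rng.randn(100, 5)
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

print(clf.predict(rng.randn(3, 5)))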

GraphViz is Integrated in scikit-learn Decision Trees

If you read all my blog posts, you may have come across this one, where I put together some code to train a binary decision tree to recognize a hand in poker. The code is available on my github space. If you read the code, you will see that I defined a graph_decision_tree method with all the hoops to graph and save the images. But did you know that you don’t need to do all this work, since sklearn.tree has an export_graphviz function? If dsTree is an instance of DecisionTreeClassifier, then one can simply do:

from sklearn.tree import export_graphviz

export_graphviz(dsTree, out_file='dsTree.dot',
                feature_names=['feature1', 'feature2'])

The .dot file can then be converted to a .png file (if you have GraphViz installed) like this:

dot -Tpng dsTree.dot -o dsTree.png
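If you would rather stay inside Python, and assuming you also have the graphviz Python package installed (a separate install from the GraphViz binaries), a sketch like this should work too:

import graphviz

with open('dsTree.dot') as f:
    src = graphviz.Source(f.read())
src.format = 'png'    # render to PNG instead of the default PDF
src.render('dsTree')  # writes dsTree.png next to the .dot file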

Zeppelin Notebook from Apache

If you are using Apache Spark, you may be glad to learn that Apache has a notebook to go along with it. The Zeppelin notebook offers similar functionality to Jupyter in terms of data visualization, paragraph writing and notebook sharing. I recommend that you check it out.

Going from Sum of Squared Errors to the Maximum Likelihood

Greetings, my blog readers!

In this post I will tackle the link between the cost function and the maximum likelihood hypothesis. The idea for this post came to me when I was reading Python Machine Learning by Sebastian Raschka, specifically the chapter on learning the weights of the logistic regression classifier. I like this book and strongly recommend it to anyone interested in the field. It has solid Python coding examples, many practical insights and good coverage of what scikit-learn has to offer.

So, in chapter 3, in the section on Learning the weights of the logistic cost function, the author seamlessly moves from the need to minimize the sum-squared-error cost function to maximizing the (log) likelihood. I think it is very important to have a clear intuitive understanding of why the two are equivalent in the case of regression. So, let’s dive in!

Sum of Squared Errors

Using notation from [2], if z^{(i)} is the net input for the i-th training sample, then \phi (z^{(i)}) is the target class predicted by the learning hypothesis \phi. It is clear that we want to minimize the distance between the predicted and the actual class value y^{(i)}, thus the sum-squared-error is defined as \frac{1}{2}\sum_{i}(y^{(i)}-\phi (z^{(i)}))^{2}. The 1/2 in front of the sum is added to simplify the first derivative, which is taken to find the coefficients of the regression when performing gradient descent. One could just as easily define the cost function without the 1/2 term.
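To see why the 1/2 is convenient, take the partial derivative of the sum with respect to a weight w_{j} (a step I am spelling out here; it is not part of the original argument). The factor of 2 produced by the chain rule cancels the 1/2:

\frac{\partial}{\partial w_{j}} \frac{1}{2}\sum_{i}(y^{(i)}-\phi (z^{(i)}))^{2} = -\sum_{i}(y^{(i)}-\phi (z^{(i)})) \frac{\partial \phi (z^{(i)})}{\partial w_{j}}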

The sum-squared error is a cost function. We want to find a classifier (alternatively, a learning hypothesis) that minimizes this cost. Let’s call it J. This is a very intuitive optimal value problem describing what we are after. It is like asking TomTom to find the shortest route from London to Manchester to save money on fuel. No problem! Done. Gradient descent (batch or stochastic) will produce a vector of weights \mathbf{w}, which when multiplied into the matrix of training input variables \mathbf{x} gives the optimal solution. So, the meaning of the equation below should now be, hopefully, clear:

J(w)= \frac{1}{2}\sum_{i}(y^{(i)}-\phi (z^{(i)}))^{2}   (1)
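For completeness (my addition, in standard notation rather than anything specific to [2]): gradient descent finds this minimum by repeatedly stepping against the gradient of J, where \eta is the learning rate:

\mathbf{w} := \mathbf{w} - \eta \nabla J(\mathbf{w})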

The MAP Hypothesis

The MAP hypothesis is the maximum a posteriori hypothesis. It is the most probable hypothesis given the observed data (note, I am using material from chapter 6.2 of [1] for this section). Out of all the candidate hypotheses in \Phi, we would like to find the most probable hypothesis \phi that fits the training data Y well:

h_{MAP} \equiv argmax_{\phi \in \Phi} P(\phi | Y)    (2)

It should become intuitively obvious that, in the case of a regression model, the most probable hypothesis is also the one that minimizes J. In (2) we have a conditional probability. Using Bayes’ theorem, we can develop it further as:

h_{MAP} \equiv argmax_{\phi \in \Phi} P(\phi | Y)

= argmax_{\phi \in \Phi} \frac{P(Y|\phi)P(\phi)}{P(Y)}

= argmax_{\phi \in \Phi} P(Y|\phi)P(\phi)   (3)

P(Y) is the probability of observing the training data. We drop this term because it is a constant, independent of \phi. P(\phi) is the prior probability of a given hypothesis. If we assume that every hypothesis in \Phi is equally probable, then we can drop the P(\phi) term from (3) as well (as it too becomes a constant). We are left with the maximum likelihood hypothesis: the hypothesis under which the likelihood of observing the training data is maximized:

h_{ML} \equiv argmax_{\phi \in \Phi} P(Y|\phi)   (4)

How did we go from looking for the most probable hypothesis given the data to the most probable data given the hypothesis? A hypothesis that assigns the highest probability to the observed training data is also the one that is most accurate about it. Being the most accurate implies having the best fit (if we were to generate new data under the MAP hypothesis, it would fit the training data best among all possible hypotheses). Thus, MAP and ML are equivalent here.

Bringing in the Binary Nature of Logistic Regression

Under the assumption of n independent training data points, we can rewrite (4) as a product over all observations:

h_{ML} \equiv argmax_{\phi \in \Phi} \prod_{i=1}^{n} P(y^{(i)}|\phi)   (5)

Because logistic regression is a binary classifier, each data point belongs to either the positive or the negative target class. We can use the Bernoulli distribution to model this probability:

 h_{ML} \equiv argmax_{\phi \in \Phi} \prod_{i=1}^{n}  \left( \phi(z^{(i)}) \right) ^{y^{(i)}} \left( 1-\phi(z^{(i)}) \right) ^{1-y^{(i)}}     (6)

Let’s recollect that in logistic regression the hypothesis is the sigmoid (logistic) function applied to a weighted sum of predictors and coefficients. Also, a single hypothesis \phi consists of a set of coefficients \mathbf{w}, and the objective of h_{ML} is to find the coefficients that maximize (6). The product in (6) is awkward to optimize directly: it is numerically unstable and cumbersome to differentiate. Taking the logarithm converts the product into a sum, which plays much more nicely with an optimization procedure like gradient descent (and, negated, yields a convex cost function for logistic regression). Taking the log of (6) gives the log-likelihood:
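l(\mathbf{w}) = \sum_{i=1}^{n} \left[ y^{(i)} \log(\phi (z^{(i)})) + (1-y^{(i)}) \log(1- \phi(z^{(i)})) \right]

Negating it turns the maximization into a minimization and gives the familiar formula for the cost function J, which can also be found in chapter 3 of [2]: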

 J(w) = \sum_{i=1}^{n}  \left[  -y^{(i)} \log(\phi (z^{(i)})) - (1-y^{(i)}) \log(1- \phi(z^{(i)})) \right]     (7)

Summary

In this post I made an attempt to show you the connection between the commonly seen sum-of-squared-errors cost function (which is minimized) and the maximum likelihood hypothesis (which is maximized).

References:

[1] Tom Mitchell, Machine Learning, McGraw-Hill, 1997.*

[2] Sebastian Raschka, Python Machine Learning, Packt Publishing, 2015.

* If there is one book on machine learning that you should read, it should be this one.

It is all in the Optimization Function

At one of the data science meetups I recently attended, a question was posed about when AI would reach the level of human thinking. I was surprised to see people raising their hands in response to the years 2030, 2045, 2065, … I did not raise my hand at all, because I don’t believe it will happen. Ever.

Am I naive? Ill-informed? No, I would like to think not. I totally respect the fact that the human brain’s wiring is simple and boils down to fat, water and electricity. We think we are smart, but we really aren’t. A computer program can be written to mimic us, and several successful examples already exist. We are easily fooled by such examples and attribute more intelligence and feeling to them than we ought to. I myself briefly thought that the two robots programmed by a Tufts University research team to “look out” for each other really did care about one another. Their cute voices had something to do with it. The robots are driven by some optimal reward function, which they are programmed to optimize. The robots don’t care about each other. True universal care is too broad to be “coded up”.

Cat Pictures Please is a great short science fiction story written by Naomi Kritzer. It is written as the inner monologue of an AI system that was developed to help people out. I like this story for two reasons. Firstly, it is a good example of the most basic difference between humans and AI – we are lazy, irrational and slow, while the AI system is logical, methodical and fast. Secondly, it highlights what is at the core of all AI – a reward or a payoff that must be optimized. In Cat Pictures Please, a part of the AI’s reward somehow becomes pictures of cats… If a robot works out that charging itself generates the greatest long-term reward, it will end up charging itself most of the time. Note, I am writing ‘works out’, but I really mean ‘converges on’. If humans mostly boil down to fat, water and electricity, then AI boils down to search and an optimization function.

Searching is what evolutionary computing, reinforcement learning, gradient descent/ascent (and thus pretty much all types of regression) and unsupervised learning are about. Minimizing some cost function, or alternatively maximizing a reward function, is what can make an AI system “happy”. Both must be accurately programmed to work. Undeniably, many functions can be automated and perfected with a search-for-the-best-reward approach, so many things are within AI’s reach. I am not saying that if autonomous weapons are unleashed upon us, we are going to be just fine. But what I do believe is that AI will never be able to truly think like us. It will never be able to act and rely on luck or its gut feeling. It will never be able to demonstrate such a degree of delusion that its bluffing can achieve results, as humans can. AI will never become superstitious, doctrinal and lazy. Even if AI one day becomes a genius, it won’t reach humanity’s heights of stupidity. As Albert Einstein once said, the difference between stupidity and genius is that genius has its limits.