Going from Sum of Squared Errors to the Maximum Likelihood

In this post I will tackle the link between the cost function and the maximum likelihood hypothesis. The idea for this post came to me while I was reading Python Machine Learning by Sebastian Raschka, specifically the chapter on learning the weights of a logistic regression classifier. I like this book and strongly recommend it to anyone interested in the field. It has solid Python coding examples, many practical insights, and good coverage of what scikit-learn has to offer.

So, in chapter 3, in the section on learning the weights of the logistic cost function, the author seamlessly moves from the need to minimize the sum-squared-error cost function to maximizing the (log) likelihood. I think it is very important to have a clear intuitive understanding of why the two are equivalent in the case of regression. So, let’s dive in!

Sum of Squared Errors

Using notation from [2], if $z^{(i)}$ is the net input for the $i$-th training sample (the weighted sum of its features), then $\phi (z^{(i)})$ is the class value predicted by the learning hypothesis $\phi$. We want to minimize the distance between the predicted value and the actual class value $y^{(i)}$, so the sum-squared-error is defined as $\frac{1}{2}\sum_{i}(y^{(i)}-\phi (z^{(i)}))^{2}$. The 1/2 in front of the sum is added to simplify the first derivative, which is taken to find the coefficients of the regression when performing gradient descent; one could just as easily define the cost function without the 1/2 term.

The sum-squared error is a cost function; let’s call it $J$. We want to find a classifier (alternatively, a learning hypothesis) that minimizes this cost. This is a very intuitive optimization problem describing what we are after. It is like asking TomTom to find the shortest route from London to Manchester to save money on fuel. No problem! Done. Gradient descent (batch or stochastic) will produce a vector of weights $\mathbf{w}$ which, when multiplied into the matrix of training input variables $\mathbf{x}$, gives the optimal solution. So the meaning of the equation below should now, hopefully, be clear:

$J(w)= \frac{1}{2}\sum_{i}(y^{(i)}-\phi (z^{(i)}))^{2}$   (1)
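
To make (1) concrete, here is a minimal Python sketch of the cost (the function name sse_cost is my own, not from [2]): it takes the actual class values and the hypothesis outputs $\phi(z^{(i)})$ and returns $J$.

```python
def sse_cost(y, phi_z):
    """Sum-squared-error cost from (1): half the sum of squared residuals."""
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, phi_z))

# A perfect fit drives the cost to zero; any deviation increases it.
print(sse_cost([1, 0, 1], [1.0, 0.0, 1.0]))  # 0.0
print(sse_cost([1, 0, 1], [0.9, 0.2, 0.8]))  # 0.5 * (0.01 + 0.04 + 0.04) ≈ 0.045
```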

The MAP Hypothesis

The MAP hypothesis is the maximum a posteriori hypothesis: the most probable hypothesis given the observed data (I am using material from chapter 6.2 of [1] for this section). Out of all the candidate hypotheses in $\Phi$, we would like to find the most probable hypothesis $\phi$ that fits the training data $Y$ well.

$h_{MAP} \equiv argmax_{\phi \in \Phi} P(\phi | Y)$    (2)

It should be intuitively obvious that, in the case of a regression model, the most probable hypothesis is also the one that minimizes $J$. In (2) we have a conditional probability. Using Bayes’ theorem, we can develop it further:

$h_{MAP} \equiv argmax_{\phi \in \Phi} P(\phi | Y)$

$= argmax_{\phi \in \Phi} \frac{P(Y|\phi)P(\phi)}{P(Y)}$

$= argmax_{\phi \in \Phi} P(Y|\phi)P(\phi)$   (3)

$P(Y)$ is the probability of observing the training data. We drop this term because it is a constant, independent of $\phi$. $P(\phi)$ is the prior probability of a hypothesis. If we assume that every hypothesis in $\Phi$ is equally probable, then we can drop the $P(\phi)$ term from (3) as well (as it too becomes a constant). We are left with the maximum likelihood hypothesis: the hypothesis under which the likelihood of observing the training data is maximized:

$h_{ML} \equiv argmax_{\phi \in \Phi} P(Y|\phi)$   (4)

How did we go from looking for a hypothesis given the data to the data given the hypothesis? A hypothesis that assigns the highest probability to the observed training data is also the one that is most accurate about it. Being the most accurate implies having the best fit (if we were to generate new data under the MAP hypothesis, it would fit the training data best among all possible hypotheses). Thus, MAP and ML are equivalent here.

Bringing in the Binary Nature of Logistic Regression

Under the assumption of $n$ independent training data points, we can rewrite (4) as a product over all observations:

$h_{ML} \equiv argmax_{\phi \in \Phi} \prod_{i=1}^{n} P(y_{i}|\phi)$   (5)

Because logistic regression is a binary classifier, each data point belongs to either the positive or the negative target class. We can use the Bernoulli distribution to model this probability:

$h_{ML} \equiv argmax_{\phi \in \Phi} \prod_{i=1}^{n} \left( \phi(z^{(i)}) \right) ^{y^{(i)}} \left( 1-\phi(z^{(i)}) \right) ^{1-y^{(i)}}$     (6)
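
To see what (6) computes, here is a small sketch (the helper name likelihood is mine, and the labels and probabilities are made up) that evaluates the Bernoulli likelihood for toy data. The exponent trick simply selects $\phi(z^{(i)})$ when $y^{(i)}=1$ and $1-\phi(z^{(i)})$ when $y^{(i)}=0$.

```python
def likelihood(y, phi_z):
    """Bernoulli likelihood from (6): product of phi(z)^y * (1 - phi(z))^(1 - y)."""
    p = 1.0
    for yi, pi in zip(y, phi_z):
        p *= (pi ** yi) * ((1 - pi) ** (1 - yi))
    return p

# For y = [1, 0, 1] the factors are 0.9, (1 - 0.2) and 0.8:
print(likelihood([1, 0, 1], [0.9, 0.2, 0.8]))  # 0.9 * 0.8 * 0.8 ≈ 0.576
```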

Let’s recall that in logistic regression the hypothesis is a logistic (sigmoid) function applied to a weighted sum of predictors and coefficients. A single hypothesis $\phi$ thus corresponds to a set of coefficients $\mathbf{w}$, and the objective of $h_{ML}$ is to find the coefficients that maximize the likelihood. Maximizing the product in (6) directly is awkward: it is numerically unstable and hard to differentiate, and the sum-squared-error cost applied to a sigmoid is not convex, so gradient descent can get stuck in local minima. Taking the logarithm turns the product into a sum and yields a convex cost function. Taking the log of (6) and negating it gives the familiar formula for the cost function $J$, which can also be found in chapter 3 of [2]:

$J(w) = \sum_{i=1}^{n} \left[ -y^{(i)} \log(\phi (z^{(i)})) - (1-y^{(i)}) \log(1- \phi(z^{(i)})) \right]$     (7)
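
As a quick numeric sanity check (a sketch with made-up numbers, not code from [2]), the cost in (7) is exactly the negated log of the likelihood product in (6):

```python
import math

def cost_J(y, phi_z):
    """J(w) from (7): the negative log of the Bernoulli likelihood in (6)."""
    return sum(-yi * math.log(pi) - (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, phi_z))

y = [1, 0, 1]
phi_z = [0.9, 0.2, 0.8]

# Likelihood from (6), computed directly as a product over the samples.
lik = 1.0
for yi, pi in zip(y, phi_z):
    lik *= (pi ** yi) * ((1 - pi) ** (1 - yi))

print(math.isclose(cost_J(y, phi_z), -math.log(lik)))  # True
```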

Summary

In this post I made an attempt to show you the connection between the commonly seen sum-of-squared-errors cost function (which is minimized) and the maximum likelihood hypothesis (which is maximized).

References:

[1] Tom Mitchell, Machine Learning, McGraw-Hill, 1997.*

[2] Sebastian Raschka, Python Machine Learning, Packt Publishing, 2015.

* If there is one book on machine learning that you should read, it should be this one.

How To Make Your Mark As A Woman In Big Data

A great post from KDnuggets.com:

I am a data scientist

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term at that time, as I felt that we already had a data science, and it was called Statistic…

Source: I am a data scientist

Code Like a Girl

Today is Blog Action Day, when all bloggers are encouraged to blog about issues that are important to them, primarily to raise awareness among their readers. You may gather from the title of this blog what issue this post will be about.

We live in interesting times. The rights that women enjoy around the world are very unevenly distributed. Take, for example, Sweden or Norway, where talking about gender division almost no longer makes sense – they have largely broken down and extinguished gender bias in their societies. Then look at the Middle East. I find it mind-boggling that in this day and age a woman cannot walk in a street unsupervised by a man. What can she possibly do if she is unsupervised? Commit a crime, disgrace the family? Most likely just run away…

Anyway, the main point of this blog post is not as far-reaching as the problems of the Middle East; instead, I will aim closer to home. A few years ago Always ran a #LikeAGirl campaign to highlight the inherent prejudice women face on a daily basis. When participants were asked to do something “like a girl”, they would emphasize the weaker, feebler, and less confident characteristics of whatever they were asked to portray. And it was not just the male participants. Women of all ages took part, and they too interpreted “like a girl” as meaning weaker, less capable. No one responded to the request by saying: “Like a girl? You mean just normally, like a human being who just happens to be a young female?”

So, why is this happening? And to what extent are women responsible for it? In my view, this is happening because women and girls allow it to happen, and they bear much responsibility for the extent of the stigma. Note that this is not blame; it is a conclusion based on what I have personally experienced and observed. I too went through a period of self-doubt, questioning whether Computer Science was the right career choice for a female. I also used to feel uncomfortable in university computer labs occupied primarily by guys. I rejected the idea of blending in by trying to be more manly, but I did avoid wearing earrings or make-up in class so as not to emphasize the obvious. I did talk myself out of participating in coding competitions because almost all contestants would be men. Therefore, I too am responsible for carrying the gender divide around with me in a pocket, for daily use, for a long time. But, luckily, there was a point at which I stopped, because I got tired of comparing myself and questioning my decisions. I also got tired of seeing how easily both men and women apply double standards. If you are a girl, do everything like a girl. Live your whole life like a girl. Make sure you do emphasize the obvious and don’t try to blend in. “Embracing who you are” has become such a cliché, but it concisely summarizes what I am saying. Drop the gender divide yourself before you ask others to do the same. And lastly, code like a girl too. A lot.

Python gotchas

Here is the thing – I am a big fan of the Python programming language. Now that there is an Intel distribution of Python, I don’t think I ever want to write in any other language again…

Having said that, Python has its moments. Most of the examples below are based on the Fluent Python book by Luciano Ramalho. I highly recommend it to all Python programmers.

Here are the “gotchas” I am talking about:

* Leaking Variables
* Times what?
* An Inside Job
* Deeply Shallow
* Out of Order
* We are all Sharing

Leaking Variables

In Python 2.x, variables created inside a list comprehension leak into the enclosing scope, offering a nasty surprise (Python 3 fixed this by giving comprehensions their own scope).

x = "I need you later"
ctoten = [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
abs_ctoten = [abs(x) for x in ctoten]
print("Oh no! ", x)  # x is now -10: the comprehension overwrote it


Note that this problem does not exist in generator expressions (a.k.a. genexps):

import array

y = "abcde"
w = "see you later"
upper_y = array.array('c', (str.upper(w) for w in y))  # the 'c' typecode is Python 2 only
print("Still here: ", w)  # w is still "see you later"



Times what?

Let’s say I need a string of 20 a’s. I can simply create it like this:

twenty_as = "a"*20


Great. I now need a list of three lists. I proceed to create it with * and end up with another surprise!

abc_list = [['a', 'b', 'c']] * 3
print abc_list
abc_list[1][1] = 'x'
print abc_list  # prints [['a', 'x', 'c'], ['a', 'x', 'c'], ['a', 'x', 'c']]



This happens because abc_list is made of three references to the same [‘a’, ‘b’, ‘c’] list. The solution is to ensure that each inner list is a separate, new copy:

abc_list = [['a', 'b','c'] for i in range(3)]


An Inside Job

Tuples are immutable, and one can take advantage of this when immutability is required. However, if you put a mutable object inside a tuple, keep in mind that it can still be changed.

imm = (1, 2)
imm[0] += 1  # throws a TypeError
imm2 = (1, 2, [3, 4])
imm2[2] += [10]  # modifies the inner list AND throws a TypeError


Deeply Shallow

You did not think I was going to write a post on Python’s dark corners without touching on deep copying, did you?
Here is a nice little trick to create a shallow copy with the slicing operator. It works the first time, but fails the second time, when we need a deep copy instead.

list1 = [1, 2, 3]
list2 = list1[:]  # shallow copy
list2[2] = 5

print([(l, k) for l, k in zip(list1, list2)])  # all good

list1 = [1, 2, 3, [8, 9]]
list2 = list1[:]  # shallow copy again
list2[3][0] = 7

print([(l, k) for l, k in zip(list1, list2)])  # shows that both are modified
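
For the second case, the standard library’s copy.deepcopy does what we need: it copies nested objects recursively, so the inner list is duplicated too.

```python
import copy

list1 = [1, 2, 3, [8, 9]]
list2 = copy.deepcopy(list1)  # recursively copies the nested list as well
list2[3][0] = 7

print(list1)  # [1, 2, 3, [8, 9]] - the original is untouched this time
print(list2)  # [1, 2, 3, [7, 9]]
```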


Out of Order

Unless you are using collections.OrderedDict, the order of a Python dict’s keys and values cannot be relied on (this was true up to Python 3.6; since Python 3.7, dicts officially preserve insertion order). This has to do with how dicts are stored in memory. Also, dict equality is determined on the basis of key-value pairs, not their order. Take a look at the example below; the output of this code is implementation dependent. Finally, adding new items to a dict is likely to reorder the keys. Python’s sets also do not guarantee that a particular order will be maintained. There is no “orderedset” in the standard library, but if you need one, you can find a PyPI package (e.g. orderedset).

import copy

FRUIT_CODES = [
    ("orange", 1),
    ("apple", 45),
    ("banana", 70),
    ("grapes", 81),
    ("pineapple", 86),
    ("kiwi", 52),
    ("papaya", 413),
    ("mango", 55),
    ("lemon", 62),
    ("nectarine", 910)
]

orig = copy.copy(FRUIT_CODES)
sorted1 = sorted(FRUIT_CODES, key=lambda x: x[0])
sorted2 = sorted(FRUIT_CODES, key=lambda x: x[1])

fruit_dict = dict(FRUIT_CODES)
fruit_sorted_dict1 = dict(sorted1)
fruit_sorted_dict2 = dict(sorted2)

print fruit_dict.keys() == fruit_sorted_dict1.keys() and fruit_sorted_dict1.keys() == fruit_sorted_dict2.keys()  # prints False or True (implementation dependent)
print fruit_dict == fruit_sorted_dict1 and fruit_sorted_dict1 == fruit_sorted_dict2  # prints True
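
When the order does matter, collections.OrderedDict keeps keys in insertion order and, unlike a plain dict, compares order-sensitively. A small sketch using a subset of the fruit codes above:

```python
from collections import OrderedDict

od1 = OrderedDict([("orange", 1), ("apple", 45), ("banana", 70)])
od1["kiwi"] = 52  # new keys always go to the end

print(list(od1.keys()))  # ['orange', 'apple', 'banana', 'kiwi']

# Same key-value pairs in a different order: plain dicts would be equal,
# but OrderedDict comparison is order-sensitive.
od2 = OrderedDict([("apple", 45), ("orange", 1), ("banana", 70), ("kiwi", 52)])
print(od1 == od2)              # False
print(dict(od1) == dict(od2))  # True
```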



We are all Sharing

In Python, mutable types are passed to functions by sharing. This means that a function/method can modify its parameters, but it cannot rebind them to another object in the caller. Here is a typical “gotcha” with a function modifying its parameter:

def plusone(my_list):
    my_list.append(1)  # can modify the shared object

def newlife(my_list, your_list):
    my_list = your_list  # only rebinds the local name; the caller is unaffected

first_list = [2, 3, 4]
plusone(first_list)
print first_list  # prints [2, 3, 4, 1]

second_list = [5, 6, 7]
newlife(first_list, second_list)
print first_list  # still prints [2, 3, 4, 1]


This should give you enough “food for thought”. Happy programming everyone! 🙂