Test-Driven Data Analysis (and its possible application to the LS Regression)

I recently attended a PyData meetup in London where Nicolas Radcliffe gave a nice talk on the concept of Test-Driven Data Analysis (TDDA). Here is a link to the slides he presented.

Essentially, the idea behind TDDA grew out of the well-known practice of Test-Driven Software Development (TDSD, or simply TDD). In TDSD, the test cases are written before the core code. This is often done to avoid developing solutions for specific scenarios and missing the edge cases. Take division, for example. Everyone (hopefully) provides for division by zero, but what about integer division when a floating-point number is expected? How many times have you been caught out by this in a dynamic programming language like Python, staring at the code and wondering why every number comes out as zero? Developers who use TDSD specify up front how the code should behave and what constraints are placed on its results.
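As a quick illustration of that division gotcha (Python 2 semantics; in Python 3 the / operator already returns a float):

# Python 2: dividing two integers silently floors the result
print 7 / 2          # prints 3, not the 3.5 you probably expected
print 7 / float(2)   # prints 3.5
print 7.0 / 2        # prints 3.5 -- making either operand a float fixes it
# (or place "from __future__ import division" at the top of the module)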

Right now the practice of data science and data analysis very much lacks the structure of TDSD. I liked Nicolas’s 5th slide, which schematically shows the pillars of the data scientist’s workflow: trying, eyeballing and trying again until it makes sense. However, each stage of data analysis is prone to error, from choosing the modelling approach all the way down to interpreting the results. Errors can be made in data pre-processing, model implementation, output gathering and even when plotting graphs. One of the main points of the talk was reproducible research: if every analysis is wrapped in a verification procedure that ensures the same outputs are produced for the same inputs, analysis bugs can be greatly reduced.

Another strong point of the talk was about using constraints. Much of test-driven software development relies on constraints placed on the allowed inputs and outputs. In data analysis, such constraints can be placed on data types, data value ranges, the legality of duplicates, or even statistical and distributional properties. For example, can the value of an insurance claim be negative? What about a negative correlation between insurance costs and the values of the claims?
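To make this concrete, here is a minimal sketch of such constraints written as plain assertions; the pandas DataFrame claims and its column names are hypothetical, purely for illustration:

import pandas as pd

def check_claims_constraints(claims):
    # type constraint: claim values must be numeric
    assert pd.api.types.is_numeric_dtype(claims["claim_value"])
    # value-range constraint: an insurance claim cannot be negative
    assert (claims["claim_value"] >= 0).all()
    # duplicates constraint: claim IDs must be unique
    assert claims["claim_id"].is_unique
    # distributional constraint: claim values and insurance costs
    # should not be negatively correlated
    assert claims["claim_value"].corr(claims["insurance_cost"]) >= 0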

I am hugely in favor of everything that makes software development more disciplined and the final result more robust, so you can see why I liked Nicolas’s ideas so much. It made me think about how and where I can incorporate TDDA in my work. Let’s take the workhorse of data science – least squares (standard multiple) regression. Can TDDA safeguard against common mistakes in applying regression and interpreting its results? I think it can. The quality of a regression analysis, and the ability to reproduce and reuse the regression model results, will depend on the following parameters, all of which can serve as TDDA constraints (a minimal sketch of such checks follows the list):

  • Inputs: number of input cases vs. the number of independent variables: too many or too few input cases can invalidate regression analysis. When the number of input cases is small compared to the number of independent variables (i.e. the attributes of the regression), the regression converges on a perfect but meaningless solution. Another possibility is that the variance-covariance matrix becomes non-invertible. Conversely, when one tries to fit a regression to too much data, almost any amount of correlation between the dependent and the independent variables can become statistically significant. A large number of false positives is an unfortunate side-effect of big data used in the wrong way. A possible constraint is a simple check that the number of cases is ≥ 50 + 8 * ‘# of independent variables’.
  • Inputs: presence of singularity and multicollinearity: two regression attributes are singular if one of them is redundant (e.g. age and year of birth). Multicollinear attributes are highly correlated and, in the case of standard multiple regression (as opposed to, say, step-wise regression), can significantly distort or reduce the regression coefficients.
  • Inputs: presence of outliers: everyone knows what outliers can do to statistics like the mean and standard deviation. Likewise, outliers can significantly impact the regression coefficients. A constraint on the absence of outliers can be placed on the data.
  • Output: normality and linearity of residuals: if the regression was performed on non-transformed data, the residuals should always be examined. Skewed residuals may point to an absence of normality in the input data, and non-normal input data may invalidate your prediction intervals if the regression model is used to make predictions. Non-linear residuals may indicate that the originally suggested linear relationship between the dependent and the independent variables is inaccurate. Absence of normality is often dealt with by transforming the data (e.g. taking square roots or logarithms). Non-linearity requires revisiting the original model (e.g. squaring or cubing the attributes).
  • Output: homoscedasticity of residuals: the absence of homoscedasticity between the error and the predicted variable may occur if the response is related to some transformation of the dependent variable instead of its original form. In other words, the spread of the residuals varies across the range of the predicted values. For example, the relationship between age and income is very likely to show this pattern, since the income of young people varies less than the income of people who are over forty. I should note that the presence of heteroscedasticity does not invalidate the regression results, but it may weaken the strength of their interpretation. Again, data transformation can come to the rescue (e.g. taking the logarithm).
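Below is a minimal sketch of how some of these constraints could be checked with numpy/scipy around a regression fit; the specific thresholds are illustrative assumptions, not hard rules.

import numpy as np
from scipy import stats

def check_regression_constraints(X, y, residuals):
    """X is an (n_cases, n_vars) matrix of independent variables."""
    n_cases, n_vars = X.shape

    # 1. enough input cases: n >= 50 + 8 * number of independent variables
    assert n_cases >= 50 + 8 * n_vars, "too few cases for a reliable regression"

    # 2. singularity / multicollinearity: flag near-perfectly correlated attributes
    corr = np.corrcoef(X, rowvar=False)
    off_diagonal = corr[~np.eye(n_vars, dtype=bool)]
    assert np.all(np.abs(off_diagonal) < 0.95), "singular or multicollinear attributes"

    # 3. outliers: no input value more than 4 standard deviations from its column mean
    z_scores = np.abs(stats.zscore(X, axis=0))
    assert np.all(z_scores < 4.0), "outliers present in the inputs"

    # 4. normality of residuals (Shapiro-Wilk test, rejecting at the 5% level)
    _, p_value = stats.shapiro(residuals)
    assert p_value > 0.05, "residuals do not look normally distributed"

    # 5. rough homoscedasticity check: residual spread should be comparable
    #    in the lower and upper halves of the fitted values
    fitted = y - residuals
    low, high = np.array_split(residuals[np.argsort(fitted)], 2)
    assert 0.5 < np.std(low) / np.std(high) < 2.0, "residuals look heteroscedastic"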

Here you go – we have come up with five constraints that can be placed on regression inputs and outputs to ensure robust data analysis. Luckily, most analytical platforms like SAS or SPSS have built-in checks that flag when these constraints are not satisfied. But if you are playing with a regression tool from another package – watch out. There is no need to increase that figure of 90% (see slide # 10)…

Code Like a Girl

Today is Blog Action Day, when all bloggers are encouraged to blog about issues that are important to them, primarily to raise awareness among their readers. You may gather from the title of this blog what issue this post will be about.

We live in interesting times. The rights that women enjoy around the world are very unevenly distributed. Take, for example, Sweden or Norway, where talking about gender division almost no longer makes sense – they have largely broken down and extinguished gender bias in their societies. Then look at the Middle East. I find it mind-boggling that in this day and age a woman cannot walk down a street unsupervised by a man. What can she possibly do if she is unsupervised? Commit a crime, disgrace the family? Most likely just run away…

Anyway, the main point of this blog is not as far-reaching as the problems of the Middle East. Instead I will aim closer to home. A few years ago Always ran a #LikeAGirl campaign to highlight the inherent prejudice women face on a daily basis. When participants were asked to do something “like a girl”, they would emphasize the weaker, feebler, and less confident characteristics of whatever they were asked to portray. And it was not just the male participants. Women of all ages took part, and they too interpreted “like a girl” as meaning weaker, less capable. No one responded to the request by saying: “Like a girl? You mean just normally, like a human being who just happens to be a young female?”.

So, why is this happening? And to what extent are women responsible for it? In my view, this is happening because women and girls allow it to happen, and they bear much of the responsibility for the extent of the stigma. Note that this is not about blame. It is a conclusion based on what I have personally experienced and observed. I too went through a period of self-doubt, questioning whether Computer Science was the right career choice for a female. I also used to feel uncomfortable in university computer labs occupied primarily by guys. I rejected the idea of blending in by trying to be more manly, but I did avoid wearing earrings or make-up in class so as not to emphasize the obvious. I did talk myself out of participating in coding competitions because almost all contestants would be men. Therefore, I too am responsible for carrying the gender divide around with me in a pocket for daily use, for a long time. But, luckily, there was a point at which I stopped, because I got tired of comparing myself and questioning my decisions. I also got tired of seeing how easily both men and women apply the double standards. If you are a girl, do everything like a girl. Live your whole life like a girl. Make sure you do emphasize the obvious and don’t try to blend in. “Embracing who you are” has become such a cliché statement, but it concisely summarizes what I am saying. Drop the gender divide yourself before you ask others to do the same. And lastly, code like a girl too. A lot.

Python gotchas

Here is the thing – I am a big fan of the Python programming language. Now that there is an Intel distribution of Python, I don’t think I ever want to write in any other language again…

Having said that, Python has its moments. Most of the examples below are based on the Fluent Python book by Luciano Ramalho. I highly recommend it to all Python programmers.

Here are the “gotchas” I am talking about:

***********************************
*   Leaking Variables 
*   Times what? 
*   An Inside Job
*   Deeply Shallow
*   Out of Order
*   We are all Sharing  
***********************************

Leaking Variables

In Python 2.x, variables created inside a list comprehension leak into the enclosing scope, offering a nasty surprise.

x = "I need you later"
ctoten = [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
abs_ctoten = [abs(x) for x in ctoten]
print("Oh no! ", x) # prints x to be -10

Note that this problem does not exist in generator expressions (aka. genexps):

y = "abcde"
w = "see you later"
upper_y = array.array('c',(str.upper(w) for w in y))
print ("Still here: ", w) # prints w to be "see you later"

Times what?

Let’s say I need a string of 20 a’s. I can simply create it like this:

twenty_as = "a"*20

Great. I now need a list of three lists. I proceed to create it with * and end up with another surprise!

abc_list = [['a', 'b','c']]*3
print abc_list
abc_list[1][1]='x'
print abc_list  # prints [['a', 'x', 'c'], ['a', 'x', 'c'], ['a', 'x', 'c']]

This happens because abc_list is made of three references to the same [‘a’, ‘b’, ‘c’] list. The solution is to ensure that each inner list is a separate/new copy:

abc_list = [['a', 'b','c'] for i in range(3)]

An Inside Job

Tuples are immutable, and one can take advantage of this when immutability is required. However, if you put a mutable object inside a tuple, keep in mind that it can still be changed.

imm = (1, 2)
imm[0] += 1   # throws a TypeError: a tuple slot cannot be rebound
imm2 = (1, 2, [3, 4])
imm2[2] += [10]  # modifies the inner list in place AND still throws a TypeError
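The exception comes from the attempt to re-assign the tuple slot, not from mutating the list itself; calling the list’s own methods changes it without any complaint:

imm2 = (1, 2, [3, 4])
imm2[2].append(10)  # no exception: the tuple still refers to the same list object
print imm2          # prints (1, 2, [3, 4, 10])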

Deeply Shallow

You did not think I was going to write a post on Python’s dark corners without touching on deep copying, did you?
Here is a nice little trick: creating a shallow copy with the slicing operator. It works the first time, but fails the second time, when we actually need a deep copy.

list1 = [1,2,3]
list2 = list1[:] # shallow copy
list2[2] = 5

print ([(l, k) for l, k in zip(list1, list2)]) # all good

list1 = [1, 2, 3, [8,9]]
list2=list1[:]  # shallow copy again
list2[3][0] = 7

print ([(l, k) for l, k in zip(list1, list2)]) # shows that both are modified
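When nested structures need to be truly independent, copy.deepcopy is the usual fix; here is a short sketch of the same example with a deep copy:

import copy

list1 = [1, 2, 3, [8, 9]]
list2 = copy.deepcopy(list1)  # recursively copies the nested list as well
list2[3][0] = 7

print ([(l, k) for l, k in zip(list1, list2)])  # shows that only list2 is modified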

Out of Order

Unless you are using collections.OrderedDict, the order of a Python dict’s keys and values cannot be relied on. This has to do with how Python’s dicts are stored in memory. Also, dict equality is determined on the basis of key-item pairs, not their order in the dict. Take a look at the example below: the output of this code is implementation dependent. Finally, adding new items to a dict is likely to reorder the keys. Python’s sets also do not guarantee that a particular order will be maintained. There is no “orderedset” in the standard library, but if you need one, you can find a PyPI package (e.g. orderedset).

import copy

FRUIT_CODES = [
    ("orange", 1),
    ("apple", 45),
    ("banana", 70),
    ("grapes", 81),
    ("pineapple", 86),
    ("kiwi", 52),
    ("papaya", 413),
    ("mango", 55),
    ("lemon", 62),
    ("nectarine", 910)
]

orig = copy.copy(FRUIT_CODES)
sorted1 = sorted(FRUIT_CODES, key=lambda x:x[0])
sorted2 = sorted(FRUIT_CODES, key=lambda x:x[1])

fruit_dict = dict(FRUIT_CODES)
fruit_sorted_dict1 = dict(sorted1)
fruit_sorted_dict2 = dict(sorted2)

print fruit_dict.keys() == fruit_sorted_dict1.keys() and fruit_sorted_dict1.keys() == fruit_sorted_dict2.keys() # prints False or True (implementation dependent)
print fruit_dict == fruit_sorted_dict1 and fruit_sorted_dict1 == fruit_sorted_dict2 # prints True
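If key order matters, collections.OrderedDict (mentioned above) preserves insertion order regardless of the implementation; here is a short sketch reusing the same data:

from collections import OrderedDict

fruit_ordered = OrderedDict(FRUIT_CODES)
fruit_ordered_sorted = OrderedDict(sorted1)

print list(fruit_ordered.keys())[:3]         # prints ['orange', 'apple', 'banana']
print list(fruit_ordered_sorted.keys())[:3]  # prints ['apple', 'banana', 'grapes']
print fruit_ordered == fruit_ordered_sorted  # prints False: for OrderedDict, order matters in equality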

We are all Sharing

In Python, mutable types are passed to functions by sharing. This means that a function/method can modify the parameter it receives, but it cannot replace it with another object from the caller’s point of view. Here is a typical “gotcha” with functions being able to modify their parameters:

def plusone(my_list):
    my_list.append(1)  # can modify
    

def newlife(my_list, your_list):
    my_list = your_list  # rebinds the local name only; the caller's object is unchanged

first_list = [2, 3, 4]
plusone(first_list)
print first_list # prints [2, 3, 4, 1]

second_list = [5, 6, 7]
newlife(first_list, second_list)
print first_list # prints [2, 3, 4, 1]
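If the intention really is to replace the caller’s list, either return the new list from the function or overwrite the contents in place with slice assignment; a small sketch:

def newlife_inplace(my_list, your_list):
    my_list[:] = your_list  # mutates the caller's object in place, so the change is visible outside

newlife_inplace(first_list, second_list)
print first_list # prints [5, 6, 7]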

This should give you enough “food for thought”. Happy programming everyone! 🙂

samplepy – a new Python Sampling Package

Hello my blog readers,

This post introduces a new Python package, samplepy. The package was written to simplify the sampling tasks that so often creep up in machine learning. It implements Importance, Rejection and Metropolis-Hastings sampling algorithms.

samplepy has a very simple API. The package can be installed with pip by simply running pip install samplepy. Once installed, you can use it to sample from any univariate distribution as follows (the example shows rejection sampling):

 

from samplepy import Rejection
import matplotlib.pyplot as plt
import numpy as np

# define a unimodal function to sample under
f = lambda x: 2.0*np.exp(-2.0*x)
# instantiate Rejection sampling with f and required interval
rej = Rejection(f, [0.01, 3.0])
# create a sample of 10K points
sample = rej.sample(10000, 1)  

# plot the original function and the created sample set
x = np.arange(0.01, 3.0, (3.0-0.01)/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="f(x)=2.0*exp(-2*x)")
plt.legend(loc=1)
plt.show()

 

Figure: sample from f(x)=2.0*exp(-2*x) over [0.01, 3.0]

The three sampling methods (i.e. Rejection, Importance and MH) are quite different and will produce slightly different results for the same function. Performance is another important differentiator, with Metropolis-Hastings probably being the slowest. Let’s compare how the three sampling algorithms perform on a bi-modal univariate function:

f(x)=\exp(-x^{2})(2+\sin(5x)+\sin(2x))

 

from samplepy import Rejection, Importance, MH
import matplotlib.pyplot as plt
import numpy as np


f = lambda x: np.exp(-1.0*x**2)*(2.0+np.sin(5.0*x)+np.sin(2.0*x))
interval = [-3.0, 3.0]
rej = Rejection(f, interval)  # instantiate Rejection sampling with f and interval
sample = rej.sample(10000, 1)    # create a sample of 10K points

x = np.arange(interval[0], interval[1], (interval[1]-interval[0])/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Rejection")
plt.legend(loc=1)
plt.show()

mh = MH(f,interval)
sample = mh.sample(20000, 100, 1)  # Make sure we have enough points in the sample!

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="MH")
plt.legend(loc=1)
plt.show()

imp = Importance(f, interval)
sample = imp.sample(10000, 0.0001, 0.0010) # create a sample where essentially no extra importance is given to any quantile

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Importance")
plt.legend(loc=1)
plt.show()

Figures: histograms of the Rejection, MH and Importance samples plotted against f(x).

Hopefully this gives you enough examples to get you started using samplepy!

A Tutorial on Hidden Markov Model with a Stock Price Example – Part 2

This is the 2nd part of the tutorial on Hidden Markov Models. In this post we will look at a possible implementation of the described algorithms and estimate the model’s performance on the Yahoo stock price time series.

Implementation of HMM in Python

I am providing an example implementation on my GitHub space. Please note that all code is provided with a disclaimer that you are free to use it at your own risk. HMM.py contains the main implementation. There are a few things to point out in this file:

  1. As is pointed out in the document referenced in part 1, \alpha and \beta need to be scaled for longer observation sequences, since the product of probabilities quickly tends to zero, resulting in underflow. The code has a flag for scaling, which defaults to on.
  2. While testing out this model I noticed that the initial assignment of \hat{a}_{ij} and \hat{b}_{j}(k) makes a big difference to the final solution. I am providing two possible assignments. One, as described in part 1, sets the values approximately equal, summing to 1. The other uses a Dirichlet distribution to create a non-uniform assignment that still sums to 1 (see the sketch after this list). I’ve noticed that the former results in more meaningful model parameters.
  3. You will notice a small hack in the HMMBaumWelch method where I set M to the maximum observation in O. This is needed because not all observation sequences will contain all values (i.e. 0, 1, 2), while the transition and emission matrices are accessed as if all values exist.
  4. I am using the yahoo_finance Python module to source the stock prices for Yahoo. It seems to work and is pretty easy to use. You should be able to download this module here.
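For point 2 above, here is a minimal sketch of the two kinds of initial assignment; the variable names and shapes are illustrative and not necessarily those used in HMM.py:

import numpy as np

N = 2  # number of hidden states (illustrative)
M = 3  # number of distinct observation values (e.g. down / flat / up)

# near-uniform assignment: rows approximately equal and summing to 1
a_hat = np.full((N, N), 1.0 / N) + np.random.uniform(-0.01, 0.01, (N, N))
a_hat /= a_hat.sum(axis=1, keepdims=True)

# Dirichlet assignment: non-uniform rows that still sum to 1
a_hat_dirichlet = np.random.dirichlet(np.ones(N), size=N)
b_hat_dirichlet = np.random.dirichlet(np.ones(M), size=N)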

HMM Model performance to predict Yahoo stock price move

On my GitHub space, HMM_test.py contains a possible test example. I am testing the model as follows: train the model on a specified window of daily historical moves (e.g. 10 days), then use the fitted model parameters to determine the predicted current model state. Then, using the predicted state, determine the next most likely state from \hat{a}:

prediction_state = np.argmax(a[path[-1],:])

The most likely emitted value from that state can be found as follows:

prediction = np.argmax(b[prediction_state,:])

I then compare the predicted move to the historical one. So, how did the HMM do? Well, not so well. The model parameters \lambda=(A, B, \pi) are very sensitive to the convergence tolerance, the initial assignment and, obviously, the training time window. Calculating the accuracy ratio as the fraction of correctly predicted directions gives pretty much 50%-56% (tests on more recent data produce the higher accuracy in this range). Thus, we might as well be tossing a coin to make buy or sell predictions. It is possible that EOD price moves are not granular enough to provide a coherent model of market dynamics.
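For completeness, the accuracy ratio above amounts to something like the following; predicted_moves and actual_moves are hypothetical arrays of predicted and realised daily directions, not output from HMM_test.py:

import numpy as np

predicted_moves = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # hypothetical predicted directions (1 = up, 0 = down)
actual_moves    = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])  # hypothetical realised directions

accuracy_ratio = np.mean(predicted_moves == actual_moves)
print "accuracy ratio: %.2f" % accuracy_ratio  # prints 0.60 for this toy data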