Don’t Get in a Pickle with a Python namedtuple

In this post I will show you what happens when you try to pickle an object that contains a Python namedtuple.

Python’s namedtuple is a high-performance data type that lets us define a custom type which behaves like a tuple. For example, the following piece of code defines a new type Viewer, creates an instance of it and initialises its attributes:

    from collections import namedtuple

    Viewer = namedtuple('Viewer', 'gender age points')
    viewer = Viewer('X', 25, 356)

In the above, line 3 defines a new type Viewer, and line 4 defines and initialises a new variable viewer of type Viewer. viewer behaves like a tuple in the sense that it has the built-in methods count() and index() and allows access to its attributes by index or by name. For example:

    print(viewer[2])         # prints 356
    print(viewer.age)        # prints 25
    print(viewer.count('X')) # prints 1

Note that, unlike with a list or a dict, working with a namedtuple takes two steps: (1) define the new type, and (2) create a new instance of it. These are the same two steps we follow when we work with classes, because a namedtuple type is just a dynamically created and named class. But how exactly does this dynamic part work? When we define a new type (line 3 in the first code snippet), we are actually calling the factory function namedtuple, which does the dynamic ‘stuff’ for us: it returns a subclass of tuple that carries the name we specify in the function call.
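To make the factory idea concrete, here is a small sketch that inspects what the namedtuple call returns. It reuses the Viewer type from the snippet above and relies only on the standard library.

```python
from collections import namedtuple

# The factory call builds and returns a brand-new class object.
Viewer = namedtuple('Viewer', 'gender age points')

print(type(Viewer))               # <class 'type'>: the result is a class
print(issubclass(Viewer, tuple))  # True: that class derives from tuple
print(Viewer.__name__)            # Viewer: named after the first argument
print(Viewer._fields)             # ('gender', 'age', 'points')

# Instances are therefore ordinary tuples with named fields on top.
viewer = Viewer('X', 25, 356)
print(isinstance(viewer, tuple))  # True
```

So the first argument becomes the class name and the second one becomes the field names, which is all the "dynamic" part amounts to.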

Let’s see what happens when we create a class with a namedtuple member.

import pickle
from collections import namedtuple
import datetime as dt


class ViewerClass(object):

    # class-level type definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

    def get_start_date(self):
        return self._my_vt.start_date

    def get_monthly_views(self):
        return self._my_vt.mon_views

    def get_monthly_streams(self):
        return self._my_vt.mon_streams

    def get_registration_details(self):
        return (
            'Name:'
            + self._my_vt.name
            + ' DOB:'
            + str(self._my_vt.dob)
        )

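    def update_monthly_stream(self, new_mon_streams):
        # namedtuples are immutable, so build an updated copy with _replace
        self._my_vt = self._my_vt._replace(
            mon_streams=new_mon_streams
        )

    def update_monthly_views(self, new_mon_views):
        self._my_vt = self._my_vt._replace(mon_views=new_mon_views)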


if __name__ == '__main__':

    viewer1 = ViewerClass(
        dt.date(2019, 1, 1),
        5,
        6234.80,
        'John',
        dt.date(1989, 12, 3),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer1.get_registration_details(),
            viewer1.get_monthly_streams(),
        )
    )

    viewer2 = ViewerClass(
        dt.date(2019, 2, 1),
        5,
        5234.80,
        'Mary',
        dt.date(1989, 11, 11),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer2.get_registration_details(),
            viewer2.get_monthly_streams(),
        )
    )

    print(type(viewer1))
    print(type(viewer1._my_vt))

The output of the last two print statements points to a problem that will occur if we try to pickle the viewer objects:

    <class '__main__.ViewerClass'>
    <class '__main__.vt'>

It turns out that the protected variable _my_vt is of type ‘__main__.vt’ rather than ‘__main__.ViewerClass.vt’. If we try to pickle viewer1, we get this error:

_pickle.PicklingError: Can't pickle <class '__main__.vt'>: attribute lookup vt on __main__ failed

This error makes sense: pickle stores a class by name and looks that name up in the module where the class was defined. vt is not defined at the level of __main__ but inside __main__.ViewerClass, so the lookup of vt on __main__ fails.
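The failure can be reproduced in a few lines. The sketch below uses a cut-down, two-field version of vt (an abbreviation of mine, not the full class from the listing) and catches the exception rather than letting it propagate.

```python
import pickle
from collections import namedtuple


class ViewerClass(object):
    # class-level definition, as in the listing above (fields shortened)
    vt = namedtuple('vt', 'start_date mon_views')


record = ViewerClass.vt('2019-01-01', 5)

# This is the name pickle will try to resolve inside the defining module.
print(ViewerClass.vt.__qualname__)   # vt, not ViewerClass.vt

try:
    pickle.dumps(record)
    pickled_ok = True
except pickle.PicklingError as err:
    pickled_ok = False
    print('Pickling failed:', err)

print(pickled_ok)                    # False
```

The qualified name recorded on the class is plain vt, so pickle searches the module for a top-level vt and does not find one.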

There are several ways to fix this.

First, we can move the definition of vt out of ViewerClass to module level (here, __main__). This lets pickle find vt at the level where it looks for it:

# module-level type definition
vt = namedtuple(
    'vt', 'start_date mon_views mon_streams name dob'
)


class ViewerClass(object):
    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = vt(
            start_date, mon_views, mon_streams, name, dob
        )

    ...

The second solution involves changing the special attribute __qualname__ of vt so that it includes the class name:

import pickle
from collections import namedtuple
import datetime as dt


class ViewerClass(object):

    # class-level definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )
    vt.__qualname__ = 'ViewerClass.vt'

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

    ...

This fixes the issue and makes viewer1._my_vt of type ‘__main__.ViewerClass.vt’, under which pickle can look it up.
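As a quick check of this second fix, the sketch below sets __qualname__ on a shortened two-field vt (my abbreviation) and round-trips an instance through pickle. The round trip sits under the usual __main__ guard because it assumes the code runs as a script, so that pickle can find ViewerClass in the __main__ module.

```python
import pickle
from collections import namedtuple


class ViewerClass(object):
    vt = namedtuple('vt', 'start_date mon_views')
    # tell pickle to look the type up as ViewerClass.vt
    vt.__qualname__ = 'ViewerClass.vt'


print(ViewerClass.vt.__qualname__)   # ViewerClass.vt

if __name__ == '__main__':
    record = ViewerClass.vt('2019-01-01', 5)
    restored = pickle.loads(pickle.dumps(record))
    print(restored == record)                  # True
    print(type(restored) is ViewerClass.vt)    # True
```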

I must say that I prefer the first solution, since with the second one sub-classing from ViewerClass may prove to be problematic, and we should avoid modifying special attributes such as __qualname__.


(Jan-17) Did You Know That?

A brand new idea for my blog in 2017 is a monthly Did You Know That digest where I am going to share with you m things (where m<=3) that I recently learnt and found to be useful. I am going to keep such digests short and simple, so as not to overwhelm you with verbiage and unnecessary details. This month’s top 3 Did you know that? are:

  • scikit-learn SGDClassifier – one learner, many tricks up its sleeve;
  • GraphViz is integrated in scikit-learn – no need to import it separately!
  • Zeppelin notebook from Apache – worth a look if you are into Python notebooks;

scikit-learn SGDClassifier

This is a classifier that fits a family of linear models with stochastic gradient descent. The loss parameter controls which model is trained and used for classification. For example, loss='hinge' gives a linear SVM, and loss='log' (spelled 'log_loss' in recent scikit-learn releases) gives a logistic regression. When should you use it? When your training data set does not fit into memory. Note that SGD also allows mini-batch learning (via partial_fit).

GraphViz is Integrated in scikit-learn Decision Trees

If you have read my earlier blog posts, you may have come across the one where I put together some code to train a binary decision tree to recognize a hand in poker. The code is available on my github space. If you read the code, you will see that I defined a graph_decision_tree method that jumps through all the hoops needed to graph the tree and save the images. But did you know that you don’t need to do all this work, since sklearn.tree has an export_graphviz function? If dsTree is a fitted instance of DecisionTreeClassifier, then one can simply do:

from sklearn.tree import export_graphviz

export_graphviz(dsTree, out_file='dsTree.dot',
       feature_names=['feature1', 'feature2'])

The .dot file can be converted to a .png file (if you have installed GraphViz) like this:

dot -Tpng dsTree.dot -o dsTree.png

Zeppelin Notebook from Apache

If you are using Apache Spark you may be glad to learn that Apache has a notebook to go along with it. Zeppelin notebook offers similar functionality to Jupyter in terms of data visualization, paragraph writing and notebook sharing. I recommend that you check it out.

Python gotchas

Here is the thing – I am a big fan of the Python programming language. Now that there is an Intel distribution of Python, I don’t think I ever want to write in any other language again…

Having said that, Python has its moments. Most of the examples below are based on the book Fluent Python by Luciano Ramalho. I highly recommend it to all Python programmers.

Here are some “gotchas” I am talking about:

***********************************
*   Leaking Variables 
*   Times what? 
*   An Inside Job
*   Deeply Shallow
*   Out of Order
*   We are all Sharing  
***********************************

Leaking Variables

In Python 2.x, variables created inside a list comprehension leak into the enclosing scope, which can be a nasty surprise.

x = "I need you later"
ctoten = [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
abs_ctoten = [abs(x) for x in ctoten]
print("Oh no! ", x) # prints x to be -10

Note that this problem does not exist in generator expressions (a.k.a. genexps):

import array

y = "abcde"
w = "see you later"
upper_y = array.array('c', (str.upper(w) for w in y))  # the 'c' typecode is Python 2 only
print("Still here: ", w) # prints w to be "see you later"
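For comparison, here is the same experiment as a small sketch for Python 3, where list comprehensions have their own scope and no longer leak. The shortened ctoten list and the after_* names are mine.

```python
x = "I need you later"
ctoten = [-1, -2, -3, -4, -5]

abs_ctoten = [abs(x) for x in ctoten]  # this x lives only inside the comprehension
after_comprehension = x
print(after_comprehension)             # I need you later

for x in ctoten:                       # a plain for loop does rebind the outer x
    pass
after_loop = x
print(after_loop)                      # -5
```

The leak is gone for comprehensions, but an ordinary for loop still overwrites a variable of the same name.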

Times what?

Let’s say I need a string of 20 a’s. I can simply create it like this:

twenty_as = "a"*20

Great. I now need a list of three lists. I proceed to create it with * and end up with another surprise!

abc_list = [['a', 'b','c']]*3
print(abc_list)
abc_list[1][1]='x'
print(abc_list)  # prints [['a', 'x', 'c'], ['a', 'x', 'c'], ['a', 'x', 'c']]

This happens because abc_list is made of three references to the same ['a', 'b', 'c'] list. The solution is to ensure that each inner list is a separate, new copy:

abc_list = [['a', 'b','c'] for i in range(3)]
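The difference between the two constructions shows up directly with an identity check. In this sketch the names shared and separate are mine:

```python
shared = [['a', 'b', 'c']] * 3                   # three references to one list
separate = [['a', 'b', 'c'] for _ in range(3)]   # three independent lists

print(shared[0] is shared[1])      # True
print(separate[0] is separate[1])  # False

shared[1][1] = 'x'
separate[1][1] = 'x'
print(shared)    # [['a', 'x', 'c'], ['a', 'x', 'c'], ['a', 'x', 'c']]
print(separate)  # [['a', 'b', 'c'], ['a', 'x', 'c'], ['a', 'b', 'c']]
```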

An Inside Job

Tuples are immutable, and one can take advantage of this when immutability is required. However, if you put a mutable object inside a tuple, keep in mind that it can still be changed.

imm = (1,2)
imm[0]+=1 # raises TypeError
imm2 = (1, 2, [3, 4])
imm2[2]+=[10] # modifies the inner list and still raises TypeError
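The last line is worth unpacking: += first extends the inner list in place and only then tries to store the result back into the tuple, which is the step that fails. A small sketch that catches the exception and shows both effects:

```python
imm2 = (1, 2, [3, 4])

try:
    imm2[2] += [10]          # extends the list, then fails on the tuple assignment
except TypeError as err:
    print('TypeError:', err)

print(imm2)                  # (1, 2, [3, 4, 10])

# Mutating the inner list through its own methods raises nothing at all.
imm2[2].extend([11])
print(imm2)                  # (1, 2, [3, 4, 10, 11])
```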

Deeply Shallow

You did not think I was going to write a post on Python’s dark corners without touching on deep copying, did you?
Here is a nice little trick for you to create a shallow copy with a slicing operator. It works the first time, but fails the second time when we need a deep copy instead.

list1 = [1,2,3]
list2 = list1[:] # shallow copy
list2[2] = 5

print ([(l, k) for l, k in zip(list1, list2)]) # all good

list1 = [1, 2, 3, [8,9]]
list2=list1[:]  # shallow copy again
list2[3][0] = 7

print ([(l, k) for l, k in zip(list1, list2)]) # shows that both are modified
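When the list contains other mutable objects, copy.deepcopy from the standard library gives fully independent copies. A minimal sketch, with the names shallow and deep being mine:

```python
import copy

list1 = [1, 2, 3, [8, 9]]
shallow = list1[:]            # copies the outer list only
deep = copy.deepcopy(list1)   # copies the nested list as well

list1[3][0] = 7

print(shallow[3])  # [7, 9]: the inner list is shared with list1
print(deep[3])     # [8, 9]: the deep copy is unaffected
```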

Out of Order

In Python versions before 3.7, unless you are using collections.OrderedDict, the order of a dict’s keys and values cannot be relied on (from 3.7 onwards, plain dicts are guaranteed to preserve insertion order). This has to do with how Python’s dicts are stored in memory. Also, dict equality is determined on the basis of key-value pairs, and not their order in the dict. Take a look at the example below: the output of its first print statement is implementation dependent. Finally, in those older versions, adding new items to a dict is likely to reorder the keys. Python’s sets also do not guarantee that a particular order will be maintained. There is no “orderedset” in the standard library, but if you need one, you can find a PyPI package (e.g. orderedset).

import copy

FRUIT_CODES = [
    ("orange", 1),
    ("apple", 45),
    ("banana", 70),
    ("grapes", 81),
    ("pineapple", 86),
    ("kiwi", 52),
    ("papaya", 413),
    ("mango", 55),
    ("lemon", 62),
    ("nectarine", 910)
]

orig = copy.copy(FRUIT_CODES)
sorted1 = sorted(FRUIT_CODES, key=lambda x:x[0])
sorted2 = sorted(FRUIT_CODES, key=lambda x:x[1])

fruit_dict = dict(FRUIT_CODES)
fruit_sorted_dict1 = dict(sorted1)
fruit_sorted_dict2 = dict(sorted2)

print fruit_dict.keys() == fruit_sorted_dict1.keys() and fruit_sorted_dict1.keys() == fruit_sorted_dict2.keys() # prints False or True (implementation dependent)
print fruit_dict == fruit_sorted_dict1 and fruit_sorted_dict1 == fruit_sorted_dict2 # prints True
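On Python 3.7 and later the picture is a little different, because plain dicts keep insertion order. The sketch below assumes such an interpreter and uses a shortened fruit list of my own. It shows that dict equality still ignores order, while OrderedDict equality does not.

```python
from collections import OrderedDict

pairs = [("orange", 1), ("apple", 45), ("banana", 70)]
by_name = sorted(pairs)              # apple, banana, orange

d1 = dict(pairs)
d2 = dict(by_name)

print(list(d1))                      # ['orange', 'apple', 'banana']
print(list(d2))                      # ['apple', 'banana', 'orange']
print(d1 == d2)                      # True: same key-value pairs
print(OrderedDict(pairs) == OrderedDict(by_name))  # False: order matters here
```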

We are all Sharing

In Python, mutable types are passed to functions by sharing. This means that a function/method can modify the object it receives, but it cannot replace it with another object in the caller’s scope. Here is a typical “gotcha” with functions being able to modify their parameters:

def plusone(my_list):
    my_list.append(1)  # can modify
    

def newlife(my_list, your_list):
    my_list=your_list  # cannot replace with a new object

first_list = [2, 3, 4]
plusone(first_list)
print(first_list) # prints [2, 3, 4, 1]

second_list = [5, 6, 7]
newlife(first_list, second_list)
print(first_list) # prints [2, 3, 4, 1]
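If the intention really is to make the caller's list take on new contents, the function has to mutate the object it was given instead of rebinding its local name. A sketch of the difference, where transplant is a hypothetical helper of mine:

```python
def newlife(my_list, your_list):
    my_list = your_list       # rebinds the local name only


def transplant(my_list, your_list):
    my_list[:] = your_list    # replaces the contents of the shared object


first_list = [2, 3, 4]
second_list = [5, 6, 7]
first_id = id(first_list)

newlife(first_list, second_list)
print(first_list)                    # [2, 3, 4]

transplant(first_list, second_list)
print(first_list)                    # [5, 6, 7]
print(id(first_list) == first_id)    # True: still the same object
print(first_list is second_list)     # False: contents copied, not aliased
```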

This should give you enough “food for thought”. Happy programming everyone! 🙂

samplepy – a new Python Sampling Package

Hello my blog readers,

This post is to introduce a new Python package samplepy. This package was written to simplify sampling tasks that so often crop up in machine learning. The package implements Importance, Rejection and Metropolis-Hastings sampling algorithms.

samplepy has a very simple API. The package can be installed with pip by simply running pip install samplepy. Once installed, you can use it to sample from any univariate distribution as follows (showing rejection sampling use):

 

from samplepy import Rejection
import matplotlib.pyplot as plt
import numpy as np

# define a unimodal function to sample under
f = lambda x: 2.0*np.exp(-2.0*x)
# instantiate Rejection sampling with f and required interval
rej = Rejection(f, [0.01, 3.0])
# create a sample of 10K points
sample = rej.sample(10000, 1)  

# plot the original function and the created sample set
x = np.arange(0.01, 3.0, (3.0-0.01)/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, density=True, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="f(x)=2.0*exp(-2*x)")
plt.legend(loc=1)
plt.show()

 

Sample from f(x)=2.0*exp(-2*x) over [0.01, 3.0]
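For readers who want to see the idea behind rejection sampling itself, here is a minimal standard-library sketch. It is not how samplepy is implemented: the function rejection_sample, the flat proposal, the grid-based envelope and the 5% padding are all assumptions of mine.

```python
import math
import random


def rejection_sample(f, interval, n, rng, grid=1000):
    """Draw n points from the density proportional to f on the interval."""
    lo, hi = interval
    # Envelope height: the maximum of f on a grid, padded by 5%.
    # This is an assumption and can under-estimate a very spiky f.
    step = (hi - lo) / grid
    envelope = 1.05 * max(f(lo + i * step) for i in range(grid + 1))
    sample = []
    while len(sample) < n:
        x = rng.uniform(lo, hi)          # candidate from a flat proposal
        u = rng.uniform(0.0, envelope)   # random height under the envelope
        if u <= f(x):                    # keep candidates that fall under f
            sample.append(x)
    return sample


f = lambda x: 2.0 * math.exp(-2.0 * x)
rng = random.Random(0)                   # seeded for repeatability
sample = rejection_sample(f, (0.01, 3.0), 5000, rng)

print(len(sample))                       # 5000
print(sum(sample) / len(sample))         # roughly 0.5 for this density
```

Most candidates are thrown away here because a flat proposal is a poor match for an exponential shape, which is the usual price of this method.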

The three sampling methods (i.e. Rejection, Importance and MH) are quite different and will produce slightly different results for the same function. Performance is another important differentiating factor, with Metropolis-Hastings probably being the slowest. Let’s compare how the three sampling algorithms deliver on a bi-modal univariate function:

f(x)=\exp(-x^{2})\,(2+\sin(5x)+\sin(2x))

 

from samplepy import Rejection, Importance, MH
import matplotlib.pyplot as plt
import numpy as np


f = lambda x: np.exp(-1.0*x**2)*(2.0+np.sin(5.0*x)+np.sin(2.0*x))
interval = [-3.0, 3.0]
rej = Rejection(f, interval)  # instantiate Rejection sampling with f and interval
sample = rej.sample(10000, 1)    # create a sample of 10K points

x = np.arange(interval[0], interval[1], (interval[1]-interval[0])/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, density=True, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Rejection")
plt.legend(loc=1)
plt.show()

mh = MH(f,interval)
sample = mh.sample(20000, 100, 1)  # Make sure we have enough points in the sample!

figure, axis = plt.subplots()
axis.hist(sample, density=True, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="MH")
plt.legend(loc=1)
plt.show()

imp = Importance(f, interval)
sample = imp.sample(10000, 0.0001, 0.0010) # create a sample where essentially no extra importance is given to any quantile

figure, axis = plt.subplots()
axis.hist(sample, density=True, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Importance")
plt.legend(loc=1)
plt.show()

[Figures: histograms of the Rejection, MH and Importance samples, each overlaid with f(x)]
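To give a feel for what a Metropolis-Hastings sampler does under the hood, here is a small random-walk sketch using only the standard library. It is an illustration under my own assumptions (Gaussian proposal, the step and burn_in values, rejecting candidates outside the interval, f positive on the interval) and not a description of samplepy's MH class.

```python
import math
import random


def metropolis_hastings(f, interval, n, rng, burn_in=500, step=0.5):
    """Random-walk Metropolis-Hastings for a positive f on the interval."""
    lo, hi = interval
    x = rng.uniform(lo, hi)              # arbitrary starting point
    chain = []
    for i in range(burn_in + n):
        candidate = x + rng.gauss(0.0, step)
        # The proposal is symmetric, so the acceptance ratio is
        # f(candidate) / f(x); candidates outside the interval are rejected.
        if lo <= candidate <= hi and rng.random() < f(candidate) / f(x):
            x = candidate
        if i >= burn_in:                 # discard the warm-up part of the chain
            chain.append(x)
    return chain


f = lambda x: math.exp(-1.0 * x**2) * (2.0 + math.sin(5.0 * x) + math.sin(2.0 * x))
chain = metropolis_hastings(f, (-3.0, 3.0), 20000, random.Random(1))

print(len(chain))                        # 20000
print(min(chain) >= -3.0 and max(chain) <= 3.0)  # True
```

Consecutive points in the chain are correlated, which is one reason why a larger sample is needed than with rejection sampling.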

Hopefully this gives you enough examples to get you started using samplepy!

A Tutorial on Hidden Markov Model with a Stock Price Example – Part 2

This is the 2nd part of the tutorial on Hidden Markov models. In this post we will look at a possible implementation of the described algorithms and estimate model performance on Yahoo stock price time-series.

Implementation of HMM in Python

I am providing an example implementation on my GitHub space. Please note that all code is provided with a disclaimer that you are free to use it at your own risk. HMM.py contains the main implementation. There are a few things to point out in this file:

  1. As is pointed out in the referenced document in part 1, \alpha and \beta need to be scaled for longer observation sequences, since the product of probabilities quickly tends to zero, resulting in underflow. The code has a flag for scaling, which is on by default.
  2. While testing out this model I noticed that the initial assignment of \hat{a_{ij}} and \hat{b_{j}}(k) makes a big difference to the final solution. I am providing two possible assignments. One, as described in part 1, sets them to approximately equal values that add up to 1. The other uses a Dirichlet distribution to create a non-uniform assignment that adds up to 1. I’ve noticed that the former results in more meaningful model parameters.
  3. You will notice a small hack in the HMMBaumWelch method where I am setting M to the maximum observation in O. This is needed because not all observation sequences will contain all values (i.e. 0, 1, 2), while the transition and emission matrices are accessed as if all the values exist.
  4. I am using the yahoo_finance Python module to source the stock prices for Yahoo. It seems to work and is pretty easy to use. You should be able to download this module here.
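The scaling mentioned in point 1 can be sketched in a few lines of plain Python. This is not the code from HMM.py: the function forward_loglik and the two-state model are hypothetical, and only the forward variable \alpha is shown.

```python
import math


def forward_loglik(a, b, pi, obs):
    """Scaled forward pass; returns log P(obs | model)."""
    n = len(pi)
    # initialise alpha with the first observation
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # propagate through the transition matrix, then emit obs[t]
            alpha = [
                sum(alpha[i] * a[i][j] for i in range(n)) * b[j][obs[t]]
                for j in range(n)
            ]
        c = sum(alpha)                   # scaling constant for step t
        alpha = [v / c for v in alpha]   # rescale so that alpha sums to 1
        loglik += math.log(c)            # the log-likelihood is the sum of log c
    return loglik


# hypothetical two-state model with three observation symbols (0, 1, 2)
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2, 2, 1] * 400              # 2000 observations

print(forward_loglik(a, b, pi, obs))     # a finite negative number
```

Without the rescaling step, the raw product for a sequence of this length would underflow to zero in double precision, so its logarithm could not be taken.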

HMM Model performance to predict Yahoo stock price move

On my github space, HMM_test.py contains example test code. I am testing the model as follows: train the model on a specified window of daily historical moves (e.g. 10 days) and, using the model parameters, determine the predicted current model state. Then, using the predicted state, determine the next likely state from \hat{a}:

prediction_state = np.argmax(a[path[-1],:])

The most likely emitted value from that state can be found as follows:

prediction = np.argmax(b[prediction_state,:])
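Put together, the two lookups amount to the following sketch. It uses plain lists instead of NumPy arrays, and the matrices a and b and the decoded path are made-up values for illustration only.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


# hypothetical 3-state model: a is the transition matrix, b the emission matrix
a = [[0.6, 0.3, 0.1],
     [0.5, 0.3, 0.2],
     [0.2, 0.3, 0.5]]
b = [[0.1, 0.3, 0.6],
     [0.2, 0.6, 0.2],
     [0.6, 0.3, 0.1]]
path = [0, 0, 2, 1]                        # hypothetical decoded state sequence

prediction_state = argmax(a[path[-1]])     # most likely next state
prediction = argmax(b[prediction_state])   # most likely symbol from that state

print(prediction_state, prediction)        # 0 2
```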

I then compare the predicted move to the historical one. So, how did HMM do? Well, not so well. The model parameters \lambda=(A,B, \pi) are very sensitive to the convergence tolerance, the initial assignment and, obviously, the training time-window. The accuracy ratio, calculated as the fraction of correctly predicted directions, comes out at around 50%-56% (tests on more recent data produce the higher accuracy in this range). Thus, we might as well be tossing a coin to make buy or sell predictions. It is possible that EOD price moves are not granular enough to provide a coherent market dynamic model.