Calling Python Functions from Excel – Overview of Addins

In this blog I will compare three utilities that connect Python code with Excel: each exposes Python functions as Excel functions and opens a data-exchange channel between Excel and Python. Why Excel, you may ask? Excel is a great interface choice when you need:

1. A simple Windows interface for storing and manipulating data
2. An easy-to-put-together proof of concept for your machine learning model, where the end user works with batch or stochastic model output
3. A low maintenance interface for a permanent solution in an environment where server-run APIs are too costly to build and maintain
4. Linking legacy Excel-based analytics with Python and its many data and modelling libraries (pandas, numpy, scipy, scikit-learn, etc.)

TL;DR – see the Summary table with main features compared.

PyXLL

PyXLL is an established product that has been developed and maintained by Tony Roberts since 2010, and it is currently on its 5th major version.

Ease of Use

It is straightforward to download and install the PyXLL addin, with minimal changes to its main config file to tell it where Python is, as well as where the Python modules are.

Functions can be defined in Python as usual, with a few additional changes to make them PyXLL compatible by adding a decorator with explicit types in the signature.

PyXLL supports all standard Python data types, as well as pandas dataframes and numpy arrays. For the last two you do need to find out what these types are called in PyXLL (‘dataframe’ and ‘numpy_array’ respectively), which is a small hurdle, since knowing these names is required to properly define the decorators.

Portability

Can you integrate existing Python code with Excel using PyXLL? Yes, you can. Can you do it without having to change your code? No. To expose existing Python code you need to import the pyxll library and decorate each function with the @xl_func decorator. Additional specification is needed to indicate input and output data types for arrays and dataframes, which can appear a bit non-Pythonic (e.g. a variable holding a 2-d array of strings is declared as x: string[][]).
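
To make the decorator requirement concrete, here is a minimal sketch. Hedged: the exact signature-string syntax should be checked against the PyXLL documentation, and the ImportError fallback is my addition so the module can also run outside Excel, where pyxll is not importable.

```python
# Sketch of exposing an existing function to Excel with PyXLL's @xl_func.
# The pyxll module is only available inside Excel, so for standalone runs
# we fall back to a no-op decorator (a common pattern for testing PyXLL code).
try:
    from pyxll import xl_func
except ImportError:
    def xl_func(signature):
        def decorator(func):
            return func
        return decorator

@xl_func("float[] xs: float")
def range_mean(xs):
    """Average the values of an Excel range."""
    return sum(xs) / len(xs)
```

Array and dataframe arguments are declared the same way, using the ‘dataframe’ and ‘numpy_array’ type names mentioned above.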

Object Caching

PyXLL supports object caching in Excel, which is handy for passing objects like class instances or large pandas dataframes to Excel. Through its config it is possible to control how cached objects are recalculated when a workbook is re-opened. Cached objects can also be serialised and saved as part of the Excel workbook metadata. I am not sure this is a hugely useful feature, since very large objects can always be stored in an in-memory database like SQLite.

Support for Python Async and Real Time Data

PyXLL supports asynchronous functions and Real Time Data streaming. There is an example of an async RTD class being used on the PyXLL documentation site. Note that you need to import PyXLL’s RTD class and inherit from it to ‘switch on’ this functionality. The rest looks like a standard Excel RTD interface, i.e. the need to define connect() and disconnect() methods.

RTD is useful when developing stochastic machine learning models, and async can be useful when working within the reinforcement learning framework. Note that PyXLL also supports Excel async functions.

Logging and Debugging

While testing PyXLL I checked the contents of its logs and found them easy to read. PyXLL lets you customise log formatting, verbosity and location via its config file, as well as the maximum size the logs can grow to. All of which is straightforward and simple. I also noticed that my own code logs and the addin logs go to the same log file.

Support for Jupyter Notebooks and Plotting

Since 2020 PyXLL has supported two cool features: integrating Jupyter notebooks and embedding Python-generated plots in Excel. Both are useful if you are looking for a seamless merge between a flexible coding environment like Jupyter and Excel. However, if your end user is not a Python developer, they will not add much value.

Installation and Configuration

You need to download and install the version of the addin that matches your local Python version. And yes, you need to have Python installed locally, as PyXLL does not come with one. After downloading the addin, installation is quite simple with pip and the command line tool. Once installed, you edit a config file to tell PyXLL where the Python executable is and where the modules to be exposed to Excel live. Most PyXLL configuration is done via this single config file.

Documentation and Support

PyXLL is a mature addin and there is a great deal of documentation and examples on its website. There is also a YouTube channel with a few interesting PyXLL usage videos (a chat bot example). There is 24/7 support via email for all users.

Licencing and Pricing

xlwings

xlwings has been around since 2014 and is part of a wider ecosystem of tools built around Excel, developed and maintained by two ex-investment bankers, Felix Zumstein and Björn Stiel. The main library is part of the Anaconda Python distribution, and can also be installed with pip. Advanced features of xlwings PRO are available on a paid subscription basis if used for commercial purposes. This review focuses on using the main library to write Excel UDFs in Python.

Ease of Use

Since xlwings is part of the Anaconda distribution, I did not have to install additional software or libraries. Installing the addin is easy on the command line.

Once that was done, testing a few simple UDFs in Excel took a surprisingly long time. Firstly, configuring xlwings was not straightforward, and the documentation does not cover all possible pitfalls. In my case I was using a virtual environment, and xlwings could not find numpy or pandas after I set the interpreter path to the virtual env (as suggested in the xlwings Troubleshooting section). Then, my anti-virus software would shut down Excel when I tried to register the UDF, which I resolved by manually telling it to ignore Excel activity. Finally, after working with a third version of a recovered Excel file, in which the VBA reference to xlwings had been lost (having been set at the start, as helpfully mentioned in the installation steps), all tested UDFs returned only ‘Object required’. This was resolved by re-adding the VBA reference. So, the start-up time was considerably longer than with PyXLL or xlSlim. However, once I got it to work, it worked smoothly. xlwings supports all standard Python data types, as well as pandas dataframes, series and numpy arrays.

Portability

At the time of writing, xlwings requires at least Python 3.7. You do need to import the xlwings library and add decorators to existing functions to turn them into Excel UDFs. The decorators are not tricky in themselves, but they do affect the speed at which one can port an existing codebase. So, as with PyXLL, one cannot turn existing modules into Excel UDFs without any code changes.
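
As a minimal sketch of what the decoration looks like (the try/except fallback is my addition so the snippet also runs on a machine without xlwings or Excel):

```python
# Sketch of an xlwings UDF: @xw.func marks a Python function so the addin
# can register it in Excel.
try:
    import xlwings as xw
    udf = xw.func
except Exception:  # xlwings missing, or no Excel installation available
    def udf(f):
        return f

@udf
def double_sum(x, y):
    """Return twice the sum of two cell values."""
    return 2.0 * (x + y)
```

Inside Excel, the function becomes callable after importing it via the xlwings ribbon.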

Object Caching

Tested xlwings version 0.27.14 does not support object caching.

Support for Python Async and Real Time Data

xlwings provides support for offloading Python execution to an async thread, which is useful for long-running processes. However, there is no support for RTD in the tested version.

Logging and Debugging

xlwings’ default behaviour is to send standard output and error to a console. This means that as you interact with your UDFs in Excel, a console window periodically pops up to inform you about what is happening (e.g. that the xlwings server is running on an event loop, etc.). COM and other internal errors also appear in the console, while Python errors from user code are shown in a pop-up message window.

Adding logging to your Python code will send user code logs and xlwings logs to one log file.

Support for Jupyter Notebooks and Plotting

xlwings lets users interact with Excel from Jupyter notebooks by providing view and load functionality. This is useful since it can speed up exploratory data analysis and simplify code.

Installation and Configuration

xlwings comes with the Anaconda distribution and can also be installed via pip for Python 3.7+. Adding the addin to Excel is achieved by running an installation command. However, users may still run into configuration or security problems, as I did. I found that the xlwings GitHub issues are a great source of information on how to resolve these.

Excel workbook settings can be controlled through xlwings.conf, directly through the Excel ribbon, or by adding a workbook sheet named xlwings.conf.

Documentation and Support

xlwings is mature and goes back to 2014. Over the years it has built up a loyal user base, and being an open-source library, has seen contributions from other developers. It comes with a great set of examples on its main website, fully documented API reference, a YouTube channel and even an O’Reilly published book where several chapters are dedicated to the addin.

Its professional version gets dedicated support as well as access to a video training course.

Licencing and Pricing

The library is free and open source. Its PRO version, which among other things provides Excel-embedded code and a custom installer, starts at \$590 per user per year for commercial use.

Summary

Disclaimer: I have helped to test some of xlSlim’s functionality and have made small contributions to its documentation and examples.

Don’t Get in a Pickle with a Python namedtuple

In this blog I will show you what happens when you want to pickle an object that contains a Python namedtuple.

Python’s namedtuple is a high-performance data type that lets us define a custom type which behaves like a tuple. For example, the following piece of code defines a new type Viewer, creates an instance of it and initialises its attributes:

from collections import namedtuple

Viewer = namedtuple('Viewer', 'gender age points')
viewer = Viewer('X', 25, 356)


In the above, line 3 defines a new type Viewer, and line 4 defines and initialises a new variable viewer of type Viewer. viewer behaves like a tuple in the sense that it has the built-in methods count() and index() and allows access to attributes via indexing or attribute names. For example:

print(viewer[2])         # prints 356
print(viewer.age)        # prints 25
print(viewer.count('X')) # prints 1


Note that unlike with a list or a dict, to work with namedtuples we need to perform two operations: (1) define the new type, (2) create a new instance of it. The same two steps are followed when we work with classes, and a namedtuple is just a dynamically named class type. But how exactly does this dynamic part work? When we define the new type (line 3 in the first code snippet), we are actually calling a factory function called namedtuple that does the dynamic ‘stuff’ for us (i.e. it returns a subclass of tuple that is named whatever we specify in the function call).
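
We can peek at the type the factory returns to make this concrete:

```python
from collections import namedtuple

Viewer = namedtuple('Viewer', 'gender age points')

# the factory returned a brand-new subclass of tuple, named as we specified
print(Viewer.__bases__)  # (<class 'tuple'>,)
print(Viewer.__name__)   # Viewer
print(Viewer._fields)    # ('gender', 'age', 'points')
```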

Let’s see what happens when we create a class with a namedtuple member.

import pickle
from collections import namedtuple
import datetime as dt


class ViewerClass(object):

    # class-level type definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

    def get_start_date(self):
        return self._my_vt.start_date

    def get_monthly_views(self):
        return self._my_vt.mon_views

    def get_monthly_streams(self):
        return self._my_vt.mon_streams

    def get_registration_details(self):
        return (
            'Name:'
            + self._my_vt.name
            + ' DOB:'
            + str(self._my_vt.dob)
        )

    def update_monthly_stream(self, new_mon_streams):
        # namedtuples are immutable, so build an updated copy with _replace
        self._my_vt = self._my_vt._replace(mon_streams=new_mon_streams)

    def update_monthly_views(self, new_mon_views):
        self._my_vt = self._my_vt._replace(mon_views=new_mon_views)


if __name__ == '__main__':

    viewer1 = ViewerClass(
        dt.date(2019, 1, 1),
        5,
        6234.80,
        'John',
        dt.date(1989, 12, 3),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer1.get_registration_details(),
            viewer1.get_monthly_streams(),
        )
    )

    viewer2 = ViewerClass(
        dt.date(2019, 2, 1),
        5,
        5234.80,
        'Mary',
        dt.date(1989, 11, 11),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer2.get_registration_details(),
            viewer2.get_monthly_streams(),
        )
    )

    print(type(viewer1))
    print(type(viewer1._my_vt))


The output of the two type() calls points to a potential problem that can occur if we try to pickle the viewer objects:

<class '__main__.ViewerClass'>
<class '__main__.vt'>

It turns out that the protected variable is of type ‘__main__.vt’ and not ‘__main__.ViewerClass.vt’. And if we try to pickle viewer1 we are going to get this error:

_pickle.PicklingError: Can’t pickle <class ‘__main__.vt’>: attribute lookup vt on __main__ failed

This error makes sense because vt is not defined at the top level of __main__ but inside __main__.ViewerClass, so pickle cannot look it up at the location its name points to.

There are several ways to fix this.

First, we can move the definition of vt out of ViewerClass to the module level. This lets pickle find vt at the level it is looking for it:

# module-level type definition
vt = namedtuple(
    'vt', 'start_date mon_views mon_streams name dob'
)


class ViewerClass(object):
    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = vt(
            start_date, mon_views, mon_streams, name, dob
        )

...
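
A quick round trip confirms the module-level variant pickles cleanly (a minimal check, with the dates replaced by plain strings for brevity):

```python
import pickle
from collections import namedtuple

# module-level type definition, where pickle can find it
vt = namedtuple('vt', 'start_date mon_views mon_streams name dob')


class ViewerClass(object):
    def __init__(self, start_date, mon_views, mon_streams, name, dob):
        self._my_vt = vt(start_date, mon_views, mon_streams, name, dob)


viewer = ViewerClass('2019-01-01', 5, 6234.80, 'John', '1989-12-03')
restored = pickle.loads(pickle.dumps(viewer))  # no PicklingError this time
print(restored._my_vt.name)
```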


The second solution involves overriding the class’s __qualname__ attribute so that it matches the nested location of the type:

import pickle
from collections import namedtuple
import datetime as dt

class ViewerClass(object):

    # class-level definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )
    vt.__qualname__ = 'ViewerClass.vt'

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

...


This fixes the issue and makes viewer1._my_vt of type ‘__main__.ViewerClass.vt’, under which pickle can look it up.

I must say that I prefer the first solution: the hard-coded qualified name can become a problem when sub-classing ViewerClass, and in general we should avoid fiddling with double-underscore attributes.

Approximate Bayesian Computation

This is my first post in 2018. In this post I will share with you a very simple way of performing inference using Approximate Bayesian Computation (ABC) – not to be confused with Approximate Bootstrap Confidence interval, which is also “ABC”.

Let’s say we have observed some data, and are interested to test if there was a change in behaviour in whatever generated the data. For example, we could be monitoring the total amount that is spent/transferred from some account, and we would like to see if there was a shift in how much is being spent/transferred. Figure below shows what the data could look like. After we have eye-balled the graph, we think that all observations after item 43 belong to the changed behaviour (cutoff=43), and we separate the two by colour.

The first question that we can ask is about the means of the blue and the red regions: are they the same? In the figure above I am showing the mean and standard deviations for the two sets. We can run a basic bootstrap with replacement to check if the difference in the means is possibly accidental.

In the figure above, basic_bootstrap generates a distribution of means of randomly resampled sets. The confidence interval is first computed non-parametrically. A quick comparison with the 95% CI built from standard normal scores shows that the simulated and the non-simulated confidence intervals around the means are very close. Most importantly, the confidence intervals for the blue and red region means overlap, so we fail to reject the null hypothesis that the population means are the same and the differences seen here are accidental.
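
The original figure and listing are not reproduced here; a minimal sketch of such a bootstrap, using synthetic stand-ins for the blue and red series (an assumption, since the real data is not shown), could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic stand-ins for the two regions (the original series is not shown)
blue = rng.normal(15.0, 8.0, 43)
red = rng.normal(22.0, 8.0, 30)


def basic_bootstrap(data, n_boot=5000):
    """Resample the data with replacement and return the
    non-parametric 95% confidence interval of the mean."""
    means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(means, [2.5, 97.5])


ci_blue = basic_bootstrap(blue)
ci_red = basic_bootstrap(red)
# if the intervals overlap, we cannot reject equality of the two means
overlap = (ci_blue[1] >= ci_red[0]) and (ci_red[1] >= ci_blue[0])
```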

Note how unsatisfying this result is. If we use some other test, like the one-way ANOVA from scipy.stats.f_oneway, we get a p-value that is too high to reject the null hypothesis. However, if we plot the CDFs of the blue and the red data, we can clearly see that larger values prevail in the latter:

Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) belongs to the family of probabilistic programming methods and allows us to quantify uncertainty more precisely than a simple CI. A pretty good summary of ABC can be found on Wikipedia. If we are monitoring transactions occurring over time, we may be interested in generating alerts when an amount is above a threshold (for example, your bank could have a monitoring system in place to safeguard you against credit card fraud). If, instead of comparing the means of the red and blue regions, we decide to ask how likely we are to see more trades above the threshold in the red vs. the blue data region, we can use ABC.

To execute an ABC test on the difference in the number of trades above a threshold in the blue and red data regions, we begin by choosing the threshold! Take a look at the CDF plot above: approximately half of the red data is above 20, whereas only 25% of the blue data is. Let’s set our threshold at 20. ABC is a simple simulation algorithm where we repeatedly perform sample and compare steps. What can we sample here? We sample from two Normal distributions, each with its mean set to the observed fraction of trades above the threshold (the Normal distribution is purely a choice of convenience). And what can we compare? We compare the fraction of trades that would have been above the threshold, had the data been generated under the sampled parameter, against the observed fraction, and we store away the parameters that are consistent with it. Repeating this many times under the two parameterisations builds up two distributions that can answer the main question: how likely are we to obtain more trades above the chosen threshold in the red vs. the blue data set. The code below does exactly that.
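
The original listing has not survived in this copy; a minimal rejection-ABC sketch of the sample-and-compare loop, again with synthetic stand-ins for the two regions (an assumption, since the real data is not shown), might look as follows:

```python
import numpy as np

rng = np.random.default_rng(7)

# synthetic stand-ins for the blue and red regions (original data not shown)
blue = rng.normal(15.0, 8.0, 43)
red = rng.normal(22.0, 8.0, 30)

threshold = 20.0
p_blue = np.mean(blue > threshold)  # observed fraction above the threshold
p_red = np.mean(red > threshold)


def abc_posterior(p_obs, n_obs, n_keep=2000, tol=0.05, prior_sd=0.1):
    """Rejection ABC: draw a candidate exceedance probability from a Normal
    prior centred on the observed fraction, simulate a data set of the same
    size under it, and keep the candidate if the simulated fraction of
    exceedances is close enough to the observed one."""
    kept = []
    while len(kept) < n_keep:
        candidate = rng.normal(p_obs, prior_sd)        # sample step
        if not 0.0 <= candidate <= 1.0:
            continue
        simulated = rng.binomial(n_obs, candidate) / n_obs
        if abs(simulated - p_obs) <= tol:              # compare step
            kept.append(candidate)
    return np.array(kept)


post_blue = abc_posterior(p_blue, len(blue))
post_red = abc_posterior(p_red, len(red))

# probability of seeing more trades above the threshold in red vs. blue
prob_red_gt_blue = np.mean(post_red > post_blue)
print(prob_red_gt_blue)
```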

We obtain a very high probability of seeing more trades above the threshold in the red vs. the blue region.

Python gotchas

Here is the thing – I am a big fan of Python programming language. Now that there is an Intel distribution of Python, I don’t think I ever want to write in any other language again…

Having said that, Python has its moments. Most of the examples below are based on the Fluent Python book by Luciano Ramalho. I highly recommend it to all Python programmers.

Here are some “gotchas” I am talking about:

***********************************
*   Leaking Variables
*   Times what?
*   An Inside Job
*   Deeply Shallow
*   Out of Order
*   We are all Sharing
***********************************

Leaking Variables

In Python 2.x, variables created inside a list comprehension leak into the enclosing scope, offering a nasty surprise.

x = "I need you later"
ctoten = [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
abs_ctoten = [abs(x) for x in ctoten]
print("Oh no! ", x) # prints x to be -10


Note that this problem does not exist in generator expressions (aka. genexps):

import array

y = "abcde"
w = "see you later"
upper_y = array.array('c', (str.upper(w) for w in y))  # 'c' typecode is Python 2 only
print ("Still here: ", w) # prints w to be "see you later"



Times what?

Let’s say I need a string of 20 a’s. I can simply create it like this:

twenty_as = "a"*20


Great. I now need a list of three lists. I proceed to create it with * and end up with another surprise!

abc_list = [['a', 'b', 'c']]*3
print abc_list
abc_list[1][1] = 'x'
print abc_list  # prints [['a', 'x', 'c'], ['a', 'x', 'c'], ['a', 'x', 'c']]



This happens because abc_list is made of three references to the same [‘a’, ‘b’, ‘c’] list. The solution is to ensure that each inner list is a separate, new copy:

abc_list = [['a', 'b','c'] for i in range(3)]


An Inside Job

Tuples are immutable, and one can take advantage of this when immutability is required. However, if you put a mutable object inside a tuple, keep in mind that it can still be changed.

imm = (1, 2)
imm[0] += 1   # throws a TypeError
imm2 = (1, 2, [3, 4])
imm2[2] += [10]  # modifies the inner list AND still throws a TypeError


Deeply Shallow

You did not think I was going to write a post on Python’s dark corners without touching on deep copying, did you?
Here is a nice little trick for creating a shallow copy with the slicing operator. It works the first time, but fails the second time, when we need a deep copy instead.

list1 = [1,2,3]
list2 = list1[:] # shallow copy
list2[2] = 5

print ([(l, k) for l, k in zip(list1, list2)]) # all good

list1 = [1, 2, 3, [8,9]]
list2=list1[:]  # shallow copy again
list2[3][0] = 7

print ([(l, k) for l, k in zip(list1, list2)]) # shows that both are modified
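
The fix is copy.deepcopy, which copies nested objects as well:

```python
import copy

list1 = [1, 2, 3, [8, 9]]
list2 = copy.deepcopy(list1)  # nested lists are copied too
list2[3][0] = 7

print(list1[3])  # [8, 9] -> the original inner list is untouched
print(list2[3])  # [7, 9]
```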


Out of Order

Unless you are using collections.OrderedDict, the order of a Python dict’s keys and values cannot be relied on. This has to do with how Python’s dicts are stored in memory. Also, dict equality is determined on the basis of key-value pairs, and not their order in the dict. Take a look at the example below: the output of this code is implementation dependent. Finally, adding new items to a dict is likely to reorder the keys. (Since Python 3.7, dicts are guaranteed to preserve insertion order, but that was not the case when this was written.) Python’s sets also do not guarantee that a particular order will be maintained. There is no “orderedset” in the standard library, but if you need one, you can find a PyPI package (e.g. orderedset).

import copy

FRUIT_CODES = [
("orange", 1),
("apple", 45),
("banana", 70),
("grapes", 81),
("pineapple", 86),
("kiwi", 52),
("papaya", 413),
("mango", 55),
("lemon", 62),
("nectarine", 910)
]

orig = copy.copy(FRUIT_CODES)
sorted1 = sorted(FRUIT_CODES, key=lambda x:x[0])
sorted2 = sorted(FRUIT_CODES, key=lambda x:x[1])

fruit_dict = dict(FRUIT_CODES)
fruit_sorted_dict1 = dict(sorted1)
fruit_sorted_dict2 = dict(sorted2)

print fruit_dict.keys() == fruit_sorted_dict1.keys() and fruit_sorted_dict1.keys() == fruit_sorted_dict2.keys() # prints False or True (implementation dependent)
print fruit_dict == fruit_sorted_dict1 and fruit_sorted_dict1 == fruit_sorted_dict2 # prints True



We are all Sharing

In Python, mutable objects are passed to functions by sharing. This means that a function/method can modify the parameter, but it cannot rebind the caller’s name to another object. Here is a typical “gotcha” with functions being able to modify their parameters:

def plusone(my_list):
    my_list.append(1)  # can modify

def newlife(my_list, your_list):
    my_list = your_list  # cannot replace with a new object

first_list = [2, 3, 4]
plusone(first_list)
print first_list # prints [2, 3, 4, 1]

second_list = [5, 6, 7]
newlife(first_list, second_list)
print first_list # prints [2, 3, 4, 1]


This should give you enough “food for thought”. Happy programming everyone! 🙂

samplepy – a new Python Sampling Package

This post introduces a new Python package, samplepy. It was written to simplify the sampling tasks that so often creep up in machine learning. The package implements the Importance, Rejection and Metropolis-Hastings sampling algorithms.

samplepy has a very simple API. The package can be installed by running pip install samplepy. Once installed, you can use it to sample from any univariate distribution over an interval, as follows (showing rejection sampling):

from samplepy import Rejection
import matplotlib.pyplot as plt
import numpy as np

# define a unimodal function to sample under
f = lambda x: 2.0*np.exp(-2.0*x)
# instantiate Rejection sampling with f and required interval
rej = Rejection(f, [0.01, 3.0])
# create a sample of 10K points
sample = rej.sample(10000, 1)

# plot the original function and the created sample set
x = np.arange(0.01, 3.0, (3.0-0.01)/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="f(x)=2.0*exp(-2*x)")
plt.legend(loc=1)
plt.show()


The three sampling methods (i.e. Rejection, Importance and MH) are quite different and will achieve slightly different results for the same function. Performance is another important differentiating factor, with Metropolis-Hastings probably being the slowest. Let’s compare how the three sampling algorithms deliver on a bi-modal univariate function:

$f(x)=e^{-x^{2}}\,(2+\sin(5x)+\sin(2x))$

from samplepy import Rejection, Importance, MH
import matplotlib.pyplot as plt
import numpy as np

f = lambda x: np.exp(-1.0*x**2)*(2.0+np.sin(5.0*x)+np.sin(2.0*x))
interval = [-3.0, 3.0]
rej = Rejection(f, interval)  # instantiate Rejection sampling with f and interval
sample = rej.sample(10000, 1)    # create a sample of 10K points

x = np.arange(interval[0], interval[1], (interval[1]-interval[0])/10000)
fx = f(x)

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Rejection")
plt.legend(loc=1)
plt.show()

mh = MH(f,interval)
sample = mh.sample(20000, 100, 1)  # Make sure we have enough points in the sample!

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="MH")
plt.legend(loc=1)
plt.show()

imp = Importance(f, interval)
sample = imp.sample(10000, 0.0001, 0.0010) # create a sample where essentially no extra importance is given to any quantile

figure, axis = plt.subplots()
axis.hist(sample, normed=1, bins=40)
axis2 = axis.twinx()
axis2.plot(x, fx, 'g', label="Importance")
plt.legend(loc=1)
plt.show()



Hopefully this gives you enough examples to get you started using samplepy!