# Calling Python Functions from Excel – Overview of Addins

In this blog I will compare three utilities that allow connecting Python code with Excel, exposing Python functions as Excel functions, and opening a data exchange channel between Excel and Python. Why Excel you may ask? Excel is a great interface choice when you need:

1. A simple Windows interface for storing and manipulating data
2. An easy to put together Proof of Concept for your machine learning model where you let the end user work with a batch or a stochastic model output
3. A low maintenance interface for a permanent solution in an environment where server-run APIs are too costly to build and maintain
4. Linking legacy Excel-based analytics with Python and its many data and modelling libraries (pandas, numpy, scipy, scikit-learn, etc.)

TL;DR – see the Summary table with main features compared.

# PyXLL

PyXLL is an established product that has been developed and maintained by Tony Roberts since 2010, and it is currently on its 5th major version.

Ease of Use

It is straightforward to download and install the PyXLL addin, with minimal changes to its main config file to tell it where Python is, as well as where the Python modules are.

Functions can be defined in Python as usual, with a few additional changes to make them PyXLL compatible by adding a decorator with explicit types in the signature.

PyXLL supports all standard Python data types, as well as pandas dataframes and numpy arrays. For the last two you need to know what these types are called in PyXLL (‘dataframe’ and ‘numpy_array’), because the names are required to define the decorators properly. This is a small hurdle.

Portability

Can you integrate existing Python code with Excel using PyXLL? Yes, you can. Can you do it without having to change your code? No. To expose existing Python code you need to import the pyxll library and decorate each function with the @xl_func decorator. An additional specification is needed to indicate the input and output data types for arrays and dataframes, which can appear a bit non-Pythonic (e.g. a function that accepts an array and returns a 2-d array of strings would need the signature var x: string[][]).
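
As a minimal sketch of what this porting step can look like, the snippet below keeps an existing function untouched and adds a thin decorated wrapper for Excel. The function names and the `float[][] rows: float[]` signature string are illustrative assumptions that follow the signature form quoted above, and the fallback decorator is a stand-in added only so that the module can also be imported where PyXLL is not installed.

```python
try:
    from pyxll import xl_func  # available when PyXLL is installed
except ImportError:
    # Assumption: a no-op stand-in so the module also imports outside Excel
    def xl_func(signature=None, **kwargs):
        def decorator(func):
            return func
        return decorator


def column_means(rows):
    """Existing plain-Python code: mean of each column of a 2-d list."""
    n_rows = len(rows)
    return [sum(column) / n_rows for column in zip(*rows)]


# Hypothetical wrapper: the signature string declares the Excel-facing types
@xl_func("float[][] rows: float[]")
def xl_column_means(rows):
    """Thin Excel-facing wrapper around the unchanged function."""
    return column_means(rows)


print(xl_column_means([[1.0, 2.0], [3.0, 4.0]]))
```

Keeping the decorator on a wrapper rather than on the original function is one way to limit how much of an existing codebase has to change.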

Object Caching

PyXLL supports object caching in Excel, which is handy for passing objects such as class instances or large pandas dataframes to Excel. Through its config it is possible to control how cached objects are recalculated when Excel re-opens. Cached objects can also be serialised and saved as part of the Excel metadata. I am not sure this is a hugely useful feature, since very large objects can always be stored in an in-memory database like SQLite.

Support for Python Async and Real Time Data

PyXLL supports asynchronous functions and Real Time Data streaming. There is an example of an async RTD class being used on the PyXLL documentation site. Note that you need to import PyXLL’s RTD class and inherit from it to ‘switch on’ this functionality. The rest looks like a standard Excel RTD interface, i.e. the need to define connect() and disconnect() methods.

RTD is useful when developing stochastic machine learning models, and async can be useful when working within the reinforcement learning framework. Note that PyXLL also supports Excel async functions.

Logging and Debugging

While testing PyXLL I checked the contents of its logs and found it to be easy to read. PyXLL allows you to customize log formatting, verbosity and location via its config file, as well as the max size the logs can grow to. All of which seems straightforward and simple. I noticed that my own code logs and the addin logs were going to the same log file.

Support for jupyter notebooks and Plotting

Since 2020 PyXLL has supported two cool features: integrating Jupyter notebooks and Python-generated plots into Excel. Both are useful if you are looking for a seamless merge between a flexible coding environment like Jupyter and Excel. However, if your end user is not a Python developer, this will not add much value.

You need to download and install the version of the addin that matches the version of Python you have locally. And yes, you need to have Python installed locally, as PyXLL does not come with one. After downloading the addin, installation is quite simple with pip and the command line tool. Once it is installed, you need to edit a config file to tell PyXLL where the Python executable is and where the modules to be exposed to Excel are. Most PyXLL configuration is done via this single config file.

Documentation and Support

PyXLL is a mature addin and there is a great deal of documentation and examples on its website. There is also a YouTube channel with a few interesting PyXLL usage videos (a chat bot example). There is 24/7 support via email for all users.

Licencing and Pricing

# xlwings

xlwings has been around since 2014 and it is part of the whole ecosystem built around Excel, developed and maintained by two ex-investment bankers, Felix Zumstein and Björn Stiel. The main library is part of the Python Anaconda distribution, and can also be installed with pip. Advanced features of xlwings PRO are available on a paid subscription basis, if used for commercial purposes. This review will focus on using the main library to write Excel UDFs in Python.

Ease of Use

Since xlwings is part of the Anaconda distribution, I did not have to install additional software or libraries. Installing the addin is easy on the command line.

Once that was done, testing a few simple UDFs in Excel took a surprisingly long time. Firstly, configuring xlwings was not straightforward, and the documentation does not cover all possible pitfalls. In my case I was using a virtual environment, and xlwings could not find numpy or pandas after I set the interpreter path to the virtual env (as suggested in the xlwings Troubleshooting section). Then my anti-virus software shut down Excel when I tried to register the UDFs, which I resolved by manually telling it to ignore Excel activity. Finally, while working with the third recovered version of the Excel file, the VBA reference to xlwings was lost (it had been set at the start, as helpfully mentioned in the installation steps), and all tested UDFs returned only ‘Object required’. This was resolved by re-adding the VBA reference. So the start-up time was considerably longer than with PyXLL or xlSlim. However, once I got it to work, it worked smoothly.

xlwings supports all standard Python data types, as well as pandas dataframes, series and numpy arrays.

Portability

At the time of writing, xlwings requires at least Python 3.7. You do need to import the xlwings library and add multiple decorators to existing methods to turn them into Excel UDFs. The decorators are not tricky in themselves, but they do slow down the porting of an existing codebase. So, as with PyXLL, one cannot turn existing modules into Excel UDFs without any code changes.
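
The stacking of decorators can be sketched as follows. The `func`, `arg` and `ret` decorators are the ones xlwings provides for UDFs; the function `add_total` and its column name are hypothetical, and the fallback decorators are stand-ins added only so that the snippet also runs where xlwings is not installed.

```python
import pandas as pd

try:
    from xlwings import func, arg, ret
except ImportError:
    # Assumption: no-op stand-ins so the module also imports without xlwings
    def func(f):
        return f

    def arg(*args, **kwargs):
        return lambda f: f

    ret = arg


@func
@arg("table", pd.DataFrame, index=False, header=True)  # range -> DataFrame
@ret(index=False, header=True)                         # DataFrame -> range
def add_total(table):
    """Hypothetical UDF: append a column holding the sum of each row."""
    out = table.copy()
    out["total"] = out.sum(axis=1)
    return out


print(add_total(pd.DataFrame({"a": [1, 2], "b": [3, 4]})))
```

Each converted argument and the return value need their own decorator, which is where the porting effort mentioned above comes from.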

Object Caching

The tested xlwings version, 0.27.14, does not support object caching.

Support for Python Async and Real Time Data

xlwings provides support for offloading Python execution to an async thread, which is useful for long-running processes. However, there is no support for RTD in the tested version.

Logging and Debugging

The default behaviour of xlwings is to send standard output and error to a console. This means that as you interact with your UDFs in Excel, a console window periodically pops up to inform you about what is happening (e.g. xlwings server is running on an event loop, etc.). COM and other internal errors also appear in the console, while Python errors from user code are shown in a pop-up message window.

Adding logging to Python code will send user code logs and xlwings logs to one log file.

Support for jupyter notebooks and Plotting

xlwings lets users interact with Excel from Jupyter notebooks by providing view and load functionality. This is useful since it can speed up exploratory data analysis and simplify code.

xlwings comes with the Anaconda distribution or can be installed via pip for Python 3.7+. Adding the addin to Excel is achieved by running an installation command. However, users may still run into configuration or security problems, as I did. I found that the xlwings GitHub issues are a great source of information on how to resolve these.

Excel Workbook settings can be controlled either through xlwings.conf, directly through the Excel ribbon, or by adding a sheet with the same config name.

Documentation and Support

xlwings is mature and goes back to 2014. Over the years it has built up a loyal user base, and being an open-source library, has seen contributions from other developers. It comes with a great set of examples on its main website, fully documented API reference, a YouTube channel and even an O’Reilly published book where several chapters are dedicated to the addin.

Its professional version gets dedicated support as well as access to a video training course.

Licencing and Pricing

The library is free and open-sourced. Its PRO version which, among other things, provides Excel-embedded code and a custom installer, if used for commercial use, starts at \$590 per user per.

# Summary

Disclaimer: I have helped to test some of xlSlim’s functionality and have made small contributions to its documentation and examples.

# Introduction to Correspondence Analysis

In this blog I will introduce Correspondence Analysis, a visualisation technique for categorical data. All the code has been compiled in my github repository.

Correspondence Analysis (CA) has been around for a very long time. It was first developed in the 1930s, and made popular by M. Greenacre in the 1980s. It is an established statistical analysis technique with dedicated annual symposiums and a sufficient amount of literature covering theory and applications. In spite of its popularity, I have only recently discovered it, and thought that it would be worthwhile to document the fundamentals on my blog.

### What Exactly is Correspondence Analysis?

CA is a visualisation technique that can be applied to categorical data for data exploration. Unlike numerical data, categorical features are harder to analyse and visualise. CA uses a matrix decomposition method, namely SVD, and thus you may see CA being likened to Principal Component Analysis (PCA). However, CA is not, strictly speaking, a PCA for categorical data, mostly because the primary objective of CA is to provide a visualisation of associations among categorical features.

How does one visualise categorical data? CA is based on the simple concept of a contingency table. A contingency table is a tabulation of frequencies of how categorical values are distributed by variables. This blog will be using examples from P. Yelland’s article on CA published in the Mathematica journal[1]. I will translate his Mathematica code to Python (because Python is awesome). In [1] we find CA applied to textual analysis, where passages by a few authors are analysed by the frequency of letters. The five authors and the letters are shown below:

authors = ["Charles Darwin", "Rene Descartes","Thomas Hobbes", "Mary Shelley", "Mark Twain"]
initials=['CD1','CD2','CD3','RD1','RD2','RD3','TB1','TB2','TB3','MS1','MS2','MS3','MT1','MT2','MT3']
chars=["B", "C", "D", "F", "G", "H", "I", "L", "M", "N","P", "R", "S", "U", "W", "Y"]


The contingency table built from how often these letters appear in three passages per author is:

sampleCrosstab=[[34, 37, 44, 27, 19, 39, 74, 44, 27, 61, 12, 65, 69,22, 14, 21],
[18, 33, 47, 24, 14, 38, 66, 41, 36,72, 15, 62, 63, 31, 12, 18],
[32, 43, 36, 12, 21, 51, 75, 33, 23, 60, 24, 68, 85,18, 13, 14],
[13, 31, 55, 29, 15, 62, 74, 43, 28,73, 8, 59, 54, 32, 19, 20],
[8, 28, 34, 24, 17, 68, 75, 34, 25, 70, 16, 56, 72,31, 14, 11],
[9, 34, 43, 25, 18, 68, 84, 25, 32, 76,14, 69, 64, 27, 11, 18],
[15, 20, 28, 18, 19, 65, 82, 34, 29, 89, 11, 47, 74,18, 22, 17],
[18, 14, 40, 25, 21, 60, 70, 15, 37,80, 15, 65, 68, 21, 25, 9],
[19, 18, 41, 26, 19, 58, 64, 18, 38, 78, 15, 65, 72,20, 20, 11],
[13, 29, 49, 31, 16, 61, 73, 36, 29,69, 13, 63, 58, 18, 20, 25],
[17, 34, 43, 29, 14, 62, 64, 26, 26, 71, 26, 78, 64, 21, 18, 12],
[13, 22, 43, 16, 11, 70, 68, 46, 35,57, 30, 71, 57, 19, 22, 20],
[16, 18, 56, 13, 27, 67, 61, 43, 20, 63, 14, 43, 67,34, 41, 23],
[15, 21, 66, 21, 19, 50, 62, 50, 24, 68, 14, 40, 58, 31, 36, 26],
[19, 17, 70, 12, 28, 53, 72, 39, 22, 71, 11, 40, 67,25, 41, 17]]
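
Each row of such a table is simply a count of the selected letters in one passage. A minimal sketch of how one row could be produced is shown here; the helper `letter_counts` and the sample passage are hypothetical and are not taken from [1].

```python
from collections import Counter

chars = ["B", "C", "D", "F", "G", "H", "I", "L", "M", "N",
         "P", "R", "S", "U", "W", "Y"]


def letter_counts(passage, letters):
    """Count how often each of the given letters occurs in a passage."""
    counts = Counter(ch for ch in passage.upper() if ch in letters)
    # Keep the column order of `letters`; absent letters count as zero
    return [counts[letter] for letter in letters]


# Hypothetical passage, used only to illustrate one row of the table
row = letter_counts("Nothing is so painful to the human mind as a great "
                    "and sudden change.", chars)
print(dict(zip(chars, row)))
```

Stacking one such row per passage gives a table with the same layout as sampleCrosstab.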


Can you spot any differences in the use of letters by author from sampleCrosstab? It is almost impossible to do so by just looking at it. Instead, CA resorts to the $\chi^2$ statistic.

### Chi-Squared Statistic and Chi-Squared Distances

Pearson’s $\chi^2$ test of independence can be used to say with reasonable certainty if the distribution of letters differs from one author to another. $\chi^2$ is defined as:

$\chi^2 = \sum_{i}\sum_{j}\frac{\left(n_{ij}-\frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}$ (1)

Where $n$ is the total number of frequencies, $n_{ij}$ is the letter frequency in row $i$ and column $j$, and $n_{i.}$ and $n_{.j}$ are the total frequencies in row $i$ and column $j$ respectively. The product of $n_{i.}$ and $n_{.j}$ normalised by $n$ is the expected frequency for $n_{ij}$ under the independence assumption. Let’s call it independenceModel. The greater $\chi^2$ is, the greater is the certainty that the use of these letters differs by author. We can calculate this statistic in Python as follows:

import numpy as np
from scipy.stats import chi2_contingency

grandTotal = np.sum(sampleCrosstab)
# Normalised frequencies: every cell divided by the grand total
correspondenceMatrix = np.divide(sampleCrosstab, grandTotal)
rowTotals = np.sum(correspondenceMatrix, axis=1)
columnTotals = np.sum(correspondenceMatrix, axis=0)

# Expected proportions under the independence assumption
independenceModel = np.outer(rowTotals, columnTotals)

# Calculate manually
chiSquaredStatistic = grandTotal*np.sum(np.square(correspondenceMatrix-independenceModel)/independenceModel)
print(chiSquaredStatistic)

# Quick check - compare to scipy Chi-Squared test
statistic, prob, dof, ex = chi2_contingency(sampleCrosstab)
print(statistic)
print(np.round(prob, decimals=2))



In the above code correspondenceMatrix holds normalised frequencies. The $\chi^2$ statistic is 448.50, which is very unlikely to be observed under the null hypothesis (that the letter frequencies follow the same distribution). Having established this, we can continue with the CA as we now know that it should be able to show us some meaningful associations.
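
To make "very unlikely" concrete, the tail probability can be computed directly from the statistic and the degrees of freedom, which for an $r \times c$ contingency table are $(r-1)(c-1)$. This is a small sketch that uses the rounded statistic quoted above rather than recomputing it from the table.

```python
from scipy.stats import chi2

n_rows, n_cols = 15, 16                  # text samples x letters
dof = (n_rows - 1) * (n_cols - 1)        # degrees of freedom of the test

# Survival function = probability of a statistic at least this large
# if the letter distribution were the same for every sample
p_value = chi2.sf(448.50, dof)

print(dof)
print(p_value)
```

The same two quantities are returned as `dof` and `prob` by `chi2_contingency`.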

For the purposes of CA, the differences between the distributions of letters in the text samples are measured by $\chi^2$-distances, which are weighted Euclidean distances between normalised rows. The normalised rows are calculated by dividing row entries by their respective row totals. The weights are inversely proportional to the square roots of the column totals. The $\chi^2$-distance between row $i$ and row $k$ is defined as:

$\chi^2_{distance_{ik}} = \sqrt{\sum_{j}\frac{(p_{ij}/p_{i.} - p_{kj}/p_{k.})^2}{p_{.j}}}$ (2)

import pandas as pd

# Pre-calculate normalised rows (each row divided by its row total)
norm_correspondenceMatrix = np.divide(correspondenceMatrix, rowTotals[:, None])

nRows = correspondenceMatrix.shape[0]
chiSquaredDistances = np.zeros((nRows, nRows))

for row in range(nRows):
    # Squared differences to every other row, weighted by inverse column totals
    chiSquaredDistances[row] = np.sqrt(np.sum(np.square(norm_correspondenceMatrix
        - norm_correspondenceMatrix[row])/columnTotals, axis=1))

# Assumed: the sample labels are the initials defined earlier
authorSamples = initials
# Save distances (scaled by 100 and rounded) to the DataFrame
dfchiSquaredDistances = pd.DataFrame(data=np.round(chiSquaredDistances*100).astype(int), columns=authorSamples)

print(dfchiSquaredDistances)



In (2) I switched to notation with $p_{ij}$, which is simply every entry in correspondenceMatrix (i.e. letter frequencies normalised by the grand total). dfchiSquaredDistances contains:

### Chi-Squared Distances In Graphical Form

CA provides a means of representing a table of $\chi^2$-distances in a graphical form. This is where the similarity with PCA comes in. To calculate such a representation we need to transform the distances to points in a Cartesian coordinate system. This is achieved by a singular value decomposition (SVD) of a matrix of standardised residuals:

$\Omega_{ij} = \frac{p_{ij}-\mu_{ij}}{\sqrt{\mu_{ij}}}$ (3)

where $\mu_{ij} = p_{i.}p_{.j}$ is the expected proportion under the independence assumption (independenceModel in the code).

standardizedResiduals = np.divide((correspondenceMatrix-independenceModel),np.sqrt(independenceModel))

u,s,vh = np.linalg.svd(standardizedResiduals, full_matrices=False)



We are after the row scores, which are coordinates of points in a high-dimensional space (14 dimensions in this case). These points are arranged so that the Euclidean distance between two points is equal to the $\chi^2$-distance between the two rows to which they correspond. The row scores are defined as:

$R = \delta_{r}\cdot U\cdot S$ (4)

where $U$ is the matrix of left singular vectors and $S$ is the diagonal matrix of singular values from the SVD. $\delta_{r}$ is a diagonal matrix made of the reciprocals of the square roots of the row totals.

deltaR = np.diag(np.divide(1.0,np.sqrt(rowTotals)))

rowScores=np.dot(np.dot(deltaR,u),np.diag(s))

dfFirstTwoComponents = pd.DataFrame(data=[l[0:2] for l in rowScores], columns=['X', 'Y'], index=initials)

print(dfFirstTwoComponents)



Extracting the first two components gives us:

Plotting these as points:

The plot clearly shows letter associations by author. Mark Twain’s and Charles Darwin’s samples stand out as significantly different from the rest.
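
The claim that Euclidean distances between the full row scores reproduce the $\chi^2$-distances can be checked numerically. The sketch below does this on a small made-up contingency table (an assumption for illustration, not data from [1]), following the same steps as equations (2) to (4).

```python
import numpy as np

# Hypothetical 4x3 contingency table, for illustration only
table = np.array([[10, 20, 30],
                  [20, 15, 5],
                  [5, 10, 25],
                  [12, 8, 9]], dtype=float)

P = table / table.sum()                  # correspondence matrix
r = P.sum(axis=1)                        # row totals
c = P.sum(axis=0)                        # column totals
expected = np.outer(r, c)                # independence model

# Equation (3): standardised residuals, then SVD
residuals = (P - expected) / np.sqrt(expected)
u, s, vh = np.linalg.svd(residuals, full_matrices=False)

# Equation (4): row scores, keeping all components
row_scores = (u * s) / np.sqrt(r)[:, None]


def pairwise_distances(X, weights):
    """Weighted Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2 * weights).sum(axis=2))


# Equation (2): chi-squared distances between normalised rows
chi2_dist = pairwise_distances(P / r[:, None], 1.0 / c)
euclid_dist = pairwise_distances(row_scores, np.ones(row_scores.shape[1]))

print(np.allclose(chi2_dist, euclid_dist))
```

The equality holds only when all components are kept; a plot of the first two components shows an approximation of these distances.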

Source and Reference: [1] P.Yelland, An Introduction to Correspondence Analysis. The Mathematica Journal 12, 2010 Wolfram Media, Inc.

# Hidden Technical Debt of Machine Learning – Play Now Pay Later

Last week I was lucky enough to attend the Strata Conference London 2017 for one day. The venue and the event are impressive in scale, participants and content. The quality of tutorials and talks, in general, was very good, and I have walked away with a few new ideas I wanted to share on my blog.

One of the most important lessons from the conference for me came from a reference to the NIPS’15 paper titled Hidden Technical Debt in Machine Learning Systems, written by Google researchers. The paper is about the long-term maintenance costs introduced by building machine learning (ML) models and systems. The argument is that such costs are hidden because they are not immediately apparent at the point of putting an ML model in production. It is important for data scientists to be aware of the complexity of the models they develop, the impact these models will have on their organisation, and how much they will cost to maintain.

According to the authors, there are three levels of technical complexity which contribute to technical debt in ML: the model itself can be complex and behave non-linearly to a given set of parameters, the model can be taking input from otherwise disparate systems, and the model’s output or its behavior can be complex and difficult to predict before it is released.

### ML Model Complexity

ML models entangle input signals from different systems together, making it difficult to avoid the CACE principle: Change Anything Change Everything. This principle applies to all aspects of ML, from parameters (think xgboost!), to input data to convergence thresholds and sampling methods. Isolation and servicing of modelling components is one of the proposed solutions.

### The Cost of Data Dependencies

Large ML systems have large and complex data dependencies, where data quality and any data assumptions can significantly affect the ML system output. ML system input data can be unstable, meaning it changes qualitatively and quantitatively over time. In some cases, the degree of dependency on one set of data vs. another may change. ML systems are unique in that their data dependencies are usually finer-grained (e.g. the input should not just be an integer, but an integer in a certain range). A lot of thinking and possibly investment should go into understanding such dependencies and controlling them. Check out kensu.io, a start-up I came across at the conference and the creator of Adalog, a product designed purely for this task.
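
A finer-grained dependency of this kind can be made explicit in code. The following is a minimal sketch of a range check on a single input; the feature name, the bounds and the function itself are hypothetical and only illustrate the idea.

```python
def validate_age(value, low=18, high=100):
    """Reject inputs that are not integers within the range the model expects."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"age must be an integer, got {type(value).__name__}")
    if not low <= value <= high:
        raise ValueError(f"age {value} is outside the expected range [{low}, {high}]")
    return value


print(validate_age(42))
```

Failing loudly at the boundary makes a silent change in an upstream data source visible before it reaches the model.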

### The Feedback Loop and Dealing with Changes

Live ML systems learn in real time and influence their own behavior. Sometimes it is necessary to choose static parameters, like prediction thresholds, for a model that is trained or parameterised on data that is dynamic in nature. As a result, the previous set of thresholds may no longer be valid on updated data. The authors highlight that comprehensive monitoring of ML system behavior is critical for long-term system reliability.

In summary, maintainable ML systems are costly and require an even higher level of technical competence and foresight from their developers. Testing, validation and monitoring of ML models should be considered an absolute must in organisations that are eager to reap their full benefits.
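
One simple form such monitoring could take is to track how often a fixed threshold fires on new data compared with the data it was chosen on. This is a sketch under that assumption; the function names and the tolerance are hypothetical and are not taken from the paper.

```python
def alert_rate(scores, threshold):
    """Fraction of model scores at or above the prediction threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def threshold_has_drifted(reference_scores, live_scores, threshold, tolerance=0.10):
    """Flag the threshold when the alert rate moves by more than `tolerance`."""
    change = alert_rate(live_scores, threshold) - alert_rate(reference_scores, threshold)
    return abs(change) > tolerance


reference = [0.1, 0.2, 0.6, 0.9]   # scores when the threshold was chosen
live = [0.6, 0.7, 0.8, 0.9]        # scores on more recent data
print(threshold_has_drifted(reference, live, threshold=0.5))
```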

# (Jan-17) Did You Know That?

A brand new idea for my blog in 2017 is a monthly Did You Know That digest where I am going to share with you m things (where m<=3) that I recently learnt and found to be useful. I am going to keep such digests short and simple, so as not to overwhelm you with verbiage and unnecessary details. This month’s top 3 Did You Know That items are:

• scikit-learn SGDClassifier – one learner, many tricks up its sleeve;
• GraphViz is integrated in scikit-learn – no need to import it separately!
• Zeppelin notebook from Apache – worth a look if you are into Python notebooks;

### scikit-learn SGDClassifier

This is a multi-classifier module that implements stochastic gradient descent. The loss parameter controls which model is used to train and perform classification. For example, loss=hinge will give a linear SVM, and loss=log will give a logistic regression. When should you use it? When your training data set does not fit into memory. Note that SGD also allows mini-batch learning.
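
Mini-batch learning can be sketched with `partial_fit`, which updates the model one chunk at a time so that the full training set never has to be in memory at once. The synthetic data and batch size below are assumptions for illustration. As far as I am aware, recent scikit-learn releases spell the logistic-regression loss `log_loss` rather than `log`, so `hinge` is used here to keep the sketch independent of the version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a data set that is read in chunks
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0,
                           random_state=0)

classes = np.unique(y)            # partial_fit needs all labels up front
clf = SGDClassifier(loss="hinge", random_state=0)   # linear SVM

batch_size = 200
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    clf.partial_fit(X[batch], y[batch], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```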

### GraphViz is Integrated in scikit-learn Decision Trees

If you have read all my blog posts, you may have come across this one where I put together some code to train a binary decision tree to recognize a hand in poker. The code is available on my github space. If you read the code, you will see that I defined a graph_decision_tree method that jumps through all the hoops to graph and save the images. But did you know that you don’t need to do all this work, since sklearn.tree has an export_graphviz function? If dsTree is an instance of DecisionTreeClassifier, then one can simply do:

from sklearn.tree import export_graphviz

export_graphviz(dsTree, out_file='dsTree.dot',
feature_names=['feature1', 'feature2'])


The .dot file can be converted to a .png file (if you have installed GraphViz) like this:

dot -Tpng dsTree.dot -o dsTree.png
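
For a self-contained illustration, the sketch below trains a small tree on the iris data set bundled with scikit-learn (an assumption, not the poker data from my earlier post) and returns the GraphViz source as a string by passing `out_file=None` instead of a file name.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
dsTree = DecisionTreeClassifier(max_depth=2, random_state=0)
dsTree.fit(iris.data, iris.target)

# With out_file=None the dot source is returned instead of written to disk
dot_source = export_graphviz(dsTree, out_file=None,
                             feature_names=iris.feature_names)

print(dot_source[:60])
```

The returned string has the same content as the .dot file and can be saved or passed on to the `dot` command.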

### Zeppelin Notebook from Apache

If you are using Apache Spark you may be glad to learn that Apache has a notebook to go along with it. Zeppelin notebook offers similar functionality to Jupyter in terms of data visualization, paragraph writing and notebook sharing. I recommend that you check it out.

# Data Analytics Models in Quantitative Finance and Risk Management

Sharing a KDNuggets article on some examples of how PCA, Monte Carlo and linear regression are used in quantitative finance and risk management: