## Introduction to Correspondence Analysis

In this blog I will introduce the Correspondence Analysis – a visualisation technique for categorical data. All the code has been compiled in my github repository.

Correspondence Analysis (CA) has been around for a very long time. It was first developed in the 1930-ies, and made popular by M. Greenacre in the 1980-ies. It is an established statistical analysis techniques with dedicated annual symposiums and sufficient amount of literature covering theory and applications. Inspite of its popularity, I have only recently discovered it, and thought that it is worthwhile to document the fundamentals on my blog.

### What Exactly is Correspondence Analysis?

CA is a visualisation technique that can be applied to categorical data for data exploration. Unlike numerical data, categorical features are harder to analyse and visualise. CA uses a matrix decomposition method, namely SVD, and thus you may see CA being likened to the Principle Components Analysis (PCA). However, CA is not, strictly speaking, a PCA for categorical data, mostly because the primary objective of CA is to provide a visualisation of associations among categorical features.

How does one visualise categorical data? CA is based on a simple concept of a contingency table. A contingency table is a tabulation of frequencies of how categorical values are distributed by variables. This blog will be using examples from P. Yelland’s article on CA published in the Mathematica journal[1]. I will translate his Mathematica code to Python (because Python is awesome). In [1] we find CA applied to textual analysis where passages of a few authors analysed by the frequency of letters. The five authors and the letters are shown below:

authors = ["Charles Darwin", "Rene Descartes","Thomas Hobbes", "Mary Shelley", "Mark Twain"]
initials=['CD1','CD2','CD3','RD1','RD2','RD3','TB1','TB2','TB3','MS1','MS2','MS3','MT1','MT2','MT3']
chars=["B", "C", "D", "F", "G", "H", "I", "L", "M", "N","P", "R", "S", "U", "W", "Y"]


The contingency table build from how often these letters appear in three passages per author are:

sampleCrosstab=[[34, 37, 44, 27, 19, 39, 74, 44, 27, 61, 12, 65, 69,22, 14, 21],
[18, 33, 47, 24, 14, 38, 66, 41, 36,72, 15, 62, 63, 31, 12, 18],
[32, 43, 36, 12, 21, 51, 75, 33, 23, 60, 24, 68, 85,18, 13, 14],
[13, 31, 55, 29, 15, 62, 74, 43, 28,73, 8, 59, 54, 32, 19, 20],
[8, 28, 34, 24, 17, 68, 75, 34, 25, 70, 16, 56, 72,31, 14, 11],
[9, 34, 43, 25, 18, 68, 84, 25, 32, 76,14, 69, 64, 27, 11, 18],
[15, 20, 28, 18, 19, 65, 82, 34, 29, 89, 11, 47, 74,18, 22, 17],
[18, 14, 40, 25, 21, 60, 70, 15, 37,80, 15, 65, 68, 21, 25, 9],
[19, 18, 41, 26, 19, 58, 64, 18, 38, 78, 15, 65, 72,20, 20, 11],
[13, 29, 49, 31, 16, 61, 73, 36, 29,69, 13, 63, 58, 18, 20, 25],
[17, 34, 43, 29, 14, 62, 64, 26, 26, 71, 26, 78, 64, 21, 18, 12],
[13, 22, 43, 16, 11, 70, 68, 46, 35,57, 30, 71, 57, 19, 22, 20],
[16, 18, 56, 13, 27, 67, 61, 43, 20, 63, 14, 43, 67,34, 41, 23],
[15, 21, 66, 21, 19, 50, 62, 50, 24, 68, 14, 40, 58, 31, 36, 26],
[19, 17, 70, 12, 28, 53, 72, 39, 22, 71, 11, 40, 67,25, 41, 17]]


Can you spot any differences in the use of letters by author from sampleCrosstab? It is almost impossible to do so by just looking at it. Instead, CA resorts to the $\chi^2$ statistic.

### Chi-Squared Statistic and Chi-Squared Distances

Pearson’s $\chi^2$ test of independence can be used to say with reasonable certainty if the distribution of letters differs from one author to another. $\chi^2$ is defined as:

$\chi^2 = \sum_{I}\sum_{J}\frac{(n_{ij}-(\frac{n_{i.}n_{.j}}{n}))^2}{\frac{n_{i.}n_{.j}}{n}}$ (1)

where $n$ is the total number of frequencies, $n_{ij}$ is the letter frequency in row $i$ and column $j$, and $n_{i.}$ and $n_{.j}$ are the total frequencies in row $i$ and column $j$ respectively. The product of $n_{i.}$ and $n_{.j}$ normalised by $n$ is the expected frequency for $n_{ij}$ under the independence assumption. Let’s call it independenceModel. The greater is $\chi^2$, the greater is the certainty that the use of these letters is different by author. We can calculate this statistic in Python as following:

grandTotal = np.sum(sampleCrosstab)
correspondenceMatrix = np.divide(sampleCrosstab,grandTotal)
rowTotals = np.sum(correspondenceMatrix, axis=1)
columnTotals = np.sum(correspondenceMatrix, axis=0)

independenceModel = np.outer(rowTotals, columnTotals)

#Calculate manually
chiSquaredStatistic = grandTotal*np.sum(np.square(correspondenceMatrix-independenceModel)/independenceModel)
print(chiSquaredStatistic)

# Quick check - compare to scipy Chi-Squared test
statistic, prob, dof, ex = chi2_contingency(sampleCrosstab)
print(statistic)
print(np.round(prob, decimals=2))



In the above code correspondenceMatrix holds normalised frequencies. The $\chi^2$ statistic is 448.50, which is very unlikely to be observed under the null hypothesis (that the letter frequencies follow the same distribution). Having established this, we can continue with the CA as we now know that it should be able to show us some meaningful associations.

For the purposes of CA, the differences between the distributions of letters in the text samples are measured by $\chi^2$-distances, which are weighted Euclidean distances between normalized rows. These are calculated by dividing row entries by their respective row totals. The weights are inversely proportional to the square roots of the column totals. $\chi^2$-distances between row i and row k are defined as:

$\chi^2_{distance_{ik}} = \sqrt{\sum_{J}\frac{(p_{ij}/p_{i.} - p_{kj}/p_{k.})^2}{p_{.j}}}$ (2)

# pre-calculate normalised rows
norm_correspondenceMatrix = np.divide(correspondenceMatrix,rowTotals[:, None])

chiSquaredDistances = np.zeros((correspondenceMatrix.shape[0],correspondenceMatrix.shape[0]))

norm_columnTotals = np.sum(norm_correspondenceMatrix, axis=0)
for row in range(correspondenceMatrix.shape[0]):
chiSquaredDistances[row]=np.sqrt(np.sum(np.square(norm_correspondenceMatrix
-norm_correspondenceMatrix[row])/columnTotals, axis=1))
# Save distances to the DataFrame
dfchiSquaredDistances = pd.DataFrame(data=np.round(chiSquaredDistances*100).astype(int), columns=authorSamples)

print(dfchiSquaredDistances)



In (2) I switched to notation with $p_{ij}$, which is simply every entry in correspondenceMatrix (i.e. letter frequencies normalised by the grand total). dfchiSquaredDistances contains:

### Chi-Squared Distances In Graphical Form

CA provides a means of representing a table of $\chi^2$-distances in a graphical form. This is where the similarity with the PCA analysis comes in. To calculate such representation we need to transform the distances to points in a Cartesian coordinate system. This is achieved by a singular value decomposition (SVD) of a matrix of standardised residuals:

$\Omega = \frac{p_{ij}-\mu_{ij}}{\sqrt{\mu_{ij}}}$ (3)

standardizedResiduals = np.divide((correspondenceMatrix-independenceModel),np.sqrt(independenceModel))

u,s,vh = np.linalg.svd(standardizedResiduals, full_matrices=False)



We are after the row scores, which are coordinates of points in a high-dimensional space (14 dimensions in this case). These points are arranged so that the Euclidean distance between two points is equal to the $\chi^2$-distance between the two rows to which they correspond. The row scores are defined as:

$R = \delta_{r}\cdot U\cdot S$ (4)

where $U$ and $S$ are the left singular vectors matrix and singular values on the diagonal matrix from SVD. The $\delta_{r}$ is diagonal matrix made of the reciprocals of the square roots of the row totals.

deltaR = np.diag(np.divide(1.0,np.sqrt(rowTotals)))

rowScores=np.dot(np.dot(deltaR,u),np.diag(s))

dfFirstTwoComponents = pd.DataFrame(data=[l[0:2] for l in rowScores], columns=['X', 'Y'], index=initials)

print(dfFirstTwoComponents)



Extracting the first two components gives us:

Plotting these as points:

The plot clearly shows letters associations by author. Mark Twain and Charles Darwin’s samples stand out as significantly different from the rest.

Source and Reference: [1] P.Yelland, An Introduction to Correspondence Analysis. The Mathematica Journal 12, 2010 Wolfram Media, Inc.

Posted in Data science, Machine Learning | Tagged

## Approximate Bayesian Computation

This is my first post in 2018. In this post I will share with you a very simple way of performing inference using Approximate Bayesian Computation (ABC) – not to be confused with Approximate Bootstrap Confidence interval, which is also “ABC”.

Let’s say we have observed some data, and are interested to test if there was a change in behaviour in whatever generated the data. For example, we could be monitoring the total amount that is spent/transferred from some account, and we would like to see if there was a shift in how much is being spent/transferred. Figure below shows what the data could look like. After we have eye-balled the graph, we think that all observations after item 43 belong to the changed behaviour (cutoff=43), and we separate the two by colour.

The first question that we can ask is about the means of the blue and the red regions: are they the same? In the figure above I am showing the mean and standard deviations for the two sets. We can run a basic bootstrap with replacement to check if the difference in the means is possibly accidental.

In the figure above basic_bootstrap generates a distribution of means of randomly sampled sets. The confidence interval is first computed as non-parametric. But a quick comparison with 95th CI using normal standard scores shows that the simulated and the non-simulated confidence intervals around the means are very close. Most importantly, the confidence intervals for the blue and red region means overlap, and thus we would have to accept the null hypothesis that the population means are the same and differences seen here are accidental.

Note how unsatisfying this result is. If we use some other test, like one-way ANOVA from scipy.stats.f_oneway, we get a p-value that is too high to accept an alternative hypothesis. However, if we plot the CDFs of the blue and the red data, we can clearly see that larger values are prevailing in the latter:

### Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) relates to probabilistic programming methods and allows us to quantify uncertainty more exactly than a simple CI. A pretty good summary of ABC can be found on Wikipedia. If we are monitoring transactions occurring over time, we may be interested in generating alerts when an amount is above a threshold (for example, your bank could have a monitoring system in place to safeguard you against credit card fraud). If, instead of comparing means of red and blue region, we decided to answer the question about how likely are we to see more trades above the threshold in the red vs. the blue data regions, we could use ABC.

To execute an ABC test on the difference in the number of trades above a threshold in the blue and red data regions, we begin by choosing the threshold! Take a look at the CDF plot above. We see that approximately half of red data is above 20. Whereas only 25% of blue data is above 20. Let’s set our threshold at 20. The ABC is a simple simulation algorithm where we repeatedly perform sample and compare steps. What can we sample here? We will sample from two normal distributions, each with the means set to the fraction of trades above our threshold. I will use Normal distribution, but it is purely a choice of convenience. Ok, what can we compare here? We will compare the number of trades that could have been above the threshold when the data they come from is sampled from the distributions we have chosen as our priors. And we store away the ones that are consistent with it. If we repeat this many times under the two parameterisations, we should build-up two distributions that can be used to answer the main question – how likely are we to obtain more trades above the chosen threshold in the red vs. the blue data sets. The code below does exactly that.

We obtain a very high probability of seeing more trades above the threshold in the red vs. the blue region.

Posted in Numerical Analysis, Python Programming | Tagged ,

## Absolute vs. Proportional Returns

It will be a safe assumption to make that people who read my blogs work with data. In finance, the data is often in form of asset prices or other market indicators like implied volatility. Analyzing price data often requires calculating returns (aka. moves). Very often we work with proportional returns or log returns. Proportional returns are calculated relative to the price level. For example, given any two historical prices $x_{t}$ and $x_{t+h}$, the proportional change is:

$m_{t,prop} = \frac{x_{t+h}-x_{t}}{x_t}$

The above can be shortened as $m_{t, prop} = \frac{x_{t+h}}{x_t}-1$. In contrast, absolute moves are defined simply as the difference between two historical price observations: $m_{t,abs} = x_{t+h}-x_{t}$.

How do you know which type of return is appropriate for your data? The answer depends on the price dynamic and the simulation/analysis task at hand. Historical simulation, often used in Value-at-Risk (VaR), requires calculating PnL strip from some sensitivity and a set of historical returns. For example, a VaR model for foreign exchange options may be specified to take into account PnL impact from changes in implied volatility skew. Here, the PnL is historically simulated using sensitivities of a volatility curve or surface and historical implied volatility returns for some surface parameter, like low risk reversal. You have a choice in how to calculate the volatility returns. The right choice can be determined with a simple regression.

Essentially, we need to look for evidence of dependency of price returns on price levels. In FX, liquid options on G21 currency pairs do not exhibit such dependency, while emerging market pairs do. I have not been able to locate a free source of implied FX volatility, but I have found two instruments that are good enough to demonstrate the concept. CBOE LOVOL Index is a low volatility index and can be downloaded for free from Quandl. For this example I took the close of day prices from 2012-2017. After plotting $log_{10}(ABS(x_{t}))$ vs. $log_{10}(ABS(m_{t,abs}))$ we look for the value of the slope of the fitted linear line. A slope closer to zero indicates no dependency, while a positive or negative slope shows that the two variables are dependent.

CBOE LVOL

In the absence of dependency, absolute returns can be used, while proportional return are otherwise more appropriate. Take a look at a plot of VXMT CBOE Mid-term Volatility Index. The fitted linear line has a slope of approximately 1.7. Historical simulation of VXMT is calling for proportional rather than absolute price moves.

CBOE VXMT

Posted in Home, Numerical Analysis | Tagged ,

## Hidden Technical Debt of Machine Learning – Play Now Pay Later

Last week I was lucky enough to attend the Strata Conference London 2017 for one day. The venue and the event are impressive in scale, participants and content. The quality of tutorials and talks, in general, was very good, and I have walked away with a few new ideas I wanted to share on my blog.

One of the most important lessons from the conference for me was from a reference to the NIPS’16 paper titled Hidden Technical Debt in Machine Learning System, written by Google researchers. The paper is about the long-term maintenance costs introduced by building machine learning (ML) models and systems. The argument is that such cost is hidden as it is not immediately apparent from the point of putting an ML model in production. For data scientists it is important to be aware of the complexity of the models they develop and what impact these models will have on their organisation and how much it will cost to maintain them.

According to the authors, there are three levels of technical complexity which contribute to technical debt in ML: the model itself can be complex and behave non-linearly to a given set of parameters, the model can be taking input from otherwise disparate systems, and the model’s output or its behavior can be complex and difficult to predict before it is released.

### ML Model Complexity

ML models entangle input signals from different systems together, making it difficult to avoid the CACE principle: Change Anything Change Everything. This principle applies to all aspects of ML, from parameters (think xgboost!), to input data to convergence thresholds and sampling methods. Isolation and servicing of modelling components is one of the proposed solutions.

### The Cost  of Data Dependencies

Large ML systems have large and complex data dependencies, where data quality and any data assumptions can significantly affect the ML system output. ML system input data can be unstable, meaning it changes qualitatively and quantitatively over time. In some cases, the degree of dependency on one set of data vs. another may change. The ML systems are unique because usually their data dependencies are finer (e.g. the input should not just be an integer, but an integer in a certain range). A lot of thinking and possibly investment should go into understanding such dependencies and controlling them. Check-out kensu.io – a start-up company I have come across at the conference, the creators of Adalog – a product designed purely for such task.

### The Feedback Loop and Dealing with Changes

Live ML systems learn in real time and influence their own behavior. Sometimes it is necessary to choose static parameters, like prediction thresholds, for a model that is trained or parameterised on  data that is dynamic in nature. Thus leading to the previous set of thresholds being no longer valid on updated data. The authors highlight that comprehensive monitoring of ML system behavior is critical for long-term system reliability.

In summary, maintainable ML systems are costly and require an even higher level of technical competence and foresight among its developers.  ML models testing, validation and monitoring should be considered as an absolute must in organisations that are eager to rip their full benefits.

Posted in Machine Learning | Tagged ,

## AI, Data Science and ML – 2016 developments and trends for 2017 – KDnuggets article

Sharing a link to the KDNuggets article with thoughts provided by a few data scientists on main 2016 developments and what the trends for 2017 look like. Reinforcement Learning is mentioned more than once!

http://www.kdnuggets.com/2017/01/ai-data-science-machine-learning-key-trends.html

Posted in Home | Tagged