Build Your Business a Product Tool with Word2Vec and xlSlim

Introduction

The idea behind word2vec is based on a linguistic hypothesis called the distributional hypothesis, which states that words that occur in the same or similar contexts tend to have similar meanings. This hypothesis applies to context-to-words relations, and to words-to-context. Thus, similar words should occur in similar contexts, and similar contexts should have similar words.

In NLP, the meaning of words can be represented by their embeddings – numerical vector representations in the multi-dimensional vector space. Embeddings can be ‘learnt’ from text using linear algebra techniques like Singular Value Decomposition (SVD), Non-Negative Matrix Factorization or other decompositions.

The algorithm for word2vec is an example of learning word embedding from text that allows for performing arithmetic on the learned embeddings. Here embeddings are learnt through gradient descent by training a shallow neural network. There is a ready implementation of word2vec in Python in the genism library. Apart from word2vec, one can use the Embeddings module from TensorFlow keras. Yet another approach involves using FastText – an optimization on word2vec where words’ morphology and sub-words are leveraged.

In this blog we pretend that we are data scientists at a small online retailer and develop an easy to use Excel-based tool using word2vec to help our company to do the following:

  • Find products that are similar to a given product based on on-line browsing behaviour of our customers. This can be used to offer similar products for customers who call us to place an order for an item that is out of stock.
  • Find products to recommend as next purchase based on on-line browsing behaviour of our customers.

The above functionality will be offered via simple functions in Excel, exposed via the xlSlim addin. Let’s get started!

The Data

The data for this blog comes from Kaggle. There are two data files available to download, and we will use the one from Oct-2019.  The data contains anonymized records for the items customers viewed, added to their cart and purchased. We only use records for items that customers viewed.

Since word2vec is an NLP model, let’s make the connection between viewing items and words in sentences explicit:

In each online session of a customer, viewed items’ product ids are treated as words and the session is treated as a sentence/context.

The required data pre-processing will transform viewed item records into a lists of lists, each list being a unique customer session. It is quite simple and includes steps for:

  • Keeping records for viewed items only (40,779,399 records)
  • Removing records with N/A category codes (reduced to 27,542,941 records)
  • Remove duplicate views of the same item on the same date by the same customer (reduced to 17,309,221 records)
  • Reduce data to only records of at least two viewed items in a given day (reduced to 15,260,646 records)

The list of lists for a session-generated viewed product is the input into the genism Word2Vec model.

The Model

We use the genism implementation of Word2Vec. The model is initialised with the following parameters:

  • min_count = 2, meaning the product id must appear at least twice in the corpus (i.e. in the list of lists) to be used
  • vector_size = maximum length of an online session in the number of viewed items. This happens to be 1,013. This parameter determines the length of each word vector, i.e. the number of embeddings.

Note that by default, we train a CBOW architecture.

The model is trained using all data and saved as a .model object using genism built-in utilities. All model code is available on my github repository. The code blocks below shows the implementation of the methods to get similar products and recommended product:

def get_similar_product(model:Word2Vec, df:pd.DataFrame, product_id:int)->pd.DataFrame:
"""
Parameters
———-
model : instance of Word2Vec
df : dataframe with preprocessed data
product_id : int
unique product id for which we need to find similar product ids
Returns
——-
dataframe with similar products
"""
try:
sim_product = model.wv.most_similar(positive=[str(product_id)])
return df.loc[df['product_id'].isin([int(word[0])
for word in sim_product])][[
'category_code',
'brand',
'product_id']].drop_duplicates()
except KeyError:
return f"Cannot find the specified product with id {product_id}"
def recommend_next_purchase(model: Word2Vec, df:pd.DataFrame, user_id:int)->pd.DataFrame:
"""
Parameters
———-
model : instance of Word2Vec
df : dataframe with preprocessed data
user_id : int
unique user id for whom we make recommendations
Returns
——-
dataframe with recommended products
"""
try:
# Find the products the user browsed
viewed_products = df.loc[df['user_id']==user_id]['product_id'].unique()
# Get recommendations for next purchase
output_words = model.predict_output_word([str(product) for product in viewed_products])
return df.loc[df['product_id'].isin([int(word[0])
for word in output_words])][[
'category_code',
'brand',
'product_id']].drop_duplicates()
except KeyError:
return f"Cannot find the specified user with id {user_id}"

It takes approximately 5 minutes to train and save the model. The idea is that we, the data scientists, pre-train our word2vec model and provide the end users with an Excel file that has a few functions. One function will let the user to get a product or a category recommendation for a user id. Another function can be used to get similar products for a product id. Note that the config file controls the model and data parameters which would otherwise be hard coded in the Python code. The path to the config file is the only hard-coded global parameter.

User Application

For the user interface we implement two functions, one to find similar products, and another to recommend either the category or products to buy next.

A simple model interface where both functions are exposed in Excel can be built using the xlSlim addin. xlSlim is very easy to use and one can set-up an Excel based tool from Python methods in minutes. Note that we can update the model with new data, but the end user would not care or know about it, as long as the interface code has access to the model object.

My github page has the Excel file, with the formulas are as shown below:

The formulas register the Python module that implements our two functions and loads the model and the data objects.

get_similar_product_from_handle()

and 

recommend_next_purchase_from_handle()

are made available to Excel by xlSlim. We pass to these functions a handle to the model object, and a handle to the dataframe. It is the need to work with the handle to the dataframe that requires to use .._from_handle() versions of the exposed method.

Note that the file needs to be opened from the xlSlim application, which seamlessly loads and activates the addin.

Entering a known product_id (a child’s carriage) for a similar product gives the following output:

Entering a known user_id (a user who has viewed computer memory and video cards) for product recommendations gives the following output:

Conclusion

Word2Vec is a powerful model that goes beyond human language applications. Using Excel and an addin that can expose Python functionality as Excel functions (e.g. xlSlim) is a quick and easy way to build user interfaces either to be used as a proof of concept or as a permanent flexible solution. Not all companies have the IT and data engineering resources to run servers and support web-based APIs, so, Excel is definitely a viable alternative.

Don’t Get in a Pickle with a Python namedtuple

In this blog I will show you what happens when you want to pickle an object that contains a Python namedtuple.

Python’s namedtuple is high-performance data type that lets us define a custom type which behaves like a tuple. For example, the following piece of code defines a new type Viewer, creates an instance of it and initialises its attributes:

    from collections import namedtuple

    Viewer = namedtuple('Viewer', 'gender age points')
    viewer = Viewer('X', 25, 356)

In the above, line 3 defines a new type Viewer, and line 4 defines and initialises a new variable viewer of type Viewer. viewer behaves like a tuple in a sense that it has built-in methods count() and index() and allows access to attributes via indexing or named arguments. For example:

    print(viewer[2])         # prints 356
    print(viewer.age)        # prints 25
    print(viewer.count('X')) # prints 1

Note that unlike with a list or a dict, to work with namedtuples we need to perform two operations: (1) define the new type, (2) create a new instance of it. Also note that the same two steps are followed when we work with classes. And a namedtuple is just a dynamically named class type. But how exactly does this dynamic part works? It works because when we define a new type (line 3 in the first code snippet), we are actually calling a factory function namedtuple that does the dynamic ‘stuff’ for us (i.e. returns a sub-class of a tuple that is named as what we specify in the function call).

Let’s see what happens when we create a class with a namedtuple member.

import pickle
from collections import namedtuple
import datetime as dt


class ViewerClass(object):

    # class-level type definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

    def get_start_date(self):
        return self._my_vt.start_date

    def get_monthly_views(self):
        return self._my_vt.mon_views

    def get_monthly_streams(self):
        return self._my_vt.mon_streams

    def get_registration_details(self):
        return (
            'Name:'
            + self._my_vt.name
            + ' DOB:'
            + str(self._my_vt.dob)
        )

    def update_monthly_stream(self, new_mon_streams):
        self._my_vt.mon_streams = new_mon_streams

    def update_monthly_views(self, new_mon_views):
        self._my_vt.mon_views = new_mon_views


if __name__ == '__main__':

    viewer1 = ViewerClass(
        dt.date(2019, 1, 1),
        5,
        6234.80,
        'John',
        dt.date(1989, 12, 3),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer1.get_registration_details(),
            viewer1.get_monthly_streams(),
        )
    )

    viewer2 = ViewerClass(
        dt.date(2019, 2, 1),
        5,
        5234.80,
        'Mary',
        dt.date(1989, 11, 11),
    )
    print(
        "Viewer {} has streamed for {} seconds this month.".format(
            viewer2.get_registration_details(),
            viewer2.get_monthly_streams(),
        )
    )

    print(type(viewer1))
    print(type(viewer1._my_vt))

The output of the print statements points to a potential problem that can occur if we try to pickle the viewer objects:

It turns out that the protected variable is of type ‘__main__.vt’ but not ‘__main__.ViewerClass.vt’. And if we try to pickle viewer1 we are going to get this error:

_pickle.PicklingError: Can’t pickle <class ‘__main__.vt’>: attribute lookup vt on __main__ failed

This error should make sense because vt is not defined within __main__, but is defined within __main__.ViewerClass, and thus is not visible to pickle as a subclass of a class.

There are several ways to fix this.

First, we can move the definition of vt outside of ViewerClass to the __main__. This will let pickle find vt at the level it is looking for it:

# module-level type definition
vt = namedtuple(
    'vt', 'start_date mon_views mon_streams name dob'
)


class ViewerClass(object):
    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = vt(
            start_date, mon_views, mon_streams, name, dob
        )

    ...

Second solution involves changing a built-in private variable __qual_name__ to that of the class name:

import pickle
from collections import namedtuple
import datetime as dt


class ViewerClass(object):

    # class-level definition
    vt = namedtuple(
        'vt', 'start_date mon_views mon_streams name dob'
    )
    vt.__qualname__ = 'ViewerClass.vt'

    def __init__(
        self, start_date, mon_views, mon_streams, name, dob
    ):
        self._my_vt = ViewerClass.vt(
            start_date, mon_views, mon_streams, name, dob
        )

    ...

This fixes the issue and makes viewer1._my_vt of type ‘__main__.ViewerClass.vt’, under which pickle can look it up.

I must say that I prefer the first solution, since sub-classing from the ViewerClass may prove to be problematic, and we should avoid modifying private variables.

Introduction to Correspondence Analysis

In this blog I will introduce the Correspondence Analysis – a visualisation technique for categorical data. All the code has been compiled in my github repository.

Correspondence Analysis (CA) has been around for a very long time. It was first developed in the 1930-ies, and made popular by M. Greenacre in the 1980-ies. It is an established statistical analysis techniques with dedicated annual symposiums and sufficient amount of literature covering theory and applications. Inspite of its popularity, I have only recently discovered it, and thought that it is worthwhile to document the fundamentals on my blog.

What Exactly is Correspondence Analysis?

CA is a visualisation technique that can be applied to categorical data for data exploration. Unlike numerical data, categorical features are harder to analyse and visualise. CA uses a matrix decomposition method, namely SVD, and thus you may see CA being likened to the Principle Components Analysis (PCA). However, CA is not, strictly speaking, a PCA for categorical data, mostly because the primary objective of CA is to provide a visualisation of associations among categorical features.

How does one visualise categorical data? CA is based on a simple concept of a contingency table. A contingency table is a tabulation of frequencies of how categorical values are distributed by variables. This blog will be using examples from P. Yelland’s article on CA published in the Mathematica journal[1]. I will translate his Mathematica code to Python (because Python is awesome). In [1] we find CA applied to textual analysis where passages of a few authors analysed by the frequency of letters. The five authors and the letters are shown below:

authors = ["Charles Darwin", "Rene Descartes","Thomas Hobbes", "Mary Shelley", "Mark Twain"]
initials=['CD1','CD2','CD3','RD1','RD2','RD3','TB1','TB2','TB3','MS1','MS2','MS3','MT1','MT2','MT3']
chars=["B", "C", "D", "F", "G", "H", "I", "L", "M", "N","P", "R", "S", "U", "W", "Y"]

The contingency table build from how often these letters appear in three passages per author are:

sampleCrosstab=[[34, 37, 44, 27, 19, 39, 74, 44, 27, 61, 12, 65, 69,22, 14, 21],
                [18, 33, 47, 24, 14, 38, 66, 41, 36,72, 15, 62, 63, 31, 12, 18],
                [32, 43, 36, 12, 21, 51, 75, 33, 23, 60, 24, 68, 85,18, 13, 14],
                [13, 31, 55, 29, 15, 62, 74, 43, 28,73, 8, 59, 54, 32, 19, 20],
                [8, 28, 34, 24, 17, 68, 75, 34, 25, 70, 16, 56, 72,31, 14, 11], 
                [9, 34, 43, 25, 18, 68, 84, 25, 32, 76,14, 69, 64, 27, 11, 18],
                [15, 20, 28, 18, 19, 65, 82, 34, 29, 89, 11, 47, 74,18, 22, 17], 
                [18, 14, 40, 25, 21, 60, 70, 15, 37,80, 15, 65, 68, 21, 25, 9],
                [19, 18, 41, 26, 19, 58, 64, 18, 38, 78, 15, 65, 72,20, 20, 11], 
                [13, 29, 49, 31, 16, 61, 73, 36, 29,69, 13, 63, 58, 18, 20, 25],
                [17, 34, 43, 29, 14, 62, 64, 26, 26, 71, 26, 78, 64, 21, 18, 12],
                [13, 22, 43, 16, 11, 70, 68, 46, 35,57, 30, 71, 57, 19, 22, 20],
                [16, 18, 56, 13, 27, 67, 61, 43, 20, 63, 14, 43, 67,34, 41, 23], 
                [15, 21, 66, 21, 19, 50, 62, 50, 24, 68, 14, 40, 58, 31, 36, 26],
                [19, 17, 70, 12, 28, 53, 72, 39, 22, 71, 11, 40, 67,25, 41, 17]]

Can you spot any differences in the use of letters by author from sampleCrosstab? It is almost impossible to do so by just looking at it. Instead, CA resorts to the \chi^2 statistic.

Chi-Squared Statistic and Chi-Squared Distances

Pearson’s \chi^2 test of independence can be used to say with reasonable certainty if the distribution of letters differs from one author to another. \chi^2 is defined as:

\chi^2 = \sum_{I}\sum_{J}\frac{(n_{ij}-(\frac{n_{i.}n_{.j}}{n}))^2}{\frac{n_{i.}n_{.j}}{n}} (1)

Where n is the total number of frequencies, n_{ij} is the letter frequency in row i and column j , and n_{i.} and n_{.j} are the total frequencies in row i and column j respectively. The product of n_{i.} and n_{.j} normalised by n is the expected frequency for n_{ij} under the independence assumption. Let’s call it independenceModel. The greater is \chi^2 , the greater is the certainty that the use of these letters is different by author. We can calculate this statistic in Python as following:

grandTotal = np.sum(sampleCrosstab)
correspondenceMatrix = np.divide(sampleCrosstab,grandTotal)
rowTotals = np.sum(correspondenceMatrix, axis=1)
columnTotals = np.sum(correspondenceMatrix, axis=0)

independenceModel = np.outer(rowTotals, columnTotals)

#Calculate manually
chiSquaredStatistic = grandTotal*np.sum(np.square(correspondenceMatrix-independenceModel)/independenceModel)
print(chiSquaredStatistic)

# Quick check - compare to scipy Chi-Squared test
statistic, prob, dof, ex = chi2_contingency(sampleCrosstab)
print(statistic)
print(np.round(prob, decimals=2))

In the above code correspondenceMatrix holds normalised frequencies. The \chi^2 statistic is 448.50, which is very unlikely to be observed under the null hypothesis (that the letter frequencies follow the same distribution). Having established this, we can continue with the CA as we now know that it should be able to show us some meaningful associations.

For the purposes of CA, the differences between the distributions of letters in the text samples are measured by \chi^2 -distances, which are weighted Euclidean distances between normalized rows. These are calculated by dividing row entries by their respective row totals. The weights are inversely proportional to the square roots of the column totals. \chi^2 -distances between row i and row k are defined as:

\chi^2_{distance_{ik}} = \sqrt{\sum_{J}\frac{(p_{ij}/p_{i.} - p_{kj}/p_{k.})^2}{p_{.j}}} (2)

# pre-calculate normalised rows
norm_correspondenceMatrix = np.divide(correspondenceMatrix,rowTotals[:, None])

chiSquaredDistances = np.zeros((correspondenceMatrix.shape[0],correspondenceMatrix.shape[0]))

norm_columnTotals = np.sum(norm_correspondenceMatrix, axis=0)
for row in range(correspondenceMatrix.shape[0]):
    chiSquaredDistances[row]=np.sqrt(np.sum(np.square(norm_correspondenceMatrix
                                                        -norm_correspondenceMatrix[row])/columnTotals, axis=1))
# Save distances to the DataFrame
dfchiSquaredDistances = pd.DataFrame(data=np.round(chiSquaredDistances*100).astype(int), columns=authorSamples)

print(dfchiSquaredDistances)

In (2) I switched to notation with p_{ij} , which is simply every entry in correspondenceMatrix (i.e. letter frequencies normalised by the grand total). dfchiSquaredDistances contains:

Chi-Squared Distances In Graphical Form

CA provides a means of representing a table of \chi^2 -distances in a graphical form. This is where the similarity with the PCA analysis comes in. To calculate such representation we need to transform the distances to points in a Cartesian coordinate system. This is achieved by a singular value decomposition (SVD) of a matrix of standardised residuals:

\Omega = \frac{p_{ij}-\mu_{ij}}{\sqrt{\mu_{ij}}} (3)

standardizedResiduals = np.divide((correspondenceMatrix-independenceModel),np.sqrt(independenceModel))

u,s,vh = np.linalg.svd(standardizedResiduals, full_matrices=False)

We are after the row scores, which are coordinates of points in a high-dimensional space (14 dimensions in this case). These points are arranged so that the Euclidean distance between two points is equal to the \chi^2 -distance between the two rows to which they correspond. The row scores are defined as:

R = \delta_{r}\cdot U\cdot S (4)

where U and S are the left singular vectors matrix and singular values on the diagonal matrix from SVD. The \delta_{r} is diagonal matrix made of the reciprocals of the square roots of the row totals.

deltaR = np.diag(np.divide(1.0,np.sqrt(rowTotals)))

rowScores=np.dot(np.dot(deltaR,u),np.diag(s))

dfFirstTwoComponents = pd.DataFrame(data=[l[0:2] for l in rowScores], columns=['X', 'Y'], index=initials)

print(dfFirstTwoComponents)

Extracting the first two components gives us:

Plotting these as points:

The plot clearly shows letters associations by author. Mark Twain and Charles Darwin’s samples stand out as significantly different from the rest.

Source and Reference: [1] P.Yelland, An Introduction to Correspondence Analysis. The Mathematica Journal 12, 2010 Wolfram Media, Inc.

Approximate Bayesian Computation

Greetings, my blog readers!

This is my first post in 2018. In this post I will share with you a very simple way of performing inference using Approximate Bayesian Computation (ABC) – not to be confused with Approximate Bootstrap Confidence interval, which is also “ABC”.

Let’s say we have observed some data, and are interested to test if there was a change in behaviour in whatever generated the data. For example, we could be monitoring the total amount that is spent/transferred from some account, and we would like to see if there was a shift in how much is being spent/transferred. Figure below shows what the data could look like. After we have eye-balled the graph, we think that all observations after item 43 belong to the changed behaviour (cutoff=43), and we separate the two by colour.

The first question that we can ask is about the means of the blue and the red regions: are they the same? In the figure above I am showing the mean and standard deviations for the two sets. We can run a basic bootstrap with replacement to check if the difference in the means is possibly accidental.

In the figure above basic_bootstrap generates a distribution of means of randomly sampled sets. The confidence interval is first computed as non-parametric. But a quick comparison with 95th CI using normal standard scores shows that the simulated and the non-simulated confidence intervals around the means are very close. Most importantly, the confidence intervals for the blue and red region means overlap, and thus we would have to accept the null hypothesis that the population means are the same and differences seen here are accidental.

Note how unsatisfying this result is. If we use some other test, like one-way ANOVA from scipy.stats.f_oneway, we get a p-value that is too high to accept an alternative hypothesis. However, if we plot the CDFs of the blue and the red data, we can clearly see that larger values are prevailing in the latter:

Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) relates to probabilistic programming methods and allows us to quantify uncertainty more exactly than a simple CI. A pretty good summary of ABC can be found on Wikipedia. If we are monitoring transactions occurring over time, we may be interested in generating alerts when an amount is above a threshold (for example, your bank could have a monitoring system in place to safeguard you against credit card fraud). If, instead of comparing means of red and blue region, we decided to answer the question about how likely are we to see more trades above the threshold in the red vs. the blue data regions, we could use ABC.

To execute an ABC test on the difference in the number of trades above a threshold in the blue and red data regions, we begin by choosing the threshold! Take a look at the CDF plot above. We see that approximately half of red data is above 20. Whereas only 25% of blue data is above 20. Let’s set our threshold at 20. The ABC is a simple simulation algorithm where we repeatedly perform sample and compare steps. What can we sample here? We will sample from two normal distributions, each with the means set to the fraction of trades above our threshold. I will use Normal distribution, but it is purely a choice of convenience. Ok, what can we compare here? We will compare the number of trades that could have been above the threshold when the data they come from is sampled from the distributions we have chosen as our priors. And we store away the ones that are consistent with it. If we repeat this many times under the two parameterisations, we should build-up two distributions that can be used to answer the main question – how likely are we to obtain more trades above the chosen threshold in the red vs. the blue data sets. The code below does exactly that.

We obtain a very high probability of seeing more trades above the threshold in the red vs. the blue region.

Absolute vs. Proportional Returns

Greetings, my blog readers!

It will be a safe assumption to make that people who read my blogs work with data. In finance, the data is often in form of asset prices or other market indicators like implied volatility. Analyzing price data often requires calculating returns (aka. moves). Very often we work with proportional returns or log returns. Proportional returns are calculated relative to the price level. For example, given any two historical prices x_{t} and x_{t+h}, the proportional change is:

m_{t,prop} = \frac{x_{t+h}-x_{t}}{x_t}

The above can be shortened as m_{t, prop} = \frac{x_{t+h}}{x_t}-1. In contrast, absolute moves are defined simply as the difference between two historical price observations: m_{t,abs} = x_{t+h}-x_{t}.

How do you know which type of return is appropriate for your data? The answer depends on the price dynamic and the simulation/analysis task at hand. Historical simulation, often used in Value-at-Risk (VaR), requires calculating PnL strip from some sensitivity and a set of historical returns. For example, a VaR model for foreign exchange options may be specified to take into account PnL impact from changes in implied volatility skew. Here, the PnL is historically simulated using sensitivities of a volatility curve or surface and historical implied volatility returns for some surface parameter, like low risk reversal. You have a choice in how to calculate the volatility returns. The right choice can be determined with a simple regression.

Essentially, we need to look for evidence of dependency of price returns on price levels. In FX, liquid options on G21 currency pairs do not exhibit such dependency, while emerging market pairs do. I have not been able to locate a free source of implied FX volatility, but I have found two instruments that are good enough to demonstrate the concept. CBOE LOVOL Index is a low volatility index and can be downloaded for free from Quandl. For this example I took the close of day prices from 2012-2017. After plotting log_{10}(ABS(x_{t})) vs. log_{10}(ABS(m_{t,abs})) we look for the value of the slope of the fitted linear line. A slope closer to zero indicates no dependency, while a positive or negative slope shows that the two variables are dependent.

CBOE LVOL

In the absence of dependency, absolute returns can be used, while proportional return are otherwise more appropriate. Take a look at a plot of VXMT CBOE Mid-term Volatility Index. The fitted linear line has a slope of approximately 1.7. Historical simulation of VXMT is calling for proportional rather than absolute price moves.

CBOE VXMT