The singular value decomposition of a matrix has many applications. Here I'll focus on an introduction to singular value decomposition and an application in clustering articles by topic. In another notebook (link) I show how singular value decomposition can be used in image compression.

Any matrix \(A\) can be decomposed to three matrices \(U\), \(\Sigma\), and \(V\) such that \(A = U \Sigma V\), this is called singular value decomposition. The columns of \(U\) and \(V\) are orthonormal and \(\Sigma\) is diagonal. Most scientific computing packages have a function to compute the singular value decomposition, I won't go into the details of how to find \(U\), \(\Sigma\) and \(V\) here. Some sources write the decomposition as \(A = U \Sigma V^T\), so that their \(V^T\) is our \(V\). The usage in this notebook is consistent with how numpy's singular value decomposition function returns \(V\).

If \(A = \begin{bmatrix} 1 & 0 \\ 1 & 2 \end{bmatrix}\)

\(A\) can be written as \(U \Sigma V\) where \(U\), \(\Sigma\), and \(V\) are, rounded to 2 decimal places:

\(U = \begin{bmatrix} -0.23 & -0.97 \\ -0.97 & 0.23 \end{bmatrix}\)

\(S = \begin{bmatrix} 2.29 & 0 \\ 0 & 0.87 \end{bmatrix}\)

\(V = \begin{bmatrix} -0.53 & -0.85 \\ -0.85 & 0.53 \end{bmatrix}\)

Although the singular value decomposition has interesting properties from a linear algebra standpoint, I'm going to focus here on some of its applications and skip the derivation and geometric interpretations.

Let \(A\) be a \(m \times n\) matrix with column vectors \(\vec{a}_1, \vec{a}_2, ..., \vec{a}_n\). In the singular value decomposition of \(A\), \(U\) will be \(m \times m\), \(\Sigma\) will be \(m \times n\) and \(V\) will be \(n \times n\). We denote the column vectors of \(U\) as \(\vec{u}_1, \vec{u}_2, ..., \vec{u}_m\) and \(V\) as \(\vec{v}_1, \vec{v}_2, ..., \vec{v}_n\), similarly to \(A\). We'll call the values along the diagonal of \(\Sigma\) as \(\sigma_1, \sigma_2, ...\).

We have that \(A = U \Sigma V\) where:

\(U = \begin{bmatrix} \\ \\ \\ \vec{u}_1 & \vec{u}_2 & \dots & \vec{u}_m \\ \\ \\ \end{bmatrix}\)

\(\Sigma = \begin{bmatrix} \sigma_1 & 0 & \dots \\ 0 & \sigma_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}\)

\(V = \begin{bmatrix} \\ \\ \\ \vec{v}_1 & \vec{v}_2 & \dots & \vec{v}_n \\ \\ \\ \end{bmatrix}\)

Because \(\Sigma\) is diagonal, the columns of \(A\) can be written as:

\(\vec{a}_i = \vec{u}_1 * \sigma_1 * V_{1,i} + \vec{u}_2 * \sigma_2 * V_{2,i} + ... = U * \Sigma * \vec{v}_i\)

This is equivalent to creating a vector \(\vec{w}_i\), where the elements of \(\vec{w}_i\) are the elements of \(\vec{v}_i\), weighted by the \(\sigma\)'s:

\(\vec{w}_i = \begin{bmatrix} \sigma_1V_{1,i} \\ \sigma_2V_{2,i} \\ \sigma_3V_{3,i} \\ \vdots \end{bmatrix} = \Sigma * \vec{v}_i\)

Then \(\vec{a}_i = U * \vec{w}_i\). That is to say that every column \(\vec{a}_i\) of \(A\) is expressed by a sum over all the columns of \(U\), weighted by the values in the \(i^{th}\) column of \(V\), and the \(\sigma\)'s. By convention the order of the columns in \(U\) and rows in \(V\) is chosen such that the values in \(\Sigma = \begin{bmatrix} \sigma_1 & 0 & \dots \\ 0 & \sigma_2 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}\) obey \(\sigma_1 > \sigma_2 > \sigma_3 > ...\). This means that as a whole, the first column of \(U\) and the first row of \(V\) contribute more to the final values of \(A\) than subsequent columns. This has applications in image compression (link to another notebook) and reducing the dimensionality of data by selecting the most import components.

This section isn't required to understand how singular value decomposition is useful, but I've included it for completeness.

If \(A\) is \(m \times n\) (\(m\) rows and \(n\) columns), \(U\) will be \(m \times m\), \(\Sigma\) will be \(m \times n\) and \(V\) will be \(n \times n\). However, there are only \(r = rank(A)\) non-zero values in \(\Sigma\), i.e. \(\sigma_1, ..., \sigma_r \neq 0\); \(\sigma_{r+1}, ..., \sigma_n = 0\). Therefore columns of \(U\) beyond the \(r^{th}\) column and rows of \(V\) beyond the \(r^{th}\) row do not contribute to \(A\) and are usually omitted, leaving \(U\) an \(m \times r\) matrix, \(\Sigma\) an \(r \times r\) diagonal matrix and \(V\) an \(r \times n\) matrix.

Singular value decomposition can be used to classify similar objects (for example, news articles on a particular topic). Note above that similar \(\vec{a_i}\)'s will have similar \(\vec{v_i}\)'s.

Imagine four blog posts, two about skiing and two about hockey. I've made up some data about five different words and the number of times they appear in each post:

In [1]:

```
import pandas as pd
c_names = ['post1', 'post2', 'post3', 'post4']
words = ['ice', 'snow', 'tahoe', 'goal', 'puck']
post_words = pd.DataFrame([[4, 4, 6, 2],
[6, 1, 0, 5],
[3, 0, 0, 5],
[0, 6, 5, 1],
[0, 4, 5, 0]],
index = words,
columns = c_names)
post_words.index.names = ['word:']
post_words
```

Out[1]:

It looks like posts 1 and 4 pertain to skiing, and while posts 2 and 3 are about hockey.

`post_words`

as the matrix \(A\), where the entries represent the number of times a given word appears in the post. The singular value decomposition of \(A\) can be calculated using numpy.

In [2]:

```
import numpy as np
U, sigma, V = np.linalg.svd(post_words)
print "V = "
print np.round(V, decimals=2)
```

In [3]:

```
V_df = pd.DataFrame(V, columns=c_names)
V_df
```

Out[3]:

In [4]:

```
sigma
```

Out[4]:

In [5]:

```
A_approx = np.matrix(U[:, :2]) * np.diag(sigma[:2]) * np.matrix(V[:2, :])
print "A calculated using only the first two components:\n"
print pd.DataFrame(A_approx, index=words, columns=c_names)
print "\nError from actual value:\n"
print post_words - A_approx
```

In [6]:

```
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(V, interpolation='none')
plt.xticks(xrange(len(c_names)))
plt.yticks(xrange(len(words)))
plt.ylim([len(words) - 1.5, -.5])
ax = plt.gca()
ax.set_xticklabels(c_names)
ax.set_yticklabels(xrange(1, len(words) + 1))
plt.title("$V$")
plt.colorbar();
```

Another thing the singular value decomposition tells us is what most defines the different categories of posts. The skiing posts have very different values from the hockey posts in the second row of \(V\), i.e. \(V_{2,1} \approx V_{2, 4}\) and \(V_{2,2} \approx V_{2, 3}\) but \(V_{2,1} \neq V_{2, 2}\).

Recall from above that:

\(\vec{a}_i = \vec{u}_1 * \sigma_1 * V_{1,i} + \vec{u}_2 * \sigma_2 * V_{2,i} + ...\)

Thus the posts differ very much in how much the values in \(\vec{u}_2\) contribute to their final word count. Here is \(\vec{u}_2\):

In [7]:

```
pd.DataFrame(U[:,1], index=words)
```

Out[7]:

Moving on from the simple example above, here is an application using singular value decomposition to find similar research papers.

I've collect several different papers for analysis. Unfortunately due to the sorry state of open access for scientific papers I can't share the full article text that was used for analysis. *Cell*, for example, cautions that **"you may not copy, display, distribute, modify, publish, reproduce, store, transmit, post, ..."** Yikes. However I did chose articles such that you should be able to download the pdf's from the publisher for free.

Single-Molecule Protein Unfolding and Translocation by an ATP-Fueled Proteolytic Machine (clpx2)

Insights into dynein motor domain function from a 3.3-A crystal structure (dyn-structure)

Biophysical mechanism of T-cell receptor triggering in a reconsistuted system (tcell)

In [8]:

```
with open('input/stopwords.txt') as f:
stopwords = f.read().strip().split(',')
stopwords = set(stopwords) # use a set for fast membership testing
```

In [9]:

```
import collections
import os
import re
def word_count(fname):
"""Return a collections.Counter instance counting
the words in file fname."""
with open(fname) as f:
file_content = f.read()
words = re.split(r'\W+', file_content.lower())
words = [word for word in words
if len(word) > 3 and word not in stopwords]
word_count = collections.Counter(words)
return word_count
file_list = ['input/papers/' + f for f in os.listdir('input/papers/')
if f.endswith('.txt')]
word_df = pd.DataFrame()
for fname in file_list:
word_counter = word_count(fname)
file_df = pd.DataFrame.from_dict(word_counter,
orient='index')
file_df.columns = [fname.replace('input/papers/', '').replace('.txt', '')]
# normalize word count by the total number of words in the file:
file_df.ix[:, 0] = file_df.values.flatten() / float(file_df.values.sum())
word_df = word_df.join(file_df, how='outer', )
word_df = word_df.fillna(0)
print "Number of unique words: %s" % len(word_df)
```

Here are the results, sorted by the most common words in the first paper:

In [10]:

```
word_df.sort(columns=word_df.columns[0], ascending=False).head(10)
```

Out[10]:

Now to calculate the singular value decomposition of this data.

In [11]:

```
U, sigma, V = np.linalg.svd(word_df)
```

Here is a look at \(V\), with the column names added:

In [12]:

```
v_df = pd.DataFrame(V, columns=word_df.columns)
v_df.apply(lambda x: np.round(x, decimals=2))
```

Out[12]:

Here are the values of \(V\) represented as an image:

In [13]:

```
plt.imshow(V, interpolation='none')
ax = plt.gca()
plt.xticks(xrange(len(v_df.columns.values)))
plt.yticks(xrange(len(v_df.index.values)))
plt.title("$V$")
ax.set_xticklabels(v_df.columns.values, rotation=90)
plt.colorbar();
```

In [14]:

```
def dist(col1, col2, sigma=sigma):
"""Return the norm of (col1 - col2), where the differences in
each dimension are wighted by the values in sigma."""
return np.linalg.norm(np.array(col1 - col2) * sigma)
dist_df = pd.DataFrame(index=v_df.columns, columns=v_df.columns)
for cname in v_df.columns:
dist_df[cname] = v_df.apply(lambda x: dist(v_df[cname].values, x.values))
plt.imshow(dist_df.values, interpolation='none')
ax = plt.gca()
plt.xticks(xrange(len(dist_df.columns.values)))
plt.yticks(xrange(len(dist_df.index.values)))
ax.set_xticklabels(dist_df.columns.values, rotation=90)
ax.set_yticklabels(dist_df.index.values)
plt.title("Similarity between papers\nLower value = more similar")
plt.colorbar()
dist_df
```

Out[14]:

In [15]:

```
levels = [0.06, 0.075]
binned_df = dist_df.copy()
binned_df[(dist_df <= levels[0]) & (dist_df > 0)] = 1
binned_df[(dist_df <= levels[1]) & (dist_df > levels[0])] = 2
binned_df[(dist_df < 1) & (dist_df > levels[1])] = 3
plt.imshow(binned_df.values, interpolation='none')
ax = plt.gca()
plt.xticks(xrange(len(binned_df.columns.values)))
plt.yticks(xrange(len(binned_df.index.values)))
ax.set_xticklabels(binned_df.columns.values, rotation=90)
ax.set_yticklabels(binned_df.index.values)
plt.title("Similarity between papers\nLower value = more similar")
plt.colorbar();
```

In [16]:

```
for paper in dist_df.columns:
sim_papers_df = dist_df.sort(columns=paper)[paper]
sim_papers = sim_papers_df.drop([paper]).index
print 'Papers most similar to ' + paper + ':'
print ', '.join(sim_papers)
print '\n'
```