Category Archives: Data

Book review: Building Machine Learning Systems with Python

April 22, 2014 | Books, Data | books, code, data, python | Frank

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in python (and the numpy library) and is interested in using python machine learning libraries. I can’t recommend it as strongly for someone who is unfamiliar with python, because the code in the book is often unpythonic (in my opinion). The code available on the book’s website also doesn’t match the code in the book well, and it requires a fair bit of tweaking before you can actually run the examples.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing I think could improve the code in the book is better variable names. For example, the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v
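With more descriptive names, the same loop might read something like the sketch below. It assumes that `topics` holds one list of (topic id, weight) pairs per document, which is the sparse format gensim produces. The sample data and the names `doc_topics`, `NUM_TOPICS` and `topic_weights` are my own, not the book's.

```python
import numpy as np

# Hypothetical stand-in for the book's `topics`: one list of
# (topic_id, weight) pairs per document (assumed sparse gensim-style output).
doc_topics = [
    [(0, 0.7), (3, 0.3)],
    [(1, 0.5), (2, 0.25), (99, 0.25)],
]
NUM_TOPICS = 100  # the hard-coded 100 in the snippet above

# Rows are documents, columns are topics; topics that are absent stay at zero.
topic_weights = np.zeros((len(doc_topics), NUM_TOPICS), dtype=float)
for doc_index, sparse_weights in enumerate(doc_topics):
    for topic_id, weight in sparse_weights:
        topic_weights[doc_index, topic_id] = weight
```

The logic is unchanged; only the names now say what the indices and values mean.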

There is also a blundering use of a list comprehension to reshape a numpy array:

x = np.array([[v] for v in x])

Using the built-in reshape method is more memory efficient, easier to read, and about 1,000 times faster:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop
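As a quick check that the two approaches really give the same result, here is a small sketch. The array is a made-up stand-in with the same length as the one timed above, and `reshape(-1, 1)` is just a shorthand that lets numpy work out the number of rows.

```python
import numpy as np

x = np.arange(506, dtype=float)  # stand-in 1-D array, same length as above

via_comprehension = np.array([[v] for v in x])
via_reshape = x.reshape(-1, 1)  # -1 lets numpy infer the number of rows

# Both give a (506, 1) column with identical values
print(via_comprehension.shape, via_reshape.shape)
print(np.array_equal(via_comprehension, via_reshape))

# For a contiguous array, reshape returns a view on the same memory,
# so nothing is copied unless .copy() is called explicitly
print(np.shares_memory(x, via_reshape))
```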

Singular Value Decomposition

March 20, 2014 | Data | code, data, python | Frank

In the Statistical Computing class I took last fall, a matrix decomposition called the singular value decomposition came up briefly as a way to classify similar objects. I wanted to learn more about it but I couldn’t find a resource that was exactly what I wanted, so I decided to create one. I wrote up two IPython Notebooks about singular value decomposition: one is an introduction to the concept with an example application to classifying research articles, and the other deals with image compression. The notebooks are viewable at the links below. You can also download the code from these links:

Singular Value Decomposition and Applications

Singular Value Decomposition of an Image
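For a taste of the image compression idea, here is a minimal sketch that is not taken from the notebooks. It keeps only the k largest singular values of a matrix, which gives the best rank-k approximation in the least-squares sense. The random matrix is a stand-in for a grayscale image, and the function name `low_rank_approximation` is my own.

```python
import numpy as np

def low_rank_approximation(matrix, k):
    """Rebuild `matrix` from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k columns of U by the singular values, then project back
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
image = rng.random((64, 48))  # random stand-in for a grayscale image

for k in (1, 8, 48):
    approx = low_rank_approximation(image, k)
    error = np.linalg.norm(image - approx) / np.linalg.norm(image)
    # A rank-k approximation needs k columns of U, k rows of Vt and k singular values
    stored = k * (image.shape[0] + image.shape[1] + 1)
    print("rank %2d: relative error %.3f, %d numbers stored" % (k, error, stored))
```

Real photographs compress far better than random noise, because most of their structure sits in the first few singular values.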

Machine Learning Class

March 9, 2014 | Data | code, data | Frank

I recently started taking another online class, Machine Learning. Like the database class I took previously (blog entry) it is from Stanford. Unlike the database class, it is offered through Coursera, a for-profit company co-founded by the course’s professor, Andrew Ng. The for-profit aspect really came into focus when I got an email offering “Free $25 Promotional Code for Machine Learning Tutoring!”

The first few weeks covered a lot of things I was already familiar with, but I’m still enjoying hearing the material from a different angle. Although overall the course feels a bit less polished than the Database class, I’m really excited to be taking it and I’m looking forward to the next week of lectures.

Online Introduction to Databases class

March 2, 2014 | Data | code, data | Frank

I recently finished a free online course from Stanford on Databases (link to course webpage). It’s taught by Professor Jennifer Widom, who gives excellent lectures throughout the course. The exercises are well thought out and fun to do, and the online submission and grading system is very intuitive. Another nice feature is the “progress” tab, which shows a bar graph of all the points you’ve earned on each individual assignment, as well as your current total of points for the course. The progress graph makes writing SQL and XML queries feel like playing a video game.

I really enjoyed the challenge of thinking of ways to build up different sets using nested SQL queries. Besides just learning SQL for its own sake, taking the class has helped me understand the conventions behind some operations in other environments, for example dataframe joins in pandas. I definitely feel much more powerful when it comes to accessing and manipulating data after taking the class.
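To make the pandas connection concrete, here is a small sketch of how `DataFrame.merge` mirrors SQL joins. The two tables and their column names are invented for illustration and have nothing to do with the course material.

```python
import pandas as pd

# Hypothetical tables, invented for this example
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
grades = pd.DataFrame({"student_id": [1, 1, 3], "grade": [90, 85, 70]})

# Roughly: SELECT * FROM students JOIN grades USING (student_id)
inner = students.merge(grades, on="student_id", how="inner")

# Roughly: SELECT * FROM students LEFT JOIN grades USING (student_id)
left = students.merge(grades, on="student_id", how="left")

print(inner)
print(left)
```

The inner join drops Bob, who has no grades, while the left join keeps him with a missing grade, just as the SQL versions would.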

I highly recommend the class. I imagine it will be offered again.

Car usage tracking

Since we bought a new car last year we’ve been keeping a detailed log of every trip to the gas station, including odometer reading, calculated mpg, and location. Over the weekend I wrote a simple web app to visualize the data and provide an interface to update the data. I’ve put a few images from the results below. The page itself is up here. The code is available on github (link).

Graphs of car fill-up data made with D3.js

I think it’s neat to visualize the effects of various road trips on the odometer reading, with the two trips to Utah showing up as very steep sections. Those parts of the graph remind me of the profile of many of the geologic features we went to Utah to see. Also of note is the prolonged steep section over the summer when I was commuting to an internship. Now if anyone asks what kind of gas mileage our car gets, I can just give them a link.

I used D3.js, a JavaScript library for “data driven documents”, especially useful for making plots in a browser. I hadn’t written any JavaScript before, so it was a great learning experience and I’m really happy with the results. My usual approach is to generate and save the plots in python using matplotlib, then serve those files. I’m excited to use D3.js to create other visualizations on the web.

I used python scripts to update the data and output the html table seen on the page.
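The scripts on github are the real thing. As a rough idea of what such a script involves, here is a minimal sketch that computes mpg between consecutive fill-ups and renders an html table. The log entries, field order and function names are hypothetical, and it assumes the tank is filled completely every time.

```python
from html import escape

# Hypothetical log entries: (date, odometer in miles, gallons added, location)
fill_ups = [
    ("2013-06-01", 10250.0, 9.8, "Station A"),
    ("2013-06-09", 10560.0, 10.0, "Station B"),
    ("2013-06-10", 10910.0, 10.0, "Station C"),
]

def miles_per_gallon(entries):
    """mpg for every fill-up after the first: miles since the last fill-up / gallons added."""
    return [(current[1] - previous[1]) / current[2]
            for previous, current in zip(entries, entries[1:])]

def html_table(entries, mpgs):
    """Render the fill-ups that have an mpg value as a simple html table."""
    rows = ["<tr><th>Date</th><th>Odometer</th><th>Gallons</th>"
            "<th>MPG</th><th>Location</th></tr>"]
    for (date, odometer, gallons, location), mpg in zip(entries[1:], mpgs):
        rows.append(
            "<tr><td>{}</td><td>{:.0f}</td><td>{:.1f}</td><td>{:.1f}</td><td>{}</td></tr>"
            .format(escape(date), odometer, gallons, mpg, escape(location)))
    return "<table>\n" + "\n".join(rows) + "\n</table>"

mpgs = miles_per_gallon(fill_ups)
table = html_table(fill_ups, mpgs)
print(table)
```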

Again, here are the links:

Live page: https://www.frankcleary.com/mpg/
Code: https://github.com/frankcleary/mpg/

Tips for scientists using computers

This week I gave a presentation to my research group about best practices in scientific computing. It can be hard to know what’s out there, so I thought it would be good to give a brief introduction to some tools as a starting point. My main objective was to show how easy version control is with git, and how it could improve the quality of our science. I also wanted to introduce the magic of interactive data analysis using IPython and the IPython Notebook, along with pandas and other scientific libraries. Historically our primary language has been MATLAB, along with Origin for generating figures, but I think these open source alternatives offer a lot of advantages. I’ve posted the files from the presentation here on github (look in the code directory for example IPython notebooks).

Parallel processing in R

In R a lot of operations will take advantage of parallel processing, provided the BLAS being used supports it. As in python, it’s also possible to manually split calculations up between multiple cores. The code below uses the doParallel package and a foreach loop combined with %dopar% to split up a matrix multiplication task. If I limit the BLAS to using one thread, this will speed up the calculation. This sort of construction could also be used to run independent simulations.

I think this is a lot more intuitive than the parallel python implementation (see post). I’m curious to look at other parallel processing python modules to see how they compare.

library(parallel)
library(doParallel)
library(foreach)
library(iterators)

# Multiply two matrices; each worker calls this on its own block of columns
parFun <- function(A, B){
  A %*% B
}

nCores <- 2
n <- 4000 # for this toy code, make sure n is divisible by nCores
a <- rnorm(n*n)
A <- matrix(a, ncol=n)

registerDoParallel(nCores)
systime <- system.time(
  # Worker i multiplies A by its block of columns of A; cbind reassembles A %*% A
  result <- foreach(i = 1:nCores, .combine = cbind) %dopar% {
    cols <- ((i - 1)*n/nCores + 1):(i*n/nCores)
    parFun(A, A[, cols])
  }
)
stopImplicitCluster() # release the workers started by registerDoParallel

print(paste("Cores =", nCores))
print(systime)

Parallel processing in python

I’m all about efficiency, and I’ve been excited to learn more about parallel processing. I really like seeing my cpu using all four cores at 100%. I paid for all those cores, and I want them computing!

If I have some calculations running in python it’s mildly annoying to check on task manager (in Windows: ctrl-shift-esc, click on the “performance” tab) and see my cpu usage at 25%. Only one of the four cores is being used. The reasons for this are somewhat complex, but this is python, so there are a number of modules that make it easy to parallelize operations (post on similar code in R).

Do I have one hard working core, and three lazy ones?

To start learning how to parallelize calculations in python I used parallel python. To test the speedup gained by using parallel processing I wrote an inefficient matrix multiplication function using for loops (code below). Using numpy’s matrix class would be much better here, but numpy operations can do some parallelization on their own, which would muddy the comparison. Generating four 400 x 400 matrices and multiplying each by itself took 76.2 s when run serially and 21.8 s when distributed across the four cores, a speedup of about 3.5 times, close to the ideal four-fold. Also, I see I’m getting the full value from the four cores I paid for:

Using all the cpus to speed up the calculation.

This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.

import pp
import time
import random

def randprod(n):
    """Waste time by inefficiently multiplying an n x n matrix"""
    random.seed(0)
    n = int(n)
    # Generate an n x n matrix of random numbers
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    Y = [[None]*n for _ in range(n)]  # independent rows; [[None]*n]*n would alias one row n times
    for i in range(n):
        for j in range(n):
            Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
    return Y

if __name__ == "__main__":
    test_n = [400]*4
    # Do 4 serial calculations:
    start_time = time.time()
    series_sums = [randprod(n) for n in test_n]
    elapsed = (time.time() - start_time)
    print("Time for serial processing: %.1f s" % elapsed)

    time.sleep(10) # provide a break in windows CPU usage graph

    # Do the same 4 calculations in parallel, using parallel python
    par_start_time = time.time()
    n_cpus = 4
    job_server = pp.Server(n_cpus)
    # submit(function, args, dependent functions, modules to import in the worker)
    jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
    par_sums = [job() for job in jobs]  # calling a job blocks until its result is ready
    par_elapsed = (time.time() - par_start_time)
    print("Time for parallel processing: %.1f s" % par_elapsed)

    # The inputs are seeded identically, so both approaches should agree
    print(par_sums == series_sums)
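For comparison, the same experiment can be sketched with only the standard library, using `concurrent.futures.ProcessPoolExecutor` from Python 3 instead of parallel python. This is an alternative I am swapping in, not the code that produced the timings above. The function names are my own, and the matrices are kept small so the sketch finishes quickly.

```python
import random
import time
from concurrent.futures import ProcessPoolExecutor

def matrix_square(n):
    """Multiply a seeded random n x n matrix by itself using plain Python loops."""
    rng = random.Random(0)
    X = [[rng.random() for _ in range(n)] for _ in range(n)]
    return [[sum(X[i][k] * X[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial one."""
    return serial_seconds / parallel_seconds

def run_parallel(sizes, workers):
    """Compute matrix_square for each size in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(matrix_square, sizes))

if __name__ == "__main__":
    sizes = [60] * 4  # much smaller than 400 so this runs in a moment

    start = time.perf_counter()
    serial = [matrix_square(n) for n in sizes]
    serial_seconds = time.perf_counter() - start

    start = time.perf_counter()
    parallel = run_parallel(sizes, workers=4)
    parallel_seconds = time.perf_counter() - start

    print("Same results:", serial == parallel)
    print("Speedup: %.1fx" % speedup(serial_seconds, parallel_seconds))
```

With matrices this small, the cost of starting the worker processes can outweigh the gain, so the measured speedup may be below one. Raising the size toward 400 should make the benefit visible. Plugging in the timings from above, `speedup(76.2, 21.8)` comes out at about 3.5.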

Yelp Wordmap

In L.A. I found myself in a situation I’ve encountered before: trying to find a place for dinner in a large and unfamiliar city. I want to experience the city, but preferably avoid the dodgy parts. It’s difficult to narrow down the options in such a case. Luckily I’d seen a post on the Yelp engineering blog about their neat wordmap page, so I decided to turn there. Since it was approaching sunset I chose to look for places where people commented on the views, and found a few patches of density along the coast, about 20 minutes from our hotel:

We headed down there and drove around until the sun set (the views were great!). Then we ended up eating at Fish Camp (which of course we vetted on Yelp first), and it was good.