Monthly Archives: April 2014

Book review: Building Machine Learning Systems with Python

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in Python (and the NumPy library) and is interested in using Python machine learning libraries. I can’t recommend it as strongly to someone who is unfamiliar with Python, because the code in the book is often unpythonic (in my opinion), and the code available on the book’s website doesn’t match the code in the book well and requires a fair bit of tweaking before the examples will actually run.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing I think could improve the code in the book is better variable names. For example, the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v
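
For comparison, here is one possible rewrite with more descriptive names. It is only a sketch: it assumes `topics` holds, for each document, a list of (topic index, weight) pairs, which is what the loop structure implies. The names `to_dense`, `doc_topics`, `NUM_TOPICS`, and the toy input are mine, not the book's.

```python
import numpy as np

NUM_TOPICS = 100  # assumption: matches the hard-coded 100 in the snippet above


def to_dense(doc_topics, num_topics=NUM_TOPICS):
    """Convert sparse per-document (topic_index, weight) pairs to a dense array."""
    dense = np.zeros((len(doc_topics), num_topics), dtype=float)
    for doc_index, topic_weights in enumerate(doc_topics):
        for topic_index, weight in topic_weights:
            dense[doc_index, topic_index] = weight
    return dense


# Hypothetical toy input: two documents with a few non-zero topic weights each.
example_topics = [[(0, 0.7), (3, 0.3)], [(1, 1.0)]]
dense = to_dense(example_topics, num_topics=4)
```

The logic is unchanged; only the names now say what each index and value means.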

There is also a clumsy use of a list comprehension to reshape a NumPy array:

x = np.array([[v] for v in x])

Using the built-in reshape method is more memory efficient, easier to read, and roughly 1,000 times faster:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop
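
As a quick sanity check that the two approaches in the timings above give the same result, here is a small sketch. The array is a stand-in I made up with the same length (506), not the data from the book.

```python
import numpy as np

x = np.arange(506, dtype=float)  # stand-in for the 506-element array above

column_slow = np.array([[v] for v in x])  # list comprehension from the book
column_fast = x.reshape(-1, 1)            # -1 lets NumPy infer the row count

same_shape = column_slow.shape == column_fast.shape
same_values = np.array_equal(column_slow, column_fast)
```

Note that reshape returns a view of the original data where it can, so the .copy() in the timing is only needed if you want an independent array.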


Python script to rename figures

I’m currently finishing up writing my thesis. I’ve found the simplest way to name the files for the figures in a paper is in the format figure 1.jpg, figure 2.jpg… That way all the authors know exactly which figure in the paper corresponds to which file. This works well in a paper where the number of figures is generally fixed, but if I decide to insert a figure between figures 1 and 2 in my thesis, every later figure’s number increases by one, and I don’t want to rename 20 files by hand. To fix this I wrote a Python script that goes through a folder and increments by 1 the number of every figure from a given figure onward (i.e. it turns figure 1.jpg, figure 2.jpg, figure 3.jpg… into figure 1.jpg, figure 3.jpg, figure 4.jpg…).

The code uses regular expressions to find the files that should be renamed and to work out what number each one had originally. To change figure 2 and above, the command line call would be:

$python 2
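
The renaming logic described above can be sketched roughly as follows. This is an illustration of the idea rather than the script itself: the function name `increment_figures`, the exact pattern, and the assumption that every file is named "figure N.ext" are mine.

```python
import os
import re

# Assumed naming scheme: "figure <number>.<extension>", e.g. "figure 2.jpg".
FIGURE_PATTERN = re.compile(r"^figure (\d+)(\.\w+)$")


def increment_figures(folder, first_number):
    """Add 1 to the number of every figure numbered first_number or higher."""
    matches = []
    for name in os.listdir(folder):
        match = FIGURE_PATTERN.match(name)
        if match and int(match.group(1)) >= first_number:
            matches.append((int(match.group(1)), match.group(2), name))
    # Rename the highest number first so no rename lands on a file
    # that still has to be moved.
    for number, extension, name in sorted(matches, reverse=True):
        new_name = "figure {}{}".format(number + 1, extension)
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
```

Calling increment_figures(".", 2) would bump figure 2 and above in the current folder. Working from the highest number down matters: renaming figure 2.jpg to figure 3.jpg first could clobber the existing figure 3.jpg.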

Here is the script: