Category Archives: Code

Book Review: Effective Python

Effective Python (disclaimer) contains 59 mini-lessons on python best practices. This book is not for programming novices, but it is great for advancing your python skills or as a reference if you’re moving to python from another language.

I see Effective Python as a complement to Python for Data Analysis, the latter covering scientific and statistical uses where alternatives include MATLAB or R, and the former covering general software problems where python takes the place of Java or Ruby. Of course those two fields are not mutually exclusive: after reading Effective Python I have a better understanding of what’s going on (or should be going on) behind the scenes of the python libraries I use.

Book review: The Linux Command Line

The Linux Command Line: A Complete Introduction lives up to its name and covers a wide range of command line topics (disclaimer). There are a lot of subtleties in how the bash shell behaves that are clearer to me after reading this book. I didn’t learn as much as I did from How Linux Works (review), but I wish I’d read The Linux Command Line when I first started working with linux systems.

The last part of the book deals with writing shell scripts and covers more about shell scripting than I ever wanted to know. Reading and writing nontrivial shell scripts is sometimes a reality, so overall it was good to pick up a few useful tips and tricks along with a generally increased understanding of bash programming.

Book review: How Linux Works

How Linux Works: What Every Superuser Should Know is an excellent book that gives a well organized introduction to what’s going on behind the scenes of a linux system (disclaimer). While doing data science I learned linux piece by piece; after reading this book I was able to better understand how all the pieces fit together. I also learned a lot about subjects I hadn’t really touched, for example how the network stack translates information to and from physical bits on a wire.

Most data science happens on linux or unix based systems. Just as knowing a lot about farming will help make you a better chef, knowing a lot about linux will make you a better data scientist. Since I work at a small startup my role often extends from data science to data engineering to operations support for data science and engineering to general operations support. Reading How Linux Works helped make me more effective at all of these roles.

Book review: Think Bayes

Think Bayes is a rare book that isn’t specifically about Python or a Python library but still contains excellent, well organized python code (disclaimer). Think Bayes covers using Bayesian statistics to calculate probabilities and better understand real problems, drawing on a range of real data sets. I definitely recommend that any scientist (and especially any data scientist) read this book. It’s available for free here.

LaTeX resume template

Last May, when I had a week between graduation and starting work, I spent some time creating a LaTeX version of my resume. I couldn’t find a template that suited my needs right out of the box, but I found a two column layout that came close. I had to make a fair number of changes to the template to be happy with the result, including more customization options for the personal info box and adding the option to put bulleted lists under each job. The underlying LaTeX template and example document are up on github, so you can clone the project and fill in your own details.

You can also edit the LaTeX directly at writelatex.com, a great website that provides online LaTeX editing and compilation. Here’s a link to the template as an interactive example.

Exporting static IPython Notebooks with style


The IPython Notebook is an incredible tool that I use almost daily. Notebooks can be exported to HTML for easy sharing using nbconvert, but I’ve never been happy with the look of the exported notebooks. In the past I’ve used the nbviewer to render the .ipynb files and display notebooks online, but I’d prefer to host the notebooks myself. I set out to make some changes so I could export HTML notebooks that looked good. The results can be seen in my Introduction to Singular Value Decomposition and NYC School Data Exploration notebooks. See below for how it works.

Summary (gists):


Python script to rename figures

I’m currently finishing up writing my thesis. I’ve found the simplest way to name the files for the figures in a paper is in the format figure 1.jpg, figure 2.jpg… That way all the authors know exactly which figure in the paper corresponds to which file. This works well in a paper where the number of figures is generally fixed, but if I decide to insert a figure between figures 1 and 2 in my thesis, every later figure’s number increases by one, and I don’t want to rename 20 files by hand. To fix this I wrote a python script to go through a folder and increment by 1 the number of every figure from a given figure onward (i.e. turn figure 1.jpg, figure 2.jpg, figure 3.jpg… into figure 1.jpg, figure 3.jpg, figure 4.jpg…).

The code uses regular expressions to find the files that should be renamed, and figure out what number they had originally. To change figure 2 and above, the command line call would be:

$ python figure_rename.py 2

Here is the script:
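
The original script is not reproduced in this archive, so what follows is only a minimal sketch of the approach described above, not the script itself. It assumes files are named exactly "figure N.ext" and that the script runs in the folder holding the figures; the names FIGURE_PATTERN, plan_renames and rename_figures are made up for illustration.

```python
import os
import re
import sys

# Assumed naming scheme from the post: "figure <number>.<extension>"
FIGURE_PATTERN = re.compile(r"^figure (\d+)(\.\w+)$")


def plan_renames(filenames, first_number):
    """Return (old, new) name pairs shifting figures >= first_number up by one."""
    matches = []
    for name in filenames:
        match = FIGURE_PATTERN.match(name)
        if match and int(match.group(1)) >= first_number:
            matches.append((int(match.group(1)), match.group(2), name))
    # Rename the highest number first so no file overwrites its neighbour
    matches.sort(reverse=True)
    return [(name, "figure %d%s" % (number + 1, ext))
            for number, ext, name in matches]


def rename_figures(folder, first_number):
    """Apply the planned renames to the files inside folder."""
    for old, new in plan_renames(os.listdir(folder), first_number):
        os.rename(os.path.join(folder, old), os.path.join(folder, new))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # e.g. python figure_rename.py 2
        rename_figures(os.getcwd(), int(sys.argv[1]))
    else:
        print("usage: python figure_rename.py FIRST_NUMBER")
```

Working from the highest number down matters: renaming figure 2.jpg to figure 3.jpg first would clobber the existing figure 3.jpg.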

Python templates with Jinja2

When I initially wrote the car usage tracking web app I hard-coded the html into the python script that worked with the data. This is alright for prototyping a D3.js visualization, but it’s a real mess having the page layout and text intermingled with the python code. Luckily there are many python templating engines that make it easy to fill in variables in an html layout with data from a python script. In this case I used Jinja2.

Here is what part of the python script looked like before refactoring using Jinja2.

# print info from current fill up and table with all fillups
print("<p>Miles driven: %s" % (float(data[-1][-4]) - float(data[-2][-4])))
print("<p>Your mpg was: <b>%.2f</b>" % float(data[-1][-1]))

Jinja2 allows for passing variables from python in the form of a dictionary, where the key in the dictionary is specified in the html template (surrounded by double curly braces i.e. {{ key }}), and will be replaced with the value associated with that key. It allows simple logic such as for loops, which I used to produce the table of past data:

<table>
  <tr>
    {% for cell in header %}
      <td><b>{{ cell }}</b></td>
    {% endfor %}
  </tr>
  {% for row in fillups %}
    <tr>
      {% for cell in row %}
        <td>{{ cell }}</td>
      {% endfor %}
    </tr>
  {% endfor %} 
</table>

In the python script, the code to produce the template looks like this:

import jinja2

env = jinja2.Environment(
    loader=jinja2.FileSystemLoader(searchpath='templates/')
)
template = env.get_template('mpg.html')

display_dictionary = {}
# code that fills in display_dictionary with the values to send to the template

print(template.render(display_dictionary))

It’s also possible to save the rendered template as a static html file:

output = template.render(display_dictionary)
with open('../index.html', 'w') as f:
    f.write(output)

Here display_dictionary is a dictionary containing variables like mpg, milesdriven, etc. The keys of this dictionary appear in the template, awaiting replacement with the correct value. The code ends up being much cleaner this way: the data calculation is separated cleanly from the html that defines how it is displayed. Now that I’ve gotten familiar with Jinja2 I plan to use it from the beginning in future projects. I’ll probably further refactor the code to generate a static html file each time the data is updated, since creating the page each time it is visited is not necessary in this case.

All the code for the car usage tracking web app is up on github (link).

Parallel processing in R

In R a lot of operations will take advantage of parallel processing, provided the BLAS being used supports it. As in python, it’s also possible to manually split calculations up between multiple cores. The code below uses the doParallel package and a foreach loop combined with %dopar% to split up a matrix multiplication task. If I limit the BLAS to one thread, splitting the work this way speeds up the calculation. This sort of construction could also be used to run independent simulations.

I think this is a lot more intuitive than the parallel python implementation (see post). I’m curious to look at other parallel processing python modules to see how they compare.

require(parallel)
require(doParallel)
library(foreach)
library(iterators)

parFun <- function(A, B){
  A%*%B
}

nCores <- 2
n <- 4000 # for this toy code, make sure n is divisible by nCores
a <- rnorm(n*n)
A <- matrix(a, ncol=n)

registerDoParallel(nCores)
systime <- system.time(
  result <- foreach(i = 1:nCores, .combine = cbind) %dopar% {
    # columns of A handled by worker i; cbind reassembles the full product
    cols <- ((i - 1)*n/nCores + 1):(i*n/nCores)
    parFun(A, A[, cols])
  }
)

print(paste("Cores = ", nCores))
print(systime)
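
To make the splitting in the foreach loop concrete, here is a small plain-python sketch of the same bookkeeping. It runs serially and only checks that multiplying A by each block of its columns and joining the results side by side gives the same answer as the full product. The helper names matmul, column_block and cbind are made up for this illustration, and the tiny matrix size is just to keep it quick.

```python
import random


def matmul(A, B):
    """Plain nested-loop product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def column_block(A, i, n_blocks):
    """Columns of A for 1-based block i, like ((i - 1)*n/nCores + 1):(i*n/nCores)."""
    n = len(A[0])
    start, stop = (i - 1) * n // n_blocks, i * n // n_blocks
    return [row[start:stop] for row in A]


def cbind(blocks):
    """Join matrices side by side, in the spirit of R's cbind."""
    return [sum(rows, []) for rows in zip(*blocks)]


random.seed(0)
n, n_blocks = 6, 2  # as in the R code, n should be divisible by n_blocks
A = [[random.random() for _ in range(n)] for _ in range(n)]

# Each block product is independent of the others, so each could go to its own core
blocks = [matmul(A, column_block(A, i, n_blocks)) for i in range(1, n_blocks + 1)]
assert cbind(blocks) == matmul(A, A)
```

Because no block depends on another, the only coordination needed is the final join, which is what .combine = cbind does in the R version.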

Parallel processing in python

I’m all about efficiency, and I’ve been excited to learn more about parallel processing. I really like seeing my cpu using all four cores at 100%. I paid for all those cores, and I want them computing!

If I have some calculations running in python it’s mildly annoying to check on task manager (in windows – ctrl-shift-esc, click on the “performance” tab), and see my cpu usage at 25%. Only one of the four cores is being used. The reasons for this are somewhat complex, but this is python so there are a number of modules that make it easy to parallelize operations (post on similar code in R).

Do I have one hard working core, and three lazy ones?

To start learning how to parallelize calculations in python I used parallel python. To test the speed up gained by using parallel processing I wrote an inefficient matrix multiplication function using for loops (code below). Using numpy’s matrix class would be much better here, but numpy operations can do some parallelization on their own, which would muddy the comparison. Generating four 400 x 400 matrices and multiplying them by themselves took 76.2 s when run serially, and 21.8 s when distributed across the four cores, a speedup of about 3.5x, close to the ideal four-fold. Also, I see I’m getting the full value from the four cores I paid for:

Using all the cpus to speed up the calculation.

This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.

import pp
import time
import random

def randprod(n):
    """Waste time by inefficiently multiplying an n x n matrix"""
    random.seed(0)
    n = int(n)
    # Generate an n x n matrix of random numbers
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    # Build each row separately; [[None]*n]*n would alias one row n times
    Y = [[None]*n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
    return Y

if __name__ == "__main__":
    test_n = [400]*4
    # Do 4 serial calculations:
    start_time = time.time()
    series_sums = [randprod(n) for n in test_n]
    elapsed = (time.time() - start_time)
    print("Time for serial processing: %.1f s" % elapsed)

    time.sleep(10) # provide a break in windows CPU usage graph    

    # Do the same 4 calculations in parallel, using parallel python
    par_start_time = time.time()
    n_cpus = 4
    job_server = pp.Server(n_cpus)
    jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
    par_sums = [job() for job in jobs]
    par_elapsed = (time.time() - par_start_time)
    print("Time for parallel processing: %.1f s" % par_elapsed)

    print(par_sums == series_sums)
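
For comparison, the same serial-versus-parallel pattern can be sketched with the standard library’s concurrent.futures module in place of parallel python. This is a different technique from the pp code above, not a port of it, and I haven’t timed it against that code. The toy simulate function (a seeded random walk) and the helper names run_serial and run_parallel are assumptions made for the illustration.

```python
import random
import time
from concurrent.futures import ProcessPoolExecutor


def simulate(seed, n_steps=100000):
    """Toy independent simulation: final position of a seeded 1-D random walk."""
    rng = random.Random(seed)
    position = 0
    for _ in range(n_steps):
        position += 1 if rng.random() < 0.5 else -1
    return position


def run_serial(seeds):
    """Run every simulation one after another in this process."""
    return [simulate(seed) for seed in seeds]


def run_parallel(seeds, n_workers=4):
    """Hand each seed to a pool of worker processes; results keep input order."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(simulate, seeds))


if __name__ == "__main__":
    seeds = list(range(8))

    start = time.time()
    serial = run_serial(seeds)
    print("Time for serial processing: %.1f s" % (time.time() - start))

    start = time.time()
    parallel = run_parallel(seeds)
    print("Time for parallel processing: %.1f s" % (time.time() - start))

    # Each run is seeded, so both approaches should give identical results
    print(parallel == serial)
```

Since each simulation depends only on its own seed, the workers never need to talk to each other, which is what makes this kind of job easy to spread across cores.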