Image Archives

Channel Islands National Park

We visited Channel Islands National Park last weekend. I’ve put up some pictures in the Channel Islands page. Overall it was not our favorite park, the hiking felt very similar to hiking the hot, dry, sun-scorched trails of Mt. Diablo, but with less shade. On the plus side we did get to see a blue whale and her calf. We may try kayaking if we go again.

Car usage tracking

Since we bought a new car last year we’ve been keeping a detailed log of every trip to the gas station, including odometer reading, calculated mpg, and location. Over the weekend I wrote a simple web app to visualize the data and provide an interface to update the data. I’ve put a few images from the results below. The page itself is up here. The code is available on github (link).

Graphs of car fill-up data made with D3.js
The interface for adding new data.

I think it’s neat to visualize the effects of various road trips on the odometer reading, the two trips to Utah being very steep. Those parts of the graph remind me of the profile of many of the geologic features we went to Utah to see. Also of note is the prolonged steep section over the summer when I was commuting to an internship. Now if anyone asks what kind of gas mileage our car gets, I can just give them a link.

I used D3.js, a JavaScript library for “data driven documents”, especially useful for making plots in a browser. I hadn’t written any JavaScript before, so it was a great learning experience and I’m really happy with the results. My usual approach is to generate and save the plots in python using matplotlib, then serve those files. I’m excited to use D3.js to create other visualizations on the web.

I used python scripts to update the data and output the html table seen on the page.

Again, here are the links:

Tips for scientists using computers

This week I gave a presentation to my research group about best practices in scientific computing. It can be hard to know what’s out there, so I thought it would be good to give a brief introduction to some tools as a starting point. My main objective was to show how easy version control is with git, and how it could improve the quality of our science. I also wanted to introduce the magic of interactive data analysis using IPython and the IPython Notebook, along with pandas and other scientific libraries. Historically our primary language has been MATLAB, along with Origin for generating figures, but I think these open source alternatives offer a lot of advantages. I’ve posted the files from the presentation here on github (look in the code directory for example IPython notebooks).

Parallel processing in R

In R a lot of operations will take advantage of parallel processing, provided the BLAS being used supports it. As in python it’s also possible to manually split calculations up between multiple cores. The code below uses the doParallel package and a foreach loop combined with %dopar% to split up a matrix multiplication task. If I limit the BLAS to using one thread, this will speed up the calculation. This sort of construction could also be used to run independent simulations.

I think this is a lot more intuitive than the parallel python implementation (see post). I’m curious to look at other parallel processing python modules to see how they compare.


parFun <- function(A, B){

nCores <- 2
n <- 4000 # for this toy code, make sure n is divisible by nCores
a <- rnorm(n*n)
A <- matrix(a, ncol=n)

systime <- system.time(
  result <- foreach(i = 1:nCores, .combine = cbind) %dopar% {
    rows <-  ((i - 1)*n/nCores + 1):(i*n/nCores)
    out <- parFun(A, A[, rows])

print(paste("Cores = ", nCores))

Parallel processing in python

I’m all about efficiency, and I’ve been excited to learn more about parallel processing. I really like seeing my cpu using all four cores at 100%. I paid for all those cores, and I want them computing!

If I have some calculations running in python it’s mildly annoying to check on task manager (in windows – ctrl-shift-esc, click on the “performance” tab), and see my cpu usage at 25%. Only one core of the four cores are being used. The reasons for this are somewhat complex, but this is python so there are a number of modules that make it easy to parallelize operations (post on similar code in R).

Do I have one hard working core, and three lazy ones?

To start learning how to parallelize calculations in python I used parallel python. To test the speed up gained by using parallel processing I wrote an inefficient matrix multiplication function using for loops (code below). Using numpy’s matrix class would be much better here, but numpy operations can do some parallelization on their own. Generating four 400 x 400 matrices and multiplying them by themselves took 76.2 s when run serially, and 21.8 s when distributed across the four cores, almost matching a full four-fold speedup. Also, I see I’m getting the full value from the four cores I paid for:

Using all the cpus to speed up the calculation.

This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.

import pp
import time
import random

def randprod(n):
    """Waste time by inefficiently multiplying an n x n matrix"""
    n = int(n)
    # Generate an n x n matrix of random numbers
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    Y = [[None]*n]*n
    for i in range(n):
        for j in range(n):
            Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
    return Y

if __name__ == "__main__":
    test_n = [400]*4
    # Do 4 serial calculations:
    start_time = time.time()
    series_sums = [randprod(n) for n in test_n]
    elapsed = (time.time() - start_time)
    print "Time for serial processing: %.1f" % elapsed + " s"

    time.sleep(10) # provide a break in windows CPU usage graph    

    # Do the same 4 calculations in parallel, using parallel python
    par_start_time = time.time()
    n_cpus = 4
    job_server = pp.Server(n_cpus)
    jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
    par_sums = [job() for job in jobs]
    par_elapsed = (time.time() - par_start_time)
    print "Time for parallel processing: %.1f" % par_elapsed + " s"

    print par_sums == series_sums