Since we bought a new car last year, we’ve been keeping a detailed log of every trip to the gas station, including the odometer reading, calculated mpg, and location. Over the weekend I wrote a simple web app to visualize the data and provide an interface for updating it. I’ve put a few images of the results below. The page itself is up here. The code is available on github (link).
I think it’s neat to visualize the effects of various road trips on the odometer reading, the two trips to Utah being very steep. Those parts of the graph remind me of the profile of many of the geologic features we went to Utah to see. Also of note is the prolonged steep section over the summer when I was commuting to an internship. Now if anyone asks what kind of gas mileage our car gets, I can just give them a link.
I used python scripts to update the data and output the html table seen on the page.
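The scripts themselves aren’t shown here, but the idea is simple enough to sketch. This is a hypothetical version (the CSV layout and column names are made up for illustration, not the actual log format): pandas can compute mpg from the log and emit the HTML table in a few lines.

```python
import io

import pandas as pd

# Stand-in for the real log file: one row per fill-up
# (hypothetical column names)
csv_data = io.StringIO(
    "date,odometer,gallons\n"
    "2012-06-01,10250,9.8\n"
    "2012-06-15,10590,10.2\n"
)
log = pd.read_csv(csv_data, parse_dates=["date"])

# mpg for each fill-up: miles driven since the last fill-up / gallons added
log["mpg"] = log["odometer"].diff() / log["gallons"]

# Emit an HTML fragment ready to drop into the page
html_table = log.to_html(index=False, na_rep="")
```

The first row has no previous odometer reading, so its mpg is left blank via `na_rep`.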
This week I gave a presentation to my research group about best practices in scientific computing. It can be hard to know what’s out there, so I thought it would be good to give a brief introduction to some tools as a starting point. My main objective was to show how easy version control is with git, and how it could improve the quality of our science. I also wanted to introduce the magic of interactive data analysis using IPython and the IPython Notebook, along with pandas and other scientific libraries. Historically our primary language has been MATLAB, along with Origin for generating figures, but I think these open source alternatives offer a lot of advantages. I’ve posted the files from the presentation here on github (look in the code directory for example IPython notebooks).
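For the git portion, the pitch was essentially that a useful workflow takes only a handful of commands. A generic sketch (the repository, file, and commit names here are made up, not the actual demo from the talk):

```shell
# Create a repository, track a script, and record one commit
git init demo-repo
echo "print('hello')" > demo-repo/analysis.py
git -C demo-repo add analysis.py
git -C demo-repo -c user.name="Demo" -c user.email="demo@example.com" \
    commit -m "Add analysis script"
# Show the history, one line per commit
git -C demo-repo log --oneline
```

From there, every edit to the analysis is a cheap, recoverable checkpoint, which is most of the case for version control in a research group.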
I’m all about efficiency, and I’ve been excited to learn more about parallel processing. I really like seeing my CPU using all four cores at 100%. I paid for all those cores, and I want them computing!
If I have some calculations running in python, it’s mildly annoying to check Task Manager (in Windows: Ctrl+Shift+Esc, then the “Performance” tab) and see my CPU usage sitting at 25%: only one of the four cores is being used. The reasons for this are somewhat complex (for pure-python code, the global interpreter lock keeps a single process from running on multiple cores at once), but this is python, so there are a number of modules that make it easy to parallelize operations (post on similar code in R).
To start learning how to parallelize calculations in python I used parallel python. To test the speedup gained by parallel processing I wrote an inefficient matrix multiplication function using for loops (code below). Using numpy’s matrix class would be much better here, but numpy operations can do some parallelization on their own. Generating four 400 x 400 matrices and multiplying each by itself took 76.2 s when run serially and 21.8 s when distributed across the four cores, a 3.5x speedup that comes close to the ideal four-fold. Also, I see I’m getting the full value from the four cores I paid for:
This trivial sort of parallelization is ideal for speeding up simulations, since each simulation is independent of the others.
"""Waste time by inefficiently multiplying an n x n matrix"""
n = int(n)
# Generate an n x n matrix of random numbers
X = [[random.random() for _ in range(n)] for _ in range(n)]
Y = [[None]*n]*n
for i in range(n):
for j in range(n):
Y[i][j] = sum([X[i][k]*X[k][j] for k in range(n)])
if __name__ == "__main__":
test_n = *4
# Do 4 serial calculations:
start_time = time.time()
series_sums = [randprod(n) for n in test_n]
elapsed = (time.time() - start_time)
print "Time for serial processing: %.1f" % elapsed + " s"
time.sleep(10) # provide a break in windows CPU usage graph
# Do the same 4 calculations in parallel, using parallel python
par_start_time = time.time()
n_cpus = 4
job_server = pp.Server(n_cpus)
jobs = [job_server.submit(randprod, (n,), (), ("random",)) for n in test_n]
par_sums = [job() for job in jobs]
par_elapsed = (time.time() - par_start_time)
print "Time for parallel processing: %.1f" % par_elapsed + " s"
print par_sums == series_sums