Tag Archives: code

Exploring NYC School Data

New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.

I was able to make use of a neat feature in pandas, vectorized string operations. Each school is identified by a “DBN”, a string indicating its district, borough and number. For example, “10X95” is district 10, Bronx, number 95. The following code extracts the district and borough information into separate columns of a DataFrame in a fast, nan-safe way.

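# expand=False returns a Series (newer pandas returns a DataFrame by default)
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+', expand=False)
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+', expand=False)
mergeddf['Borough'] = borough
mergeddf['District'] = district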
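
The two calls can also be folded into a single pattern with named capture groups, in which case str.extract returns one column per group. A minimal sketch, assuming a small made-up DataFrame standing in for mergeddf (the DBN values are illustrative, not taken from the city's data):

```python
import pandas as pd

# Hypothetical stand-in for mergeddf; None simulates a missing DBN.
schools = pd.DataFrame({'DBN': ['10X95', '02M475', None]})

# Each named group becomes a column in the result; rows that are missing
# or do not match the pattern come back as NaN instead of raising.
parts = schools['DBN'].str.extract(r'(?P<District>\d+)(?P<Borough>[A-Z])\d+')

# join aligns on the index, so the new columns line up with their rows.
schools = schools.join(parts)
print(schools)
```

Note that the extracted district is still a string; converting it to a number is a separate step.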

For more neat data and plotting tricks made possible by python, check out the full notebook.

Book review: Python for Data Analysis

Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond pandas and covers other essential python data tools such as IPython, NumPy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus, a diverse group of data sets is used in the examples, which makes for a more interesting read.

One useful function that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row-wise. I like apply because it’s an elegant way to perform arbitrary operations on each group of data, replacing cases where I might otherwise have used a loop like the one below:

result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result

There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.
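
To make the comparison concrete, here is a small sketch that runs the loop and apply side by side. The data and the score_range function are invented for illustration and do not come from the book or the pandas documentation:

```python
import pandas as pd

# Made-up data purely for illustration.
df = pd.DataFrame({
    'borough': ['X', 'X', 'M', 'M', 'K'],
    'score': [61.0, 75.0, 80.0, 92.0, 70.0],
})


def score_range(scores):
    """Spread between the best and worst score in one group."""
    return scores.max() - scores.min()


grouped = df.groupby('borough')['score']

# Explicit loop, collecting one result per group in a dict.
result_dict = {}
for group_name, group in grouped:
    result_dict[group_name] = score_range(group)

# apply does the same work and returns a Series indexed by group name.
result = grouped.apply(score_range)
print(result)
```

Besides being shorter, the apply version hands back a pandas object, so the result can go straight into further pandas operations.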

Book review: Building Machine Learning Systems with Python

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in python (and the numpy library) and is interested in using python machine learning libraries. I can’t recommend it as strongly for someone who is unfamiliar with python because the code in the book is often unpythonic (in my opinion), and the code available on the book’s website doesn’t match well with the code in the book and requires a fair bit of tweaking before you can actually run the examples.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing I think could improve the code in the book is better variable names. For example the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v
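
For comparison, here is the same loop with more descriptive names. This is a sketch rather than the book's code: it assumes topics holds one entry per document, each a list of (topic index, weight) pairs, and the sample values are invented.

```python
import numpy as np

NUM_TOPICS = 100  # matches the hard-coded 100 in the snippet above

# Invented sample: one list of (topic_index, weight) pairs per document.
topics = [
    [(0, 0.7), (42, 0.3)],
    [(7, 1.0)],
]

# Rows are documents, columns are topics; unlisted topics stay at zero.
topic_weights = np.zeros((len(topics), NUM_TOPICS), dtype=float)
for doc_index, doc_topics in enumerate(topics):
    for topic_index, weight in doc_topics:
        topic_weights[doc_index, topic_index] = weight
```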

There is also a clumsy use of a list comprehension to reshape a numpy array:

x = np.array([[v] for v in x])

Using the built-in reshape method is more memory efficient, easier to read, and about 1,000 times faster in the timing below:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop
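
Here are a few equivalent spellings of the same reshape, as a small sketch on a made-up array of the same length. Unlike the timed line above, none of them call .copy(), so they return views onto the original data instead of allocating a new array.

```python
import numpy as np

x = np.arange(506, dtype=float)  # stand-in for the 506-element array above

column_a = x.reshape((x.size, 1))  # explicit shape, as in the timing above
column_b = x.reshape(-1, 1)        # -1 lets NumPy infer the row count
column_c = x[:, np.newaxis]        # indexing form of the same operation

slow = np.array([[v] for v in x])  # the list-comprehension version
```

All four results hold the same values in the same shape; only the list comprehension builds a Python list of lists first.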

 

Python script to rename figures

I’m currently in the process of finishing up writing my thesis. I’ve found the simplest way to name the files for the figures in a paper is in the format figure 1.jpg, figure 2.jpg… That way all the authors know exactly which figure in the paper corresponds to which file. This works well in a paper where the number of figures is generally fixed, but if I decide to insert a figure between figures 1 and 2 in my thesis, every later figure’s number increases by one, and I don’t want to rename 20 files by hand. To fix this I wrote a python script to go through a folder and increment by 1 the number of every figure after a certain figure (i.e. turn figure 1.jpg, figure 2.jpg, figure 3.jpg… into figure 1.jpg, figure 3.jpg, figure 4.jpg…).

The code uses regular expressions to find the files that should be renamed, and figure out what number they had originally. To change figure 2 and above, the command line call would be:

$ python figure_rename.py 2

Here is the script:
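
What follows is a minimal sketch of that approach rather than the original script. It assumes files named like figure N.jpg sitting in a single folder, and the function name increment_figures is made up for this example. The one subtle point is renaming from the highest number down, so that figure 2.jpg never lands on top of an existing figure 3.jpg.

```python
import os
import re

# Matches names like "figure 12.jpg": group 1 is the number, group 2 the extension.
FIGURE_PATTERN = re.compile(r'^figure (\d+)(\.\w+)$')


def increment_figures(folder, first_number):
    """Add 1 to the number of every 'figure N.ext' file with N >= first_number."""
    to_rename = []
    for name in os.listdir(folder):
        match = FIGURE_PATTERN.match(name)
        if match and int(match.group(1)) >= first_number:
            to_rename.append((int(match.group(1)), match.group(2), name))

    # Highest number first, so a renamed file never overwrites its neighbour.
    for number, extension, name in sorted(to_rename, reverse=True):
        new_name = 'figure %d%s' % (number + 1, extension)
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
```

Calling increment_figures('.', 2) from the folder holding the figures would correspond to the command-line call shown above.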

Singular Value Decomposition

In the Statistical Computing class I took last fall, a matrix decomposition called the singular value decomposition came up briefly as a way to classify similar objects. I wanted to learn more about it but I couldn’t find a resource that was exactly what I wanted, so I decided to create one. I wrote up two IPython Notebooks about singular value decomposition: one is an introduction to the concept with an example application to classifying research articles, and the other deals with image compression. The notebooks are viewable at the links below. You can also download the code from these links:

Singular Value Decomposition and Applications

Singular Value Decomposition of an Image
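
As a taste of the image compression idea, here is a minimal NumPy sketch of a rank-k approximation. It is not taken from the notebooks: the random matrix stands in for a grayscale image, and the names rank_k_approximation and k are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((60, 80))  # random stand-in for a grayscale image

# Thin SVD: image == U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(image, full_matrices=False)


def rank_k_approximation(k):
    """Rebuild the matrix from its k largest singular values only."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


k = 10
compressed = rank_k_approximation(k)

# Storage needed: k columns of U, k singular values, k rows of Vt.
stored_numbers = k * (image.shape[0] + image.shape[1] + 1)

# The Frobenius-norm error equals the energy in the discarded singular values.
error = np.linalg.norm(image - compressed)
print(stored_numbers, image.size, error)
```

A random matrix compresses poorly because its singular values decay slowly. Real photographs usually have a few dominant singular values, which is why a small k can look reasonable.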

Machine Learning Class

I recently started taking another online class, Machine Learning. Like the database class I took previously (blog entry) it is from Stanford. Unlike the database class, it is offered through Coursera, a for-profit company co-founded by the course’s professor, Andrew Ng. The for-profit aspect really came into focus when I got an email offering “Free $25 Promotional Code for Machine Learning Tutoring!”

The first few weeks covered a lot of things I was already familiar with, but I’m still enjoying hearing the material from a different angle. Although overall the course feels a bit less polished than the Database class, I’m really excited to be taking it and I’m looking forward to the next week of lectures.

Online Introduction to Databases class

I recently finished a free online course from Stanford on Databases (link to course webpage). It’s taught by Professor Jennifer Widom, who gives excellent lectures throughout the course. The exercises are well thought out and fun to do, and the online submission and grading system is very intuitive. Another nice feature is the “progress” tab, which shows a bar graph of all the points you’ve earned on each individual assignment, as well as your current total of points for the course. The progress graph makes writing SQL and XML queries feel like playing a video game.

I really enjoyed the challenge of thinking of ways to build up different sets using nested SQL queries. Besides just learning SQL for its own sake, taking the class has helped me understand the conventions behind some operations in other environments, for example dataframe joins in pandas. I definitely feel much more capable of accessing and manipulating data after taking the class.
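
As an illustration of how the SQL conventions carry over, the sketch below runs a LEFT JOIN in an in-memory SQLite database and the matching merge in pandas. The tables and values are invented for this example and are not from the course.

```python
import sqlite3
import pandas as pd

# Invented rows for illustration.
student_rows = [(1, 'Ana'), (2, 'Ben'), (3, 'Cy')]
grade_rows = [(1, 90), (1, 85), (3, 70)]

# SQL version: LEFT JOIN keeps every student, matched or not.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE students (student_id INTEGER, name TEXT)')
conn.execute('CREATE TABLE grades (student_id INTEGER, grade INTEGER)')
conn.executemany('INSERT INTO students VALUES (?, ?)', student_rows)
conn.executemany('INSERT INTO grades VALUES (?, ?)', grade_rows)
sql_left = conn.execute(
    'SELECT s.student_id, s.name, g.grade '
    'FROM students s LEFT JOIN grades g ON s.student_id = g.student_id'
).fetchall()
conn.close()

# pandas version: how='left' follows the same convention,
# with NaN where SQL returns NULL.
students = pd.DataFrame(student_rows, columns=['student_id', 'name'])
grades = pd.DataFrame(grade_rows, columns=['student_id', 'grade'])
pandas_left = students.merge(grades, on='student_id', how='left')
pandas_inner = students.merge(grades, on='student_id', how='inner')
```

The student without grades survives the left join in both systems and disappears from the inner join, which is the convention the how argument names.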

I highly recommend the class. I imagine it will be offered again.

Python templates with Jinja2

When I initially wrote the car usage tracking web app I hard-coded the html into the python script that worked with the data. This is alright for prototyping a D3.js visualization, but it’s a real mess having the page layout and text intermingled with the python code. Luckily there are many python templating engines that make it easy to fill in variables in an html layout with data from a python script. In this case I used Jinja2.

Here is what part of the python script looked like before refactoring with Jinja2:

# print info from current fill up and table with all fillups
print("<p>Miles driven: %s" % (float(data[-1][-4]) - float(data[-2][-4])))
print("<p>Your mpg was: <b>%.2f</b>" % float(data[-1][-1]))

Jinja2 accepts variables from python in the form of a dictionary. Each key appears in the html template surrounded by double curly braces, i.e. {{ key }}, and is replaced with the value associated with that key. Jinja2 also allows simple logic such as for loops, which I used to produce the table of past data:

<table>
  <tr>
    {% for cell in header %}
      <td><b>{{ cell }}</b></td>
    {% endfor %}
  </tr>
  {% for row in fillups %}
    <tr>
      {% for cell in row %}
        <td>{{ cell }}</td>
      {% endfor %}
    </tr>
  {% endfor %} 
</table>

In the python script, the code to produce the template looks like this:

import jinja2

env = jinja2.Environment(
    loader=jinja2.FileSystemLoader(searchpath='templates/')
)
template = env.get_template('mpg.html')

display_dictionary = {}
# code that fills in display_dictionary with the values to send to the template

print(template.render(display_dictionary))
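
For trying this out without a templates/ folder, here is a self-contained sketch that keeps a condensed copy of the table template in a string and loads it with jinja2.DictLoader. The header and fill-up values are hypothetical placeholders, not data from the app.

```python
import jinja2

# Inline, condensed copy of the table template; in the app it lives in a file.
TABLE_TEMPLATE = """<table>
  <tr>{% for cell in header %}<td><b>{{ cell }}</b></td>{% endfor %}</tr>
  {% for row in fillups %}<tr>{% for cell in row %}<td>{{ cell }}</td>{% endfor %}</tr>
  {% endfor %}</table>"""

env = jinja2.Environment(loader=jinja2.DictLoader({'mpg.html': TABLE_TEMPLATE}))
template = env.get_template('mpg.html')

# Hypothetical values; the keys must match the names used in the template.
display_dictionary = {
    'header': ['date', 'odometer', 'mpg'],
    'fillups': [['2013-01-05', 12040, 31.2], ['2013-01-19', 12390, 29.8]],
}

output = template.render(display_dictionary)
print(output)
```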

It’s also possible to save the rendered template as a static html file:

output = template.render(display_dictionary)
with open('../index.html', 'w') as f:
    f.write(output)

Here display_dictionary is a dictionary containing variables like mpg, milesdriven, etc. The keys to this dictionary are in the template, awaiting replacement with the correct value. The code ends up being much cleaner this way: the data calculation is separated cleanly from the html that defines how it is displayed. Now that I’ve gotten familiar with Jinja2 I plan to use it from the beginning in future projects. I’ll probably further refactor the code to generate a static html file each time the data is updated, since creating the page each time it is visited is not necessary in this case.

All the code for the car usage tracking web app is up on github (link).

Car usage tracking

Since we bought a new car last year we’ve been keeping a detailed log of every trip to the gas station, including odometer reading, calculated mpg, and location. Over the weekend I wrote a simple web app to visualize the data and provide an interface to update the data. I’ve put a few images from the results below. The page itself is up here. The code is available on github (link).

Graphs of car fill-up data made with D3.js
The interface for adding new data.

I think it’s neat to visualize the effects of various road trips on the odometer reading; the two trips to Utah show up as very steep segments. Those parts of the graph remind me of the profile of many of the geologic features we went to Utah to see. Also of note is the prolonged steep section over the summer when I was commuting to an internship. Now if anyone asks what kind of gas mileage our car gets, I can just give them a link.

I used D3.js, a JavaScript library for “data driven documents”, especially useful for making plots in a browser. I hadn’t written any JavaScript before, so it was a great learning experience and I’m really happy with the results. My usual approach is to generate and save the plots in python using matplotlib, then serve those files. I’m excited to use D3.js to create other visualizations on the web.

I used python scripts to update the data and output the html table seen on the page.

Tips for scientists using computers

This week I gave a presentation to my research group about best practices in scientific computing. It can be hard to know what’s out there, so I thought it would be good to give a brief introduction to some tools as a starting point. My main objective was to show how easy version control is with git, and how it could improve the quality of our science. I also wanted to introduce the magic of interactive data analysis using IPython and the IPython Notebook, along with pandas and other scientific libraries. Historically our primary language has been MATLAB, along with Origin for generating figures, but I think these open source alternatives offer a lot of advantages. I’ve posted the files from the presentation here on github (look in the code directory for example IPython notebooks).