All posts by Frank

LaTeX resume template

Last May, when I had a week between graduation and starting work, I spent some time creating a LaTeX version of my resume. I couldn’t find a template that suited my needs right out of the box, but I found a two-column layout that came close. I had to make a fair number of changes to the template to be happy with the result, including more customization options for the personal info box and the option to put bulleted lists under each job. The underlying LaTeX template and example document are up on GitHub, so you can clone the project and fill in your own details.

You can also edit the LaTeX directly at writelatex.com, a great website that provides online LaTeX editing and compilation. Here’s a link to the template as an interactive example.

Exporting static IPython Notebooks with style


The IPython Notebook is an incredible tool that I use almost daily. Notebooks can be exported to HTML for easy sharing using nbconvert, but I’ve never been happy with the look of the exported notebooks. In the past I’ve used nbviewer to render the .ipynb files and display notebooks online, but I’d prefer to host the notebooks myself. I set out to make some changes so I could export HTML notebooks that looked good. The results can be seen in my Introduction to Singular Value Decomposition and NYC School Data Exploration notebooks. The steps are summarized below.

Summary (gists):

Making an interactive histogram in D3.js

I recently worked on some updates to the MPG tracking page I set up in January. One of my goals was to make the graphs on the page respond to mouseover events by displaying more data. Here’s some sample code for a simplified version of the histogram that now appears on the MPG page; the data is below the code. The code for this example and the entire MPG project is available on GitHub. See also the interactive miles over time graph.

Elastic MapReduce tip

I’ve been working heavily with Amazon’s Elastic MapReduce (EMR) lately to run analysis jobs on Hadoop. During development I often have to ssh into the master node of the cluster, and the constant copying and pasting of DNS names or job IDs was starting to get annoying. I wrote this function to automatically log me in to an ssh session with the most recently created active master node and put it in my .bashrc.

A bit of configuration of the elastic-mapreduce CLI is required (see here).
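
The same idea can be sketched in Python with boto3 instead of the elastic-mapreduce CLI. This is not the .bashrc function described above, only a rough equivalent under several assumptions: that boto3 is installed and AWS credentials are configured, that the login user is `hadoop`, and that the first page of `list_clusters` results contains the cluster of interest. The helper names `newest_cluster_id` and `ssh_to_newest_master` are made up for illustration.

```python
import subprocess

# Cluster states treated as "active" in this sketch (an assumption).
ACTIVE_STATES = ["RUNNING", "WAITING"]


def newest_cluster_id(clusters):
    """Return the Id of the most recently created cluster summary, or None."""
    if not clusters:
        return None
    newest = max(
        clusters,
        key=lambda c: c["Status"]["Timeline"]["CreationDateTime"],
    )
    return newest["Id"]


def ssh_to_newest_master(key_file):
    """Open an ssh session to the master node of the newest active cluster."""
    # Imported here so the helper above can be used without boto3 installed.
    import boto3

    emr = boto3.client("emr")
    # Only the first page of results is inspected in this sketch.
    clusters = emr.list_clusters(ClusterStates=ACTIVE_STATES)["Clusters"]
    cluster_id = newest_cluster_id(clusters)
    if cluster_id is None:
        raise SystemExit("No active clusters found")
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    dns_name = cluster["MasterPublicDnsName"]
    # "hadoop" as the login user is an assumption about the cluster setup.
    subprocess.call(["ssh", "-i", key_file, "hadoop@" + dns_name])
```

Calling `ssh_to_newest_master("/path/to/key.pem")` (a placeholder path) would then replace the manual lookup of the DNS name.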

Book review: Doing Data Science: Straight Talk from the Frontline

The dramatically titled Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt (disclaimer) reads much like someone reporting back on notes they took at a conference, because that’s essentially what it is. The book largely consists of summaries of talks given as part of a Data Science class. I wish the distinction between the content of the talks and the authors’ insertions of background information or teaching suggestions were clearer. I don’t see this book being a reference work for me, but it was nice to read through once to learn how different people set about solving specific data problems.

Exploring NYC School Data

New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.

I was able to make use of a neat feature in pandas: vectorized string operations. Each school is identified by a “DBN”, a string indicating its district, borough and number. For example, “10X95” is district 10, Bronx, number 95. The following code extracts the district and borough information into separate columns of a DataFrame in a fast, NaN-safe way.

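# expand=False returns a Series when the pattern has a single capture group
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+', expand=False)
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+', expand=False)
# Rows with a missing or non-matching DBN come back as NaN
mergeddf['Borough'] = borough
mergeddf['District'] = district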
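
As a self-contained illustration of the same technique, here is a minimal sketch on toy data. The DataFrame name `schools` and its DBN values are made up for this example, and it uses one pattern with two named groups rather than two separate calls, which is a variation and not the approach in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two well-formed DBNs, one missing value,
# and one string that does not match the pattern.
schools = pd.DataFrame({"DBN": ["10X95", "02M47", np.nan, "unknown"]})

# With named groups, str.extract returns a DataFrame with one column per
# group, named after the group. Missing and non-matching rows become NaN.
parts = schools["DBN"].str.extract(r"(?P<District>\d+)(?P<Borough>[A-Z])\d+")

# Attach the extracted columns next to the original DBN column.
schools = schools.join(parts)
print(schools)
```

The extracted district is still a string at this point, so it would need an explicit conversion before any numeric comparison.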

For more neat data and plotting tricks made possible by Python, check out the full notebook.

Channel Islands National Park

We visited Channel Islands National Park last weekend. I’ve put up some pictures on the Channel Islands page. Overall it was not our favorite park: the hiking felt very similar to hiking the hot, dry, sun-scorched trails of Mt. Diablo, but with less shade. On the plus side, we did get to see a blue whale and her calf. We may try kayaking if we go again.

Book review: Python for Data Analysis

Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond pandas and covers other essential Python data tools such as IPython, NumPy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus, a diverse group of data sets is used in the examples, which makes for a more interesting read.

One useful function that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row-wise. I like apply because it’s an elegant way to perform arbitrary operations on each group of data, replacing cases where I might otherwise have used a loop like the one below:

result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result
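
For comparison, here is a minimal sketch that runs the loop and the apply version side by side. The data, the column names and the `score_range` function are all made up for illustration.

```python
import pandas as pd

# Hypothetical data: a few scores for two schools.
df = pd.DataFrame({
    "school": ["A", "A", "B", "B", "B"],
    "score": [70, 90, 60, 80, 100],
})


def score_range(scores):
    """Return the spread between the highest and lowest value in a group."""
    return scores.max() - scores.min()


grouped = df.groupby("school")["score"]

# Loop version: collect one result per group in a dict.
result_dict = {}
for group_name, group in grouped:
    result_dict[group_name] = score_range(group)

# apply version: one call, results combined into a Series indexed by group.
result = grouped.apply(score_range)
print(result)
```

Both versions compute the same values. The apply version returns them as a pandas object indexed by group name, which is easier to feed into further pandas operations than a plain dict.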

There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.