Category Archives: Data

Book review: How Linux Works

How Linux Works: What Every Superuser Should Know is an excellent book that gives a well-organized introduction to what’s going on behind the scenes of a Linux system (disclaimer). While doing data science I learned Linux piece by piece; after reading this book I was able to better understand how all the pieces fit together. I also learned a lot about subjects I hadn’t really touched, for example how the network stack translates information to and from physical bits on a wire.

Most data science happens on Linux or Unix-based systems. Just as knowing a lot about farming will help make you a better chef, knowing a lot about Linux will make you a better data scientist. Since I work at a small startup, my role often extends from data science to data engineering, to operations support for data science and engineering, to general operations support. Reading How Linux Works helped make me more effective in all of these roles.

Book review: Think Bayes

Think Bayes is a rare book that isn’t specifically about Python or a Python library but still contains excellent, well-organized Python code (disclaimer). Think Bayes covers using Bayesian statistics to calculate probabilities and better understand real problems, drawing from a range of real data sets. I definitely recommend that any scientist (and especially any data scientist) read this book; it’s available for free here.

Exporting static IPython Notebooks with style


The IPython Notebook is an incredible tool that I use almost daily. Notebooks can be exported to HTML for easy sharing using nbconvert; however, I’ve never been happy with the look of the exported notebooks. In the past I’ve used nbviewer to render the .ipynb files and display notebooks online, but I’d prefer to host the notebooks myself. I set out to make some changes so I could export HTML notebooks that looked good. The results can be seen in my Introduction to Singular Value Decomposition and NYC School Data Exploration notebooks. See below for how I did it.

Summary (gists):

Making an interactive histogram in D3.js

I recently worked on some updates to the MPG tracking page I set up in January. One of my goals was to make the graphs on the page respond to mouseover events by displaying more data. Here’s some sample code for a simplified version of the histogram that now appears on the MPG page; the data is below the code. The code for this example and the entire MPG project is available on GitHub. See also the interactive miles over time graph.

Elastic MapReduce tip

I’ve been working heavily with Amazon’s Elastic MapReduce (EMR) lately to run analysis jobs on Hadoop. During development, I often have to ssh into the master node of the cluster, and the constant copying/pasting of DNS names or job IDs was starting to get annoying. I wrote this function to automatically log me in to an ssh session with the most recently created active master node and put it in my .bashrc.

A bit of configuration of the elastic-mapreduce CLI is required (see here).

Book review: Doing Data Science: Straight Talk from the Frontline

The dramatically titled Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt (disclaimer) reads much like someone reporting back on notes they took at a conference, because that’s essentially what it is. The book largely consists of summaries of talks given as part of a data science class. I wish the distinction between the content of the talks and the authors’ insertions of background information or teaching suggestions were clearer. I don’t see this book being a reference work for me, but it was nice to read through once to learn about how different people set about solving specific data problems.

Exploring NYC School Data

New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.

I was able to make use of a neat feature in pandas: vectorized string operations. Each school is identified by a “DBN”, a string indicating its district, borough and number. For example, “10X95” is district 10, Bronx, #95. The following code extracts the district and borough information into separate columns of a DataFrame in a fast, NaN-safe way.

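# Capture the borough letter and the district digits from each DBN.
# expand=False returns a Series; rows with a missing DBN stay NaN.
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+', expand=False)
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+', expand=False)
mergeddf['Borough'] = borough
mergeddf['District'] = district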
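
To try the same pattern on a small scale, here is a self-contained sketch. The DataFrame `toy` and its DBN values (including the missing one) are made up for illustration and are not taken from the NYC data. It assumes a pandas version in which `str.extract` accepts the `expand` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical DBN values for illustration only; NaN stands in for a missing ID.
toy = pd.DataFrame({'DBN': ['10X95', '02M47', np.nan]})

# With a single capture group, expand=False makes str.extract return a Series.
toy['Borough'] = toy['DBN'].str.extract(r'\d+([A-Z])\d+', expand=False)
toy['District'] = toy['DBN'].str.extract(r'(\d+)[A-Z]\d+', expand=False)

print(toy)
```

The extracted districts are strings, so a value like “02” keeps its leading zero; convert the column separately if numeric districts are needed.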

For more neat data and plotting tricks made possible by Python, check out the full notebook.

Book review: Python for Data Analysis

Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond pandas and covers other essential Python data tools such as IPython, NumPy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus, a diverse group of data sets is used in the examples, which makes for a more interesting read.

One useful function that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row-wise. I like apply because it’s an elegant way to perform arbitrary operations on each group of data, replacing cases where I might otherwise have used a loop like the one below:

result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result

There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.
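
As a rough sketch of the difference, the example below runs the same per-group function both ways. The data, the column names and the `score_range` function are all invented for illustration; they do not come from the book or the pandas documentation.

```python
import pandas as pd

# Invented data for illustration only.
df = pd.DataFrame({
    'school': ['A', 'A', 'B', 'B', 'B'],
    'score': [70, 90, 60, 85, 80],
})


def score_range(scores):
    """Spread between the best and worst score in one group."""
    return scores.max() - scores.min()


# Group the score column by school.
grouped = df.groupby('school')['score']

# Loop version: collect one result per group by hand.
result_dict = {}
for group_name, group in grouped:
    result_dict[group_name] = score_range(group)

# apply version: pandas calls the function on each group and combines the results.
result = grouped.apply(score_range)

print(result_dict)
print(result)
```

Both versions compute the same values; apply returns them as a pandas Series indexed by group name, which is usually easier to carry into the next step than a dict.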