Tag Archives: python

Book Review: Effective Python

Effective Python (disclaimer) contains 59 mini-lessons on python best practices. This book is not for programming novices but is great for advancing your python skill or as a reference if you’re moving to python from another language.

I see Effective Python as a compliment to Python for Data Analysis, the latter covering scientific and statistical uses where alternatives include MATLAB or R, and the former covering general software problems where python takes the place of Java or Ruby. Of course those two fields are not mutually exclusive, after reading Effective Python I have a better understanding of what’s going on (or should be going on) behind the scenes of the python libraries I use.

Book review: Black Hat Python

Black Hat Python: Python Programming for Hackers and Pentesters is worth a quick read to learn a bit about avenues of attack on networks and web services (disclaimer). The second half of the book is largely devoted to attacks on windows machines which is less useful to me professionally but still interesting.

As a data scientist I sometimes expose data through web interfaces and it’s important to understand how a malicious user might try to exploit a system to access its data or take down the service.

Book review: Think Bayes

Think Bayes is a rare book that isn’t specifically about Python or a Python library but still contains excellent, well organized python code (disclaimer). Think Bayes covers using Bayesian Statistics to calculate probabilities and better understand real problems drawing from a range of real data sets. I definitely recommend that any scientist (and especially any data scientist) read this book, it’s available for free here.

Exporting static IPython Notebooks with style


The IPython Notebook is an incredible tool that I use almost daily. Notebooks can be exported to HTML for easy sharing using nbconvert, however I’ve never been happy with the look of the exported notebooks. In the past I’ve used the nbviewer to render the .ipynb files and display notebooks online, but I’d prefer to host the notebooks myself. I set out to make some changes so I could export HTML notebooks that looked good. The results can be seen in my Introduction to Singular Value Decomposition and NYC School Data Exploration notebooks. See below for how to.

Summary (gists):

View code

Exploring NYC School Data

New York City makes a large amount of data on its school system available for analysis. I recently took some time to explore some of the data in an IPython notebook (link to notebook). Eventually I’d like to do more detailed analysis including clustering and anomaly detection, but for now I had a great time getting a feel for the data.

I was able to make use of a neat feature in pandas, vectorized string operations. Each school is identified by a “DBN”, a string indicating its district, borough and number. For example “10X95” is district 10, Bronx, # 95. The following code extracts the district and borough information to separate columns in a DataFrame in a fast, nan-safe way.

borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district

For more neat data and plotting tricks made possible by python, check out the full notebook.

Book review: Python for Data Analysis

Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent Pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond Pandas and covers other essential python data tools such as IPython, Numpy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus a diverse group of data sets are used in the examples, which makes for a more interesting read.

One useful function that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row wise. I like apply because it’s an elegant way to do arbitrary operations to each group of data, replacing cases where I might otherwise have used a loop like the one below:

result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result

There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.

Book review: Building Machine Learning Systems with Python

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in python (and the numpy library) and is interested in using python machine learning libraries. I can’t recommended it as strongly for someone who is unfamiliar with python because the code in the book is often unpythonic (in my opinion), and the code available on the book’s website doesn’t match well with the code in the book and requires a fair bit of tweaking before you can actually run the examples.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing I think could improve the code in the book is better variable names. For example the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v

There is also a blundering use of a list comprehension to reshape a numpy array:

x = np.array([[v] for v in x])

Using the built in reshape method is more memory efficient, easier to read, and 1,000 times faster:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop

 

Python script to rename figures

I’m currently in the process of finishing up writing my thesis. I’ve found the simplest way to name the files for the figures in a paper is in the format figure 1.jpg, figure 2.jpg…  That way all the authors know exactly what figure in the paper corresponds to what file. This works well in a paper where the number of figures is generally fixed, but if I decide want to insert a figure between figures 1 and 2 in my thesis every figure’s number increases by one, and I don’t want to rename 20 files by hand. To fix this I wrote a python script to go through a folder and increment by 1 the number of every figure after a certain figure (i.e. turn figure 1.jpg, figure 2.jpg, figure 3.jpg… into figure 1.jpg, figure 3.jpg, figure 4.jpg…).

The code uses regular expressions to find the files that should be renamed, and figure out what number they had originally. To change figure 2 and above, the command line call would be:

$python figure_rename.py 2

Here is the script:

Singular Value Decomposition

In the Statistical Computing class I took last fall, a matrix decomposition called the singular value decomposition came up briefly as a way to classify similar objects. I wanted to learn more about it but I couldn’t find a resource that was exactly what I wanted, so I decided to create one. I wrote up two IPython Notebooks about singular value decomposition, one is an introduction to the concept and an example application to classifying research articles, the other deals with image compression. The notebooks are viewable at the links below. You can also download the code from these links:

Singular Value Decomposition and Applications

Singular Value Decomposition of an Image

Python templates with Jinja2

When I initially wrote the car usage tracking web app I hard coded the html into the python script that worked with the data. This is alright for prototyping a D3.js visualization, but it’s a real mess having the page layout and text intermingled with the python code. Luckily there exists a large number of python templating engines that make it easy to fill in variables in an html layout with data from a python script. In this case I used Jinja2.

Here is what part of the python script looked liked before refactoring using Jinja2.

# print info from current fill up and table with all fillups
print "<p>Miles driven: %s" % (float(data[-1][-4]) - float(data[-2][-4]))
print "<p>Your mpg was: <b>%.2f</b>" % float(data[-1][-1])

Jinja2 allows for passing variables from python in the form of a dictionary, where the key in the dictionary is specified in the html template (surrounded by double curly braces i.e. {{ key }}), and will be replaced with the value associated with that key. It allows simple logic such as for loops, which I used to produce the table of past data:

<table>
  <tr>
    {% for cell in header %}
      <td><b>{{ cell }}</b></td>
    {% endfor %}
  </tr>
  {% for row in fillups %}
    <tr>
      {% for cell in row %}
        <td>{{ cell }}</td>
      {% endfor %}
    </tr>
  {% endfor %} 
</table>

In the python script, the code to produce the template looks like this:

import jinja2

env = jinja2.Environment(loader=jinja2.FileSystemLoader(
 searchpath='templates/')
 )
template = env.get_template('mpg.html')

display_dictionary = {}
# code that fills in display_dictionary with the values to send to the template

print template.render(display_dictionary)

It’s also possible to save the rendered template as a static html file:

output = template.render(display_dictionary)
with open('../index.html', 'w') as f:
    f.write(output);

Where dispdict is a dictionary containing variables like mpg, milesdriven, etc. The keys to this dictionary are in the template, awaiting replacement the the correct value. The code ends up being much cleaner this way: the data calculation is separated cleanly from the html that defines how it is displayed. Now that I’ve gotten familiar with Jinja2 I plan to use it from the beginning in future projects. I’ll probably further refactor the code to generate a static html file each time the data is updated, since creating the page each time it is visited is not necessary in this case.

The all the code for the car usage tracking web app is up on github (link).