Category Archives: Books

Book review: Effective Python

Effective Python (disclaimer) contains 59 mini-lessons on Python best practices. This book is not for programming novices, but it is great for advancing your Python skills or as a reference if you’re moving to Python from another language.

I see Effective Python as a complement to Python for Data Analysis: the latter covers scientific and statistical uses where alternatives include MATLAB or R, and the former covers general software problems where Python takes the place of Java or Ruby. Of course, those two fields are not mutually exclusive. After reading Effective Python I have a better understanding of what’s going on (or should be going on) behind the scenes of the Python libraries I use.

Book review: The Linux Command Line

The Linux Command Line: A Complete Introduction lives up to its name and covers a wide range of command line topics (disclaimer). There are a lot of subtleties in how the bash shell behaves that are clearer to me after reading this book. I didn’t learn as much as I did from How Linux Works (review), but I wish I’d read The Linux Command Line when I first started working with Linux systems.

The last part of the book deals with writing shell scripts and covers more about shell scripting than I ever wanted to know. Reading and writing nontrivial shell scripts is sometimes a reality, so overall it was good to pick up a few useful tips and tricks along with a generally increased understanding of bash programming.

Book review: How Linux Works

How Linux Works: What Every Superuser Should Know is an excellent book that gives a well-organized introduction to what’s going on behind the scenes of a Linux system (disclaimer). While doing data science I learned Linux piece by piece. After reading this book I was better able to understand how all the pieces fit together. I also learned a lot about subjects I hadn’t really touched, for example how the network stack translates information to and from physical bits on a wire.

Most data science happens on Linux- or Unix-based systems. Just as knowing a lot about farming will help make you a better chef, knowing a lot about Linux will make you a better data scientist. Since I work at a small startup, my role often extends from data science to data engineering to operations support for data science and engineering to general operations support. Reading How Linux Works helped make me more effective in all of these roles.

Book review: Black Hat Python

Black Hat Python: Python Programming for Hackers and Pentesters is worth a quick read to learn a bit about avenues of attack on networks and web services (disclaimer). The second half of the book is largely devoted to attacks on Windows machines, which is less useful to me professionally but still interesting.

As a data scientist I sometimes expose data through web interfaces and it’s important to understand how a malicious user might try to exploit a system to access its data or take down the service.

Book review: Think Bayes

Think Bayes is a rare book that isn’t specifically about Python or a Python library but still contains excellent, well-organized Python code (disclaimer). Think Bayes covers using Bayesian statistics to calculate probabilities and better understand real problems, drawing on a range of real data sets. I definitely recommend that any scientist (and especially any data scientist) read this book. It’s also available for free online.

Book review: Doing Data Science: Straight Talk from the Frontline

The dramatically titled Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt (disclaimer) reads much like someone reporting back on notes they took at a conference, because that’s essentially what it is. The book largely consists of summaries of talks given as part of a data science class. I wish the distinction between the content of the talks and the authors’ insertions of background information or teaching suggestions were clearer. I don’t see this book being a reference work for me, but it was nice to read through once to learn how different people set about solving specific data problems.

Book review: Python for Data Analysis

Python for Data Analysis (disclaimer) is written by Wes McKinney, the original author of the excellent pandas library. I highly recommend this book for anyone who interacts with data. The scope of the book goes well beyond pandas and covers other essential Python data tools such as IPython, NumPy and Matplotlib. Also included are recommendations and best practices for data workflows and interactive analysis. The examples in the book are well thought out and illustrate the point in question without unnecessary complication. As a bonus, a diverse group of data sets is used in the examples, which makes for a more interesting read.

One useful feature that I had previously overlooked is the apply method of pandas groupby objects. The apply method applies a function to each group in the groupby object, then glues the results together row-wise. I like apply because it’s an elegant way to perform arbitrary operations on each group of data, replacing cases where I might otherwise have used a loop like the one below:

result_dict = {}
for group_name, group in groupby_object:
    result = some_function(group)
    result_dict[group_name] = result

There are some useful examples of using apply in the pandas documentation. There are even more examples in the Python for Data Analysis book, including applying a regression model to the data in each group.
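
To make the comparison concrete, here is a minimal sketch that runs the loop and apply side by side. The toy DataFrame, its column names, and the value_range helper are all made up for illustration; they do not come from the book or the pandas documentation.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})


def value_range(values):
    """Spread between the largest and smallest value in one group."""
    return values.max() - values.min()


# Loop version: collect one result per group in a dict.
result_dict = {}
for group_name, values in df.groupby("group")["value"]:
    result_dict[group_name] = value_range(values)

# apply version: pandas calls the function once per group and
# combines the per-group results into a Series indexed by group name.
result = df.groupby("group")["value"].apply(value_range)
```

Both versions compute the same numbers. The apply version hands back a pandas object indexed by group name, so it can feed directly into further pandas operations.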

Book review: Building Machine Learning Systems with Python

I recently read the book Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho (disclaimer). Overall I think it is worth reading for someone who is already familiar with coding in Python (and the NumPy library) and is interested in using Python machine learning libraries. I can’t recommend it as strongly for someone who is unfamiliar with Python because the code in the book is often unpythonic (in my opinion), and the code available on the book’s website doesn’t match well with the code in the book and requires a fair bit of tweaking before you can actually run the examples.

I especially enjoyed exploring the gensim library, which is touched on in the book. I also liked the approach taken in the book of building and analyzing machine learning systems as an iterative process, exploring models and features to converge on a good solution.

One thing I think could improve the code in the book is better variable names. For example, the following code is needlessly cryptic:

dense = np.zeros( (len(topics), 100), float)
for ti,t in enumerate(topics):
    for tj,v in t:
        dense[ti,tj] = v
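
As a sketch of what more descriptive names might look like, here is the same loop rewritten. It assumes that topics holds one list of (topic index, weight) pairs per document, which is my reading of the snippet rather than something stated in it; the sample data and all of the new names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the book's `topics`: one list of
# (topic_index, weight) pairs per document.
topics = [[(0, 0.7), (3, 0.3)], [(1, 1.0)]]
num_topics = 100  # matches the hard-coded 100 in the original snippet

# One row per document, one column per topic; missing topics stay at 0.
topic_weights = np.zeros((len(topics), num_topics), dtype=float)
for doc_index, doc_topics in enumerate(topics):
    for topic_index, weight in doc_topics:
        topic_weights[doc_index, topic_index] = weight
```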

There is also a blundering use of a list comprehension to reshape a numpy array:

x = np.array([[v] for v in x])

Using the built-in reshape method is more memory efficient, easier to read, and about 1,000 times faster:

In [25]: x.shape
Out[25]: (506L,)

In [26]: %timeit y = np.array([[v] for v in x])
100 loops, best of 3: 4.07 ms per loop

In [27]: %timeit y = x.reshape((x.size, 1)).copy()
100000 loops, best of 3: 4.09 µs per loop
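
A quick check, using made-up values of the same length as the array above, suggests that the two approaches produce identical arrays. It also shows where the memory saving comes from: without the explicit copy, reshape can return a view onto the original data instead of building a new array element by element.

```python
import numpy as np

# Made-up data; only the length (506) mirrors the session above.
x = np.arange(506, dtype=float)

via_list = np.array([[v] for v in x])   # list comprehension from the book
via_reshape = x.reshape((x.size, 1))    # built-in reshape, no copy

same_values = np.array_equal(via_list, via_reshape)
shares_data = np.shares_memory(x, via_reshape)
```

Here x.reshape(-1, 1) and x[:, np.newaxis] are equivalent spellings of the same reshape.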