New York City makes a large amount of data on its school system available for analysis. I'm especially interested in this data since I was an input into the process that generated it from 2006-2008. Here I've taken some of that data and looked at trends of test scores across geography and time.
The data are available here:%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
I save the data as a .csv file in the end because parsing the excel sheet is slow, and storing the .csv avoids repeating this step. This data is from a New York State Math test from 2006-2013, grades 3-8.
rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
'All Students',
skiprows=6,
na_values=['s'])
raw2013 = pd.read_excel('data/SchoolMathResults2013.xlsx',
'All Students',
skiprows=6,
na_values=['s'])
rawdf = pd.concat([rawdf, raw2013])
rawdf.to_csv('data/SchoolMathResults20062013Public.csv')
rawdf = pd.read_csv('data/SchoolMathResults20062013Public.csv')
rawdf.head()
Here I add information back into the column names that was lost when reading in the excel sheet:
level_number_strings = [str(i) for i in range(1,5)] + ['3&4']
new_cols = ['Level ' + str(s1) + ' ' + s2
for s1 in level_number_strings
for s2 in ['#', '%']]
rawdf.columns = np.append(rawdf.columns.values[:-10], new_cols)
rawdf.head()
Here I use the converters
argument to get the ending year of the school years, which appear in "2006-2007" format in this file. This makes the year column consistent with the format of the test results data. Also I use na_values
to properly encode the string "n/a" as a missing data point.
demographic_df = pd.read_excel(
'data/DemographicSnapshot2012Public.xlsx',
'School Demographics',
converters={'School Year': lambda x: int(x[-4:])},
na_values=['n/a'])
demographic_df = demographic_df.replace(r'^\s+$', np.nan, regex=True)
demographic_df.to_csv('data/DemographicSnapshot2012Public.csv')
demographic_df.head()
The combine_first
method replaces all the NaN values in the '% Free Lunch' column with the value from the '% Free and Reduced Price Lunch' column. It seems '% Free and Reduced Price Lunch' became the statistic of choice starting in 2010.
combined_lunch = demographic_df['% Free Lunch'].combine_first(
demographic_df['% Free and Reduced Price Lunch']
)
demographic_df['% Free Lunch'] = combined_lunch.astype(float)
demographic_df = demographic_df.drop(['% Free and Reduced Price Lunch'], axis=1)
demographic_df = demographic_df.rename(columns={'School Year': 'Year'})
demographic_df.head(5)
JOIN
operation):¶mergeddf = rawdf.merge(demographic_df, on=['DBN', 'Year'])
Pandas includes vectorized string methods that allow fast, NaN safe extraction of borough and district from the 'DBN' column using regular expressions.
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
mergeddf.head()
I find it useful to write plotting functions that take in the data in question as a DataFrame
, that way the data can be easily subsetted or transformed and then easily plotted while doing exploratory analysis. The plot()
method of DataFrame
is helpful for segmenting the data between columns, which otherwise would require repeated explicit matplotlib
calls.
It's a little unintuitive to make scatter plots with the pandas plot()
method, because kind='scatter'
doesn't work. The solution is to pass in the arguments linestyle='', marker='o'
, which gets rid of the line and adds a circle at each data point.
def scatter_plot(alldatadf,
x_axis_col='% Free Lunch',
y_axis_col='Mean Scale Score',
separate_plots_by='Year',
groups_on_plots='Grade',
drop=[]):
"""Make scatter plots of data in alldatadf.
alldatadf: pandas data frame.
x_axis_col: the column of alldatadf to use as the x-axis. Should
be the same for all values of the in the groups_on_plots column,
since this will become in the index of the data frame that is
eventually plotted.
y_axis_col: the column of alldatadf to use as the y-axis.
separate_plots_by: a column name of alldatadf, each unique value
in the column be plotted in a separate figure.
groups_on_plots: a column name of alldatadf, each unique value
will be plotted as a separate color.
drop: columns of alldatadf to drop.
"""
for grp_name, df in alldatadf.groupby(separate_plots_by):
pivot_table = df.pivot(index='DBN', columns=groups_on_plots)
pivot_table = pivot_table[[x_axis_col, y_axis_col]]
pivot_table.index = pivot_table[x_axis_col].iloc[:, 0]
pivot_table.index.name = x_axis_col
pivot_table = pivot_table[y_axis_col]
pivot_table = pivot_table.drop(drop, axis=1)
pivot_table = pivot_table[~np.isnan(pivot_table.index.values)]
pivot_table = pivot_table.sort_index()
ax = pivot_table.plot(linestyle='',
marker='o',
alpha=.5,
title=grp_name,
figsize=(10,6))
ax.set_ylabel(y_axis_col)
title_value = ax.get_title()
ax.set_title(separate_plots_by + ' = ' + title_value)
Here's the plotting function in action, with its default arguments. The data is pretty dense here and specific trends within each grade are hard to distinguish, however there does seem to be an overall negative correlation between the mean score and the % free lunch.
scatter_plot(mergeddf)