Exploring NYC school data

New York City makes a large amount of data on its school system available for analysis. I'm especially interested in this data since I was an input into the process that generated it from 2006-2008. Here I've taken some of that data and looked at trends of test scores across geography and time.

The data are available from the New York City Department of Education.

The standard imports:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Read the data into pandas:

I save the data as a .csv file at the end because parsing the Excel sheets is slow, and storing the .csv avoids repeating that step on later runs. This data is from the New York State Math test for 2006-2013, grades 3-8.
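
One way that .csv caching could be wired up (a sketch reusing the paths from the cells below; only the 2006-2012 file is shown here, while the next cell does the full parse including 2013):

import os.path
import pandas as pd

csv_path = 'data/SchoolMathResults20062013Public.csv'
if os.path.exists(csv_path):
    # the Excel files were already parsed on an earlier run; reuse the saved .csv
    rawdf = pd.read_csv(csv_path)
else:
    # slow path: parse the Excel sheet and cache the result as .csv
    rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
                          'All Students', skiprows=6, na_values=['s'])
    rawdf.to_csv(csv_path)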

In [2]:
rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
                      'All Students',
                       skiprows=6, 
                       na_values=['s'])
raw2013 = pd.read_excel('data/SchoolMathResults2013.xlsx',
                        'All Students',
                        skiprows=6,
                        na_values=['s'])
rawdf = pd.concat([rawdf, raw2013])
rawdf.to_csv('data/SchoolMathResults20062013Public.csv')

Read in the .csv data:

In [3]:
rawdf = pd.read_csv('data/SchoolMathResults20062013Public.csv')
rawdf.head()
Out[3]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score # % #.1 %.1 #.2 %.2 #.3 %.3 #.4 %.4
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Here I add back the information that was lost from the column names when reading in the Excel sheet:

In [4]:
# rebuild the performance-level column names: 'Level 1 #', 'Level 1 %', ..., 'Level 3&4 %'
level_number_strings = [str(i) for i in range(1, 5)] + ['3&4']
new_cols = ['Level ' + s1 + ' ' + s2
            for s1 in level_number_strings
            for s2 in ['#', '%']]
rawdf.columns = np.append(rawdf.columns.values[:-10], new_cols)
rawdf.head()
Out[4]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 % Level 3&4 # Level 3&4 %
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Read in school demographic data from the Excel sheet:

Here I use the converters argument to extract the ending year of each school year, which appears in "2006-2007" format in this file; this makes the Year column consistent with the format of the test results data. I also use na_values to properly encode the string "n/a" as a missing data point.
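
A quick check of what that converter does on a throwaway school-year string:

year_converter = lambda x: int(x[-4:])
year_converter('2006-2007')  # -> 2007, the ending year of the school year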

In [5]:
demographic_df = pd.read_excel(
                    'data/DemographicSnapshot2012Public.xlsx',
                    'School Demographics',
                    converters={'School Year': lambda x: int(x[-4:])},
                    na_values=['n/a'])
demographic_df = demographic_df.replace(r'^\s+$', np.nan, regex=True)
demographic_df.to_csv('data/DemographicSnapshot2012Public.csv')
demographic_df.head()
Out[5]:
DBN Name School Year % Free Lunch % Free and Reduced Price Lunch Total Enrollment Pre-K K Grade 1 Grade 2 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 NaN 281 15 36 40 33 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 NaN 243 15 29 39 38 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 NaN 261 18 43 39 36 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 NaN 252 17 37 44 32 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 NaN 96.5 208 16 40 28 32 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 38 columns

A bit of wrangling on the demographic data:

The combine_first method replaces all the NaN values in the '% Free Lunch' column with the value from the '% Free and Reduced Price Lunch' column. It seems '% Free and Reduced Price Lunch' became the statistic of choice starting in 2010.
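
Here is combine_first on a pair of toy Series (made-up numbers) to show which value wins:

import numpy as np
import pandas as pd

free = pd.Series([89.4, np.nan, np.nan])
free_and_reduced = pd.Series([np.nan, 96.5, 94.1])
free.combine_first(free_and_reduced)  # -> 89.4, 96.5, 94.1: NaNs are filled from the second Series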

In [6]:
combined_lunch = demographic_df['% Free Lunch'].combine_first(
                        demographic_df['% Free and Reduced Price Lunch']
                        )
demographic_df['% Free Lunch'] = combined_lunch.astype(float)
demographic_df = demographic_df.drop(['% Free and Reduced Price Lunch'], axis=1)
demographic_df = demographic_df.rename(columns={'School Year': 'Year'})
demographic_df.head(5)
Out[6]:
DBN Name Year % Free Lunch Total Enrollment Pre-K K Grade 1 Grade 2 Grade 3 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 281 15 36 40 33 38 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 243 15 29 39 38 34 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 261 18 43 39 36 38 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 252 17 37 44 32 34 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 96.5 208 16 40 28 32 30 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 37 columns

Combine the test results data with the demographic data (similar to a database JOIN operation):
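
A toy illustration of the merge (miniature made-up frames; merge defaults to an inner join, so only (DBN, Year) pairs present in both frames survive):

import pandas as pd

scores = pd.DataFrame({'DBN': ['01M015', '01M019'],
                       'Year': [2006, 2006],
                       'Mean Scale Score': [667, 650]})
demo = pd.DataFrame({'DBN': ['01M015'],
                     'Year': [2006],
                     '% Free Lunch': [89.4]})
scores.merge(demo, on=['DBN', 'Year'])  # only the 01M015 row has a match in both frames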

In [7]:
mergeddf = rawdf.merge(demographic_df, on=['DBN', 'Year'])

Extract a few new columns:

Pandas includes vectorized string methods that allow fast, NaN-safe extraction of the borough and district from the 'DBN' column using regular expressions.
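
A quick check of the two patterns on a single sample DBN (on newer pandas versions str.extract returns a one-column DataFrame by default, so expand=False is passed here to keep the result a Series):

import pandas as pd

sample = pd.Series(['01M015'])
sample.str.extract(r'\d+([A-Z])\d+', expand=False)  # 'M'  -- the borough letter
sample.str.extract(r'(\d+)[A-Z]\d+', expand=False)  # '01' -- the district number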

In [8]:
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
mergeddf.head()
Out[8]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # ... # Hispanic % Hispanic # White % White # Male % Male # Female % Female Borough District
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
1 7 01M015 4 2006 All Students 49 629 20 40.8 18 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
2 14 01M015 5 2006 All Students 31 630 12 38.7 9 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
3 21 01M015 6 2006 All Students 39 639 10 25.6 12 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
4 22 01M015 All Grades 2006 All Students 158 NaN 44 27.8 50 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01

5 rows × 54 columns

Define a function that makes some useful plots from the data:

I find it useful to write plotting functions that take the data in question as a DataFrame; that way the data can be subsetted or transformed and then plotted with a single call during exploratory analysis. The DataFrame plot() method is helpful for splitting the data up by column, which would otherwise require repeated explicit matplotlib calls.

It's a little unintuitive to make scatter plots with the pandas plot() method, because kind='scatter' doesn't work here. The workaround is to pass the arguments linestyle='' and marker='o', which removes the connecting line and draws a circle at each data point.
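
The idiom in isolation, on a throwaway DataFrame:

import pandas as pd

demo = pd.DataFrame({'a': [1, 3, 2], 'b': [2, 1, 3]}, index=[10, 20, 30])
demo.plot(linestyle='', marker='o')  # circles at each point, no connecting lines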

In [9]:
def scatter_plot(alldatadf,
                 x_axis_col='% Free Lunch',
                 y_axis_col='Mean Scale Score',
                 separate_plots_by='Year',
                 groups_on_plots='Grade',
                 drop=[]):
    """Make scatter plots of data in alldatadf.
    
    alldatadf: pandas data frame.
    x_axis_col: the column of alldatadf to use as the x-axis. Should
        be the same for every value in the groups_on_plots column,
        since it becomes the index of the data frame that is
        eventually plotted.
    y_axis_col: the column of alldatadf to use as the y-axis.
    separate_plots_by: a column name of alldatadf, each unique value
        in the column is plotted in a separate figure.
    groups_on_plots: a column name of alldatadf, each unique value
        will be plotted as a separate color.
    drop: columns of alldatadf to drop.
    """
    for grp_name, df in alldatadf.groupby(separate_plots_by):
        # one row per school (DBN), one sub-column per value of groups_on_plots
        pivot_table = df.pivot(index='DBN', columns=groups_on_plots)
        pivot_table = pivot_table[[x_axis_col, y_axis_col]]
        # use the x-axis values as the index, then keep only the y columns
        pivot_table.index = pivot_table[x_axis_col].iloc[:, 0]
        pivot_table.index.name = x_axis_col
        pivot_table = pivot_table[y_axis_col]
        pivot_table = pivot_table.drop(drop, axis=1)
        # drop rows with a missing x value and sort along the x-axis
        pivot_table = pivot_table[~np.isnan(pivot_table.index.values)]
        pivot_table = pivot_table.sort_index()
        ax = pivot_table.plot(linestyle='',
                              marker='o',
                              alpha=.5,
                              title=grp_name,
                              figsize=(10,6))
        ax.set_ylabel(y_axis_col)
        title_value = ax.get_title()
        ax.set_title(separate_plots_by + ' = ' + title_value)

Here's the plotting function in action, with its default arguments. The data is pretty dense here and specific trends within each grade are hard to distinguish; however, there does seem to be an overall negative correlation between the mean score and the % free lunch.

In [10]:
scatter_plot(mergeddf)

Calculating the correlation coefficient:

As with most things, pandas makes it easy to compute the correlation coefficient (calculated here over all mean scale score data points in the data set):

In [11]:
mergeddf[['% Free Lunch', 'Mean Scale Score']].corr()
Out[11]:
% Free Lunch Mean Scale Score
% Free Lunch 1.000000 -0.470529
Mean Scale Score -0.470529 1.000000

Now, instead of the mean score, the % of students reaching levels 3&4 (i.e. passing) is plotted on the y-axis, with a separate figure for each grade and the years distinguished by color. Although the plotted data is again very dense, some years do seem to have higher passing rates than others, which I investigate with a bar graph later in the analysis.

In [12]:
scatter_plot(mergeddf,
             groups_on_plots='Year',
             separate_plots_by='Grade',
             y_axis_col='Level 3&4 %')

Making bar graphs:

To get a view of summary statistics rather than the raw scatter of individual schools, I wrote a function to make bar graphs of a given slicing of the data. The function can also make line graphs for cases where there are so many bars that the graph becomes unreadable (for example, when each of the 32 districts gets its own bar). Unfortunately, the line graph runs into a common problem: the default matplotlib color cycle repeats after only 7 colors, so no more than 7 lines can be plotted before two of them look identical in the legend. I get around this by specifying a cycle of markers for the lines, as sketched below. Later in the analysis I also change the default color cycle itself to make the lines easier to tell apart.
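
The marker workaround on its own looks roughly like this (a sketch on random data; the function below also applies the same cycle to the legend entries):

from itertools import cycle

import numpy as np
import pandas as pd

markers = cycle('oxDs')
ax = pd.DataFrame(np.random.rand(8, 10)).plot(figsize=(8, 4))
for line in ax.get_lines():
    line.set_marker(next(markers))  # color + marker stays unique longer than color alone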

In [13]:
from itertools import cycle
def bar_graph(alldatadf,
              x_axis_col='Year',
              y_axis_col='Level 3&4 %',
              split_bars_by='Grade',
              kind='bar',
              figsize=(10,10),
              drop=['All Grades']):
    """Plot a bar graph of the data in alldatadf.
    
    alldatadf: DataFrame.
    x_axis_col: the column of alldatadf to use as the x-axis. 
    y_axis_col: the column of alldatadf to use as the y-axis.
    split_bars_by: a column name of alldatadf, each unique value
        in the column is plotted as one color of bar.
    kind: the type of plot. Must be supported by the plot method
        of pandas DataFrames.
    figsize: tuple of length 2, the figure size.
    drop: columns of alldatadf to drop.
    """
    grouped = alldatadf.groupby([x_axis_col, split_bars_by])
    aggregate_grouped = grouped.agg([np.mean, np.std], level=1)
    aggregate_grouped_y = aggregate_grouped[y_axis_col]
    unstacked = aggregate_grouped_y.unstack()
    unstacked = unstacked.drop(drop, axis=1, level=1)
    fig = plt.figure(figsize=figsize)
    ax = fig.add_axes([.1, .1, .8, .5])
    unstacked['mean'].plot(kind=kind, ax=ax)
    ax.set_ylabel(y_axis_col)
    ax.set_xticklabels(unstacked.index.values)
    legend = ax.legend(loc=2, ncol=2)
    legend.set_bbox_to_anchor((1.05, 1))
    legend.set_title(split_bars_by)
    if kind == 'line':
        marker_values = 'oxDs'
        markercycle = cycle(marker_values)
        for line in ax.get_lines():
            line.set_marker(next(markercycle))
        markercycle = cycle(marker_values) # reset the cycle
        for label in legend.get_lines():
            label.set_marker(next(markercycle))
In [14]:
bar_graph(mergeddf)
In [15]:
bar_graph(mergeddf, split_bars_by='Borough')
In [16]:
bar_graph(mergeddf, split_bars_by='Borough', x_axis_col='Grade')
In [17]:
bar_graph(mergeddf, split_bars_by='District')

Yikes, this is too many bars! Time for a line graph:

In [18]:
bar_graph(mergeddf, split_bars_by='District', kind='line')

These lines are still a bit hard to tell apart. To fix this I'll change the matplotlib defaults (using plt.rc()) to use a spectrum of colors.

In [19]:
colormap = plt.get_cmap('spectral')
colorlist = [colormap(z) for z in np.linspace(0, 1, num=32)]
plt.rc('axes', color_cycle=colorlist)
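
On newer matplotlib releases the 'spectral' colormap name and the axes color_cycle rc parameter have been removed; the rough equivalent there would be:

import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler

colormap = plt.get_cmap('nipy_spectral')  # 'spectral' was renamed to 'nipy_spectral'
colorlist = [colormap(z) for z in np.linspace(0, 1, num=32)]
plt.rc('axes', prop_cycle=cycler(color=colorlist))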
In [20]:
bar_graph(mergeddf, split_bars_by='District', kind='line')
