Exploring NYC school data

New York City makes a large amount of data on its school system available for analysis. I'm especially interested in this data since I was an input into the process that generated it from 2006-2008. Here I've taken some of that data and looked at trends of test scores across geography and time.

The data are available here:

The standard imports:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Read the data into pandas:

I save the data as a .csv file in the end because parsing the excel sheet is slow, and storing the .csv avoids repeating this step. This data is from a New York State Math test from 2006-2013, grades 3-8.

In [2]:
rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
                      'All Students',
                       skiprows=6, 
                       na_values=['s'])
raw2013 = pd.read_excel('data/SchoolMathResults2013.xlsx',
                        'All Students',
                        skiprows=6,
                        na_values=['s'])
rawdf = pd.concat([rawdf, raw2013])
rawdf.to_csv('data/SchoolMathResults20062013Public.csv')

Read in the .csv data:

In [3]:
rawdf = pd.read_csv('data/SchoolMathResults20062013Public.csv')
rawdf.head()
Out[3]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score # % #.1 %.1 #.2 %.2 #.3 %.3 #.4 %.4
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Here I add information back into the column names that was lost when reading in the excel sheet:

In [4]:
level_number_strings = [str(i) for i in range(1,5)] + ['3&4']
new_cols = ['Level ' + str(s1) + ' ' + s2
            for s1 in level_number_strings
            for s2 in ['#', '%']]
rawdf.columns = np.append(rawdf.columns.values[:-10], new_cols)
rawdf.head()
Out[4]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 % Level 3&4 # Level 3&4 %
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Read in school demographic data from the excel sheet:

Here I use the converters argument to get the ending year of the school years, which appear in "2006-2007" format in this file. This makes the year column consistent with the format of the test results data. Also I use na_values to properly encode the string "n/a" as a missing data point.

In [5]:
demographic_df = pd.read_excel(
                    'data/DemographicSnapshot2012Public.xlsx',
                    'School Demographics',
                    converters={'School Year': lambda x: int(x[-4:])},
                    na_values=['n/a'])
demographic_df = demographic_df.replace(r'^\s+$', np.nan, regex=True)
demographic_df.to_csv('data/DemographicSnapshot2012Public.csv')
demographic_df.head()
Out[5]:
DBN Name School Year % Free Lunch % Free and Reduced Price Lunch Total Enrollment Pre-K K Grade 1 Grade 2 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 NaN 281 15 36 40 33 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 NaN 243 15 29 39 38 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 NaN 261 18 43 39 36 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 NaN 252 17 37 44 32 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 NaN 96.5 208 16 40 28 32 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 38 columns

A bit of wrangling on the demographic data:

The combine_first method replaces all the NaN values in the '% Free Lunch' column with the value from the '% Free and Reduced Price Lunch' column. It seems '% Free and Reduced Price Lunch' became the statistic of choice starting in 2010.

In [6]:
combined_lunch = demographic_df['% Free Lunch'].combine_first(
                        demographic_df['% Free and Reduced Price Lunch']
                        )
demographic_df['% Free Lunch'] = combined_lunch.astype(float)
demographic_df = demographic_df.drop(['% Free and Reduced Price Lunch'], axis=1)
demographic_df = demographic_df.rename(columns={'School Year': 'Year'})
demographic_df.head(5)
Out[6]:
DBN Name Year % Free Lunch Total Enrollment Pre-K K Grade 1 Grade 2 Grade 3 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 281 15 36 40 33 38 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 243 15 29 39 38 34 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 261 18 43 39 36 38 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 252 17 37 44 32 34 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 96.5 208 16 40 28 32 30 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 37 columns

Combine the test results data with the demographic data (similar to a database JOIN operation):

In [7]:
mergeddf = rawdf.merge(demographic_df, on=['DBN', 'Year'])

Extract a few new columns:

Pandas includes vectorized string methods that allow fast, NaN safe extraction of borough and district from the 'DBN' column using regular expressions.

In [8]:
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
mergeddf.head()
Out[8]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # ... # Hispanic % Hispanic # White % White # Male % Male # Female % Female Borough District
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
1 7 01M015 4 2006 All Students 49 629 20 40.8 18 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
2 14 01M015 5 2006 All Students 31 630 12 38.7 9 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
3 21 01M015 6 2006 All Students 39 639 10 25.6 12 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
4 22 01M015 All Grades 2006 All Students 158 NaN 44 27.8 50 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01

5 rows × 54 columns

Define a function that makes some useful plots from the data:

I find it useful to write plotting functions that take in the data in question as a DataFrame, that way the data can be easily subsetted or transformed and then easily plotted while doing exploratory analysis. The plot() method of DataFrame is helpful for segmenting the data between columns, which otherwise would require repeated explicit matplotlib calls.

It's a little unintuitive to make scatter plots with the pandas plot() method, because kind='scatter' doesn't work. The solution is to pass in the arguments linestyle='', marker='o', which gets rid of the line and adds a circle at each data point.

In [9]:
def scatter_plot(alldatadf,
                 x_axis_col='% Free Lunch',
                 y_axis_col='Mean Scale Score',
                 separate_plots_by='Year',
                 groups_on_plots='Grade',
                 drop=[]):
    """Make scatter plots of data in alldatadf.
    
    alldatadf: pandas data frame.
    x_axis_col: the column of alldatadf to use as the x-axis. Should
        be the same for all values of the in the groups_on_plots column,
        since this will become in the index of the data frame that is
        eventually plotted.
    y_axis_col: the column of alldatadf to use as the y-axis.
    separate_plots_by: a column name of alldatadf, each unique value
        in the column be plotted in a separate figure.
    groups_on_plots: a column name of alldatadf, each unique value
        will be plotted as a separate color.
    drop: columns of alldatadf to drop.
    """
    for grp_name, df in alldatadf.groupby(separate_plots_by):
        pivot_table = df.pivot(index='DBN', columns=groups_on_plots)
        pivot_table = pivot_table[[x_axis_col, y_axis_col]]
        pivot_table.index = pivot_table[x_axis_col].iloc[:, 0]
        pivot_table.index.name = x_axis_col
        pivot_table = pivot_table[y_axis_col]
        pivot_table = pivot_table.drop(drop, axis=1)
        pivot_table = pivot_table[~np.isnan(pivot_table.index.values)]
        pivot_table = pivot_table.sort_index()
        ax = pivot_table.plot(linestyle='',
                              marker='o',
                              alpha=.5,
                              title=grp_name,
                              figsize=(10,6))
        ax.set_ylabel(y_axis_col)
        title_value = ax.get_title()
        ax.set_title(separate_plots_by + ' = ' + title_value)

Here's the plotting function in action, with its default arguments. The data is pretty dense here and specific trends within each grade are hard to distinguish, however there does seem to be an overall negative correlation between the mean score and the % free lunch.

In [10]:
scatter_plot(mergeddf)