Exploring NYC school data

New York City makes a large amount of data on its school system available for analysis. I'm especially interested in this data since I was an input into the process that generated it from 2006-2008. Here I've taken some of that data and looked at trends of test scores across geography and time.

The data are available from the New York City Department of Education.

The standard imports:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Read the data into pandas:

I save the data as a .csv file at the end because parsing the Excel sheets is slow, and storing the .csv avoids repeating that step on later runs. This data is from the New York State Math test for 2006-2013, grades 3-8.
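
One way that .csv caching could be wired up (a sketch reusing the paths from the cells below; only the 2006-2012 file is shown here, while the next cell does the full parse including 2013):

import os.path
import pandas as pd

csv_path = 'data/SchoolMathResults20062013Public.csv'
if os.path.exists(csv_path):
    # the Excel files were already parsed on an earlier run; reuse the saved .csv
    rawdf = pd.read_csv(csv_path)
else:
    # slow path: parse the Excel sheet and cache the result as .csv
    rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
                          'All Students', skiprows=6, na_values=['s'])
    rawdf.to_csv(csv_path)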

In [2]:
rawdf = pd.read_excel('data/SchoolMathResults20062012Public.xlsx',
                      'All Students',
                       skiprows=6, 
                       na_values=['s'])
raw2013 = pd.read_excel('data/SchoolMathResults2013.xlsx',
                        'All Students',
                        skiprows=6,
                        na_values=['s'])
rawdf = pd.concat([rawdf, raw2013])
rawdf.to_csv('data/SchoolMathResults20062013Public.csv')

Read in the .csv data:

In [3]:
rawdf = pd.read_csv('data/SchoolMathResults20062013Public.csv')
rawdf.head()
Out[3]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score # % #.1 %.1 #.2 %.2 #.3 %.3 #.4 %.4
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Here I add back the information that was lost from the column names when reading in the Excel sheet:

In [4]:
# rebuild the performance-level column names: 'Level 1 #', 'Level 1 %', ..., 'Level 3&4 %'
level_number_strings = [str(i) for i in range(1, 5)] + ['3&4']
new_cols = ['Level ' + s1 + ' ' + s2
            for s1 in level_number_strings
            for s2 in ['#', '%']]
rawdf.columns = np.append(rawdf.columns.values[:-10], new_cols)
rawdf.head()
Out[4]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 % Level 3&4 # Level 3&4 %
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 28.2 20 51.3 6 15.4 26 66.7
1 1 01M015 3 2007 All Students 31 672 2 6.5 3 9.7 22 71.0 4 12.9 26 83.9
2 2 01M015 3 2008 All Students 37 668 0 0.0 6 16.2 29 78.4 2 5.4 31 83.8
3 3 01M015 3 2009 All Students 33 668 0 0.0 4 12.1 28 84.8 1 3.0 29 87.9
4 4 01M015 3 2010 All Students 26 677 6 23.1 12 46.2 6 23.1 2 7.7 8 30.8

Read in school demographic data from the Excel sheet:

Here I use the converters argument to extract the ending year of each school year, which appears in "2006-2007" format in this file; this makes the Year column consistent with the format of the test results data. I also use na_values to properly encode the string "n/a" as a missing data point.
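
A quick check of what that converter does on a throwaway school-year string:

year_converter = lambda x: int(x[-4:])
year_converter('2006-2007')  # -> 2007, the ending year of the school year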

In [5]:
demographic_df = pd.read_excel(
                    'data/DemographicSnapshot2012Public.xlsx',
                    'School Demographics',
                    converters={'School Year': lambda x: int(x[-4:])},
                    na_values=['n/a'])
demographic_df = demographic_df.replace(r'^\s+$', np.nan, regex=True)
demographic_df.to_csv('data/DemographicSnapshot2012Public.csv')
demographic_df.head()
Out[5]:
DBN Name School Year % Free Lunch % Free and Reduced Price Lunch Total Enrollment Pre-K K Grade 1 Grade 2 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 NaN 281 15 36 40 33 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 NaN 243 15 29 39 38 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 NaN 261 18 43 39 36 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 NaN 252 17 37 44 32 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 NaN 96.5 208 16 40 28 32 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 38 columns

A bit of wrangling on the demographic data:

The combine_first method replaces all the NaN values in the '% Free Lunch' column with the value from the '% Free and Reduced Price Lunch' column. It seems '% Free and Reduced Price Lunch' became the statistic of choice starting in 2010.
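
Here is combine_first on a pair of toy Series (made-up numbers) to show which value wins:

import numpy as np
import pandas as pd

free = pd.Series([89.4, np.nan, np.nan])
free_and_reduced = pd.Series([np.nan, 96.5, 94.1])
free.combine_first(free_and_reduced)  # -> 89.4, 96.5, 94.1: NaNs are filled from the second Series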

In [6]:
combined_lunch = demographic_df['% Free Lunch'].combine_first(
                        demographic_df['% Free and Reduced Price Lunch']
                        )
demographic_df['% Free Lunch'] = combined_lunch.astype(float)
demographic_df = demographic_df.drop(['% Free and Reduced Price Lunch'], axis=1)
demographic_df = demographic_df.rename(columns={'School Year': 'Year'})
demographic_df.head(5)
Out[6]:
DBN Name Year % Free Lunch Total Enrollment Pre-K K Grade 1 Grade 2 Grade 3 ... # Black % Black # Hispanic % Hispanic # White % White # Male % Male # Female % Female
0 01M015 P.S. 015 ROBERTO CLEMENTE 2006 89.4 281 15 36 40 33 38 ... 74 26.334520 189 67.259786 5 1.779359 158 56.227758 123 43.772242
1 01M015 P.S. 015 ROBERTO CLEMENTE 2007 89.4 243 15 29 39 38 34 ... 68 27.983539 153 62.962963 4 1.646091 140 57.613169 103 42.386831
2 01M015 P.S. 015 ROBERTO CLEMENTE 2008 89.4 261 18 43 39 36 38 ... 77 29.501916 157 60.153257 7 2.681992 143 54.789272 118 45.210728
3 01M015 P.S. 015 ROBERTO CLEMENTE 2009 89.4 252 17 37 44 32 34 ... 75 29.761905 149 59.126984 7 2.777778 149 59.126984 103 40.873016
4 01M015 P.S. 015 ROBERTO CLEMENTE 2010 96.5 208 16 40 28 32 30 ... 67 32.211538 118 56.730769 6 2.884615 124 59.615385 84 40.384615

5 rows × 37 columns

Combine the test results data with the demographic data (similar to a database JOIN operation):
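
A toy illustration of the merge (miniature made-up frames; merge defaults to an inner join, so only (DBN, Year) pairs present in both frames survive):

import pandas as pd

scores = pd.DataFrame({'DBN': ['01M015', '01M019'],
                       'Year': [2006, 2006],
                       'Mean Scale Score': [667, 650]})
demo = pd.DataFrame({'DBN': ['01M015'],
                     'Year': [2006],
                     '% Free Lunch': [89.4]})
scores.merge(demo, on=['DBN', 'Year'])  # only the 01M015 row has a match in both frames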

In [7]:
mergeddf = rawdf.merge(demographic_df, on=['DBN', 'Year'])

Extract a few new columns:

Pandas includes vectorized string methods that allow fast, NaN-safe extraction of the borough and district from the 'DBN' column using regular expressions.
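
A quick check of the two patterns on a single sample DBN (on newer pandas versions str.extract returns a one-column DataFrame by default, so expand=False is passed here to keep the result a Series):

import pandas as pd

sample = pd.Series(['01M015'])
sample.str.extract(r'\d+([A-Z])\d+', expand=False)  # 'M'  -- the borough letter
sample.str.extract(r'(\d+)[A-Z]\d+', expand=False)  # '01' -- the district number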

In [8]:
borough = mergeddf['DBN'].str.extract(r'\d+([A-Z])\d+')
district = mergeddf['DBN'].str.extract(r'(\d+)[A-Z]\d+')
mergeddf['Borough'] = borough
mergeddf['District'] = district
mergeddf.head()
Out[8]:
Unnamed: 0 DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # ... # Hispanic % Hispanic # White % White # Male % Male # Female % Female Borough District
0 0 01M015 3 2006 All Students 39 667 2 5.1 11 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
1 7 01M015 4 2006 All Students 49 629 20 40.8 18 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
2 14 01M015 5 2006 All Students 31 630 12 38.7 9 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
3 21 01M015 6 2006 All Students 39 639 10 25.6 12 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01
4 22 01M015 All Grades 2006 All Students 158 NaN 44 27.8 50 ... 189 67.259786 5 1.779359 158 56.227758 123 43.772242 M 01

5 rows × 54 columns

Define a function that makes some useful plots from the data:

I find it useful to write plotting functions that take the data in question as a DataFrame; that way the data can be subsetted or transformed and then plotted with a single call during exploratory analysis. The DataFrame plot() method is helpful for splitting the data up by column, which would otherwise require repeated explicit matplotlib calls.

It's a little unintuitive to make scatter plots with the pandas plot() method, because kind='scatter' doesn't work here. The workaround is to pass the arguments linestyle='' and marker='o', which removes the connecting line and draws a circle at each data point.
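
The idiom in isolation, on a throwaway DataFrame:

import pandas as pd

demo = pd.DataFrame({'a': [1, 3, 2], 'b': [2, 1, 3]}, index=[10, 20, 30])
demo.plot(linestyle='', marker='o')  # circles at each point, no connecting lines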

In [9]:
def scatter_plot(alldatadf,
                 x_axis_col='% Free Lunch',
                 y_axis_col='Mean Scale Score',
                 separate_plots_by='Year',
                 groups_on_plots='Grade',
                 drop=[]):
    """Make scatter plots of data in alldatadf.
    
    alldatadf: pandas data frame.
    x_axis_col: the column of alldatadf to use as the x-axis. Should
        be the same for every value in the groups_on_plots column,
        since it becomes the index of the data frame that is
        eventually plotted.
    y_axis_col: the column of alldatadf to use as the y-axis.
    separate_plots_by: a column name of alldatadf, each unique value
        in the column is plotted in a separate figure.
    groups_on_plots: a column name of alldatadf, each unique value
        will be plotted as a separate color.
    drop: columns of alldatadf to drop.
    """
    for grp_name, df in alldatadf.groupby(separate_plots_by):
        # one row per school (DBN), one sub-column per value of groups_on_plots
        pivot_table = df.pivot(index='DBN', columns=groups_on_plots)
        pivot_table = pivot_table[[x_axis_col, y_axis_col]]
        # use the x-axis values as the index, then keep only the y columns
        pivot_table.index = pivot_table[x_axis_col].iloc[:, 0]
        pivot_table.index.name = x_axis_col
        pivot_table = pivot_table[y_axis_col]
        pivot_table = pivot_table.drop(drop, axis=1)
        # drop rows with a missing x value and sort along the x-axis
        pivot_table = pivot_table[~np.isnan(pivot_table.index.values)]
        pivot_table = pivot_table.sort_index()
        ax = pivot_table.plot(linestyle='',
                              marker='o',
                              alpha=.5,
                              title=grp_name,
                              figsize=(10,6))
        ax.set_ylabel(y_axis_col)
        title_value = ax.get_title()
        ax.set_title(separate_plots_by + ' = ' + title_value)

Here's the plotting function in action, with its default arguments. The data is pretty dense here and specific trends within each grade are hard to distinguish; however, there does seem to be an overall negative correlation between the mean score and the % free lunch.

In [10]:
scatter_plot(mergeddf)

Calculating the correlation coefficient:

As with most things, pandas makes it easy to compute the correlation coefficient (calculated here over all mean scale score data points in the data set):

In [11]:
mergeddf[['% Free Lunch', 'Mean Scale Score']].corr()
Out[11]:
% Free Lunch Mean Scale Score
% Free Lunch 1.000000 -0.470529
Mean Scale Score -0.470529 1.000000

Now, instead of the mean score, the % of students reaching levels 3&4 (i.e. passing) is plotted on the y-axis, with a separate figure for each grade and the years distinguished by color. Although the plotted data is again very dense, some years do seem to have higher passing rates than others, which I investigate with a bar graph later in the analysis.

In [12]:
scatter_plot(mergeddf,
             groups_on_plots='Year',
             separate_plots_by='Grade',
             y_axis_col='Level 3&4 %')

Making bar graphs:

To get a view of summary statistics rather than the raw scatter of individual schools, I wrote a function to make bar graphs of a given slicing of the data. The function can also make line graphs for cases where there are so many bars that the graph becomes unreadable (for example, when each of the 32 districts gets its own bar). Unfortunately, the line graph runs into a common problem: the default matplotlib color cycle repeats after only 7 colors, so no more than 7 lines can be plotted before two of them look identical in the legend. I get around this by specifying a cycle of markers for the lines, as sketched below. Later in the analysis I also change the default color cycle itself to make the lines easier to tell apart.
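
The marker workaround on its own looks roughly like this (a sketch on random data; the function below also applies the same cycle to the legend entries):

from itertools import cycle

import numpy as np
import pandas as pd

markers = cycle('oxDs')
ax = pd.DataFrame(np.random.rand(8, 10)).plot(figsize=(8, 4))
for line in ax.get_lines():
    line.set_marker(next(markers))  # color + marker stays unique longer than color alone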

In [13]:
from itertools import cycle
def bar_graph(alldatadf,
              x_axis_col='Year',
              y_axis_col='Level 3&4 %',
              split_bars_by='Grade',
              kind='bar',
              figsize=(10,10),
              drop=['All Grades']):
    """Plot a bar graph of the data in alldatadf.
    
    alldatadf: DataFrame.
    x_axis_col: the column of alldatadf to use as the x-axis. 
    y_axis_col: the column of alldatadf to use as the y-axis.
    split_bars_by: a column name of alldatadf, each unique value
        in the column is plotted as one color of bar.
    kind: the type of plot. Must be supported by the plot method
        of pandas DataFrames.
    figsize: tuple of length 2, the figure size.
    drop: columns of alldatadf to drop.
    """
    grouped = alldatadf.groupby([x_axis_col, split_bars_by])
    aggregate_grouped = grouped.agg([np.mean, np.std], level=1)
    aggregate_grouped_y = aggregate_grouped[y_axis_col]
    unstacked = aggregate_grouped_y.unstack()
    unstacked = unstacked.drop(drop, axis=1, level=1)
    fig = plt.figure(figsize=figsize)
    ax = fig.add_axes([.1, .1, .8, .5])
    unstacked['mean'].plot(kind=kind, ax=ax)
    ax.set_ylabel(y_axis_col)
    ax.set_xticklabels(unstacked.index.values)
    legend = ax.legend(loc=2, ncol=2)
    legend.set_bbox_to_anchor((1.05, 1))
    legend.set_title(split_bars_by)
    if kind == 'line':
        marker_values = 'oxDs'
        markercycle = cycle(marker_values)
        for line in ax.get_lines():
            line.set_marker(next(markercycle))
        markercycle = cycle(marker_values) # reset the cycle
        for label in legend.get_lines():
            label.set_marker(next(markercycle))
In [14]:
bar_graph(mergeddf)
In [15]:
bar_graph(mergeddf, split_bars_by='Borough')
In [16]:
bar_graph(mergeddf, split_bars_by='Borough', x_axis_col='Grade')
In [17]:
bar_graph(mergeddf, split_bars_by='District')

Yikes, this is too many bars! Time for a line graph:

In [18]:
bar_graph(mergeddf, split_bars_by='District', kind='line')

These lines are still a bit hard to tell apart. To fix this I'll change the matplotlib defaults (using plt.rc()) to use a spectrum of colors.

In [19]:
colormap = plt.get_cmap('spectral')
colorlist = [colormap(z) for z in np.linspace(0, 1, num=32)]
plt.rc('axes', color_cycle=colorlist)
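
On newer matplotlib releases the 'spectral' colormap name and the axes color_cycle rc parameter have been removed; the rough equivalent there would be:

import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler

colormap = plt.get_cmap('nipy_spectral')  # 'spectral' was renamed to 'nipy_spectral'
colorlist = [colormap(z) for z in np.linspace(0, 1, num=32)]
plt.rc('axes', prop_cycle=cycler(color=colorlist))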
In [20]:
bar_graph(mergeddf, split_bars_by='District', kind='line')
