CS816: Big Data Analytics

Exploring Midterm Exam Scores

The dataset holds two scores for each student: the raw grade and the curved grade. Since the class is about data wrangling and analysis, the score dataset is a good example to play with.

In [1]:
grades = [
    ['U0090', 78, 88],
    ['U0374', 81, 90],
    ['U0592', 96, 98],
    ['U0621', 48, 69],
    ['U1331', 63, 79],
    ['U2711', 42, 65],
    ['U2967', 84, 92],
    ['U4407', 80, 89],
    ['U5038', 94, 97],
    ['U5039', 80, 89],
    ['U5235', 72, 85],
    ['U5677', 53, 73],
    ['U5871', 85, 92],
    ['U6278', 13, 36],
    ['U6367', 31, 56],
    ['U6397', 94, 97],
    ['U7575', 77, 88],
    ['U7608', 65, 81],
    ['U8101', 64, 80],
    ['U9346', 92, 96]
]
In [2]:
from __future__ import division
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
matplotlib.style.use('ggplot')
%matplotlib inline
In [3]:
df = pd.DataFrame.from_records(grades, columns=['Pace_ID', 'raw_grade', 'curved_grade'], index='Pace_ID')
# sort scores
df = df.sort_values(by='curved_grade', ascending=False)
In [4]:
# add gain value
df['gain'] = df.curved_grade - df.raw_grade
df
Out[4]:
         raw_grade  curved_grade  gain
Pace_ID
U0592           96            98     2
U6397           94            97     3
U5038           94            97     3
U9346           92            96     4
U5871           85            92     7
U2967           84            92     8
U0374           81            90     9
U4407           80            89     9
U5039           80            89     9
U7575           77            88    11
U0090           78            88    10
U5235           72            85    13
U7608           65            81    16
U8101           64            80    16
U1331           63            79    16
U5677           53            73    20
U0621           48            69    21
U2711           42            65    23
U6367           31            56    25
U6278           13            36    23
In [5]:
df.describe()
Out[5]:
       raw_grade  curved_grade       gain
count  20.000000      20.00000  20.000000
mean   69.600000      82.00000  12.400000
std    22.483678      15.71121   7.257664
min    13.000000      36.00000   2.000000
25%    60.500000      77.50000   7.750000
50%    77.500000      88.00000  10.500000
75%    84.250000      92.00000  17.000000
max    96.000000      98.00000  25.000000
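
The fractional quartiles in the summary (for example 60.5 for the 25% row of raw_grade) come from linear interpolation between neighbouring sorted values. The sketch below reproduces the raw_grade quartiles with NumPy; the names `ordered`, `pos`, and `q1_manual` are illustrative and not part of the notebook, and the raw grades are copied by hand from the `grades` list.

```python
import numpy as np

# Raw grades copied from the `grades` list at the top of the notebook.
raw = np.array([78, 81, 96, 48, 63, 42, 84, 80, 94, 80,
                72, 53, 85, 13, 31, 94, 77, 65, 64, 92])

ordered = np.sort(raw)

# 25th percentile by hand: fractional 0-based position 0.25 * (n - 1),
# then interpolate linearly between the two neighbouring sorted values.
pos = 0.25 * (len(ordered) - 1)
lo = int(np.floor(pos))
frac = pos - lo
q1_manual = ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])

# np.percentile uses the same linear interpolation by default.
q1, median, q3 = np.percentile(raw, [25, 50, 75])

print("manual 25%%: %.2f" % q1_manual)
print("numpy 25%%, 50%%, 75%%: %.2f, %.2f, %.2f" % (q1, median, q3))
```

With 20 scores the 25% position is 4.75, so the value sits three quarters of the way from the fifth to the sixth sorted score (53 and 63).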
In [6]:
df.plot(kind='barh', figsize=(12,13))
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x108427cd0>

Exploring the grade distribution

In [7]:
df.plot(kind='kde', figsize=(10,6))
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10860af10>
In [8]:
df.plot(kind='box', figsize=(10,6))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a06a990>
In [9]:
df.plot(kind='area', figsize=(13,5))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a3669d0>

The curving method (root function):

Curved grade $f(x) = 10\sqrt{x}$

The score is then rounded to the nearest integer.

In [10]:
# curve function
curve = lambda x: 10 * np.sqrt(x)
In [11]:
# evaluate
evalu = [round(x) for x in curve(df.raw_grade)]
np.all(df.curved_grade == evalu)
Out[11]:
True
In [12]:
# e.g.
print(round(curve(13)))
print(round(curve(77)))
36.0
88.0
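
The root curve also explains the pattern in the gain column: ignoring rounding, the gain is $g(x) = 10\sqrt{x} - x$, which is zero at $x = 0$ and $x = 100$ and peaks where $g'(x) = 5/\sqrt{x} - 1 = 0$, that is at $x = 25$. The sketch below checks this numerically. It assumes raw scores are integers on a 0 to 100 scale, and the names `raw_scale`, `gain_curve`, `best_raw`, and `max_gain` are illustrative only.

```python
import numpy as np

# Hypothetical grid of every possible integer raw score (assumes a 0-100 scale).
raw_scale = np.arange(0, 101)

# Unrounded gain of the root curve: g(x) = 10 * sqrt(x) - x.
gain_curve = 10 * np.sqrt(raw_scale) - raw_scale

# Raw score at which the unrounded gain is largest.
best_raw = int(raw_scale[np.argmax(gain_curve)])
max_gain = float(gain_curve.max())

print("largest unrounded gain: %.1f at raw score %d" % (max_gain, best_raw))
```

Once the curved grade is rounded, several raw scores near 25 share the same integer gain, so the rounded maximum is not tied to a single raw score. This is consistent with the table above, where a raw grade of 31 receives the largest gain in the class.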