Fitting a simple linear regression line

In simple linear regression, the best-fitting line is:

$$ y = a + bx $$

where $\;\; b = \frac{ \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} }{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} $ and $\;\;a = \bar{y} - b\bar{x}$

$\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ values, respectively.
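These two formulas can be evaluated directly with NumPy (a minimal sketch; the sample `x` and `y` values here are chosen just for illustration):

```python
import numpy as np

x = np.array([0., 5., 10., 15., 20.])
y = np.array([0., 7., 10., 13., 20.])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# slope: b = (sum(x_i * y_i) - n * x_bar * y_bar) / (sum(x_i^2) - n * x_bar^2)
b = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
# intercept: a = y_bar - b * x_bar
a = y_bar - b * x_bar
print('a = {}, b = {}'.format(a, b))
```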


A small example:

In [1]:
import numpy as np
In [2]:
points = np.array([
    [0,0],
    [5,7],
    [10,10],
    [15,13],
    [20,20]])
In [3]:
x = points[:, 0]
y = points[:, 1]

The solution

In [4]:
A = np.vstack([x, np.ones(len(x))]).T
b, a = np.linalg.lstsq(A, y, rcond=None)[0]
print('y = {} + {}x'.format(a, b))
y = 0.8 + 0.92x
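As a cross-check (not part of the original notebook), `np.polyfit` with degree 1 fits the same least-squares line:

```python
import numpy as np

x = np.array([0, 5, 10, 15, 20])
y = np.array([0, 7, 10, 13, 20])

# np.polyfit returns coefficients from highest degree down: [slope, intercept]
b, a = np.polyfit(x, y, 1)
print('y = {:.2f} + {:.2f}x'.format(a, b))  # y = 0.80 + 0.92x
```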

Plot the line:

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(x, y, 'o', label='Original data')
plt.plot(x, b*x + a, label='Fitted line')
plt.legend(bbox_to_anchor=(1.5, 1))
plt.grid()
plt.show()

Compute the error:

$error_i = y_i - \hat{y}_i$

where $y_i$ is the actual value, and $\hat{y}_i$ is the predicted one.

The sum of squared errors (SSE):

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

In [6]:
error = lambda x, y: (y - (a + b * x))**2
sse = sum([error(i,j) for i, j in zip(x,y)])
print('SSE = {}'.format(sse))
SSE = 6.4
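The same SSE can also be computed in vectorized form, without a Python loop (a sketch assuming the fitted `a = 0.8` and `b = 0.92` from above):

```python
import numpy as np

x = np.array([0, 5, 10, 15, 20])
y = np.array([0, 7, 10, 13, 20])
a, b = 0.8, 0.92  # intercept and slope fitted above

residuals = y - (a + b * x)   # y_i - y_hat_i for every point at once
sse = np.sum(residuals ** 2)
print('SSE = {}'.format(sse))
```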

The absolute error for each point:

In [7]:
for i, j in zip(x,y):
    print('point: {}\terror: {:.2f}'.format((i,j), np.sqrt(error(i,j))))
point: (0, 0)	error: 0.80
point: (5, 7)	error: 1.60
point: (10, 10)	error: 0.00
point: (15, 13)	error: 1.60
point: (20, 20)	error: 0.80



In [8]:
!whoami && date
Aziz
Mon Dec 14 23:02:58 EST 2015