### Exercise 5 (Ch.2 - Duda):

In [17]:
question5
Out[17]:

Concept:

• The problem deals with the idea of the Central Limit Theorem (CLT) that:
• "the aggregate effect of a large number of small, independent random disturbances will lead to a Gaussian distribution." Duda, et al- Ch.2 p.17

### Solution:¶

In [3]:
import numpy as np
from random import randrange as rrange
from random import choice as rchoice
import matplotlib.pyplot as plt
from scipy.stats.kde import gaussian_kde

class DudaExcercise5(object):
""" The functions of this class are tailored to answer Exercise 5 of Ch.2 in Duda, et al. """

def U(self, low=0, hi=1, num_points=3):
"""return an arbitrary distribution of integer values between U(xl,xu)."""
self.dist = [int(i) for i in np.random.uniform(low, hi, num_points)]
return self.dist

def get_sample(self, n):
"""select n random numbers from the distribution U(xl,xu)"""
d = self.dist
sample = [rchoice(d) for _ in range(n)]
return sample

def get_xl_xu_n(self):
"""choose xl and xu randomly, in the range âˆ’100 â‰¤ xl < xu â‰¤ +100, and n as the absolute difference between xl and xu. """
a = [rrange(-100, 100) for _ in range(2)]
xl, xu = min(a), max(a)
self.n = abs(xu - xl)
return xl, xu, self.n

def get_mean(self, sample):
"""compute the mean of a sample"""
if len(sample) > 1:
l = [float(i) for i in sample]
return sum(l) / len(sample)

def get_sample_means(self, num_samples, sample_size):
"""
select num_samples random samples from U(xl, xu), each sample of size sample_size.
return the means of a set of (num_samples) random variables """
sample_means = []
for i in range(num_samples):
s1 = self.get_sample(sample_size)
x1 = self.get_mean(s1)
sample_means.append(x1)

return sample_means

def plotme(self, data, hist=1, norm=0, density=0, labels=0, title=1, mean=1, std=1):
"""
input:
an array of a one-dimensional distribution between two boundaries.
return:
plot the probability distribution (the class-conditional probability or likelihood) with options of histogram, density, and/or kernel plots with options of mean and/or std.

parameters:
-----------
all parameters are [optional] boolean to show/hide the corresponding parameter on the subplot.

hist: generate (in subplot 1) the histogram of the distribution.

norm: normalize the values of y-axis.

density: generate (in subplot 2) the estimate of the pdf curve (estimated using gaussian kernels).

labels: display the label on the axes.

title: display the plot title.

mean: plot the mean of the distribution (in both subplots 1 and 2).

std: plot the sections of the standard deviations (in subplot 2)
"""

mu = np.mean(data)
sig = np.std(data)

fig = plt.figure(figsize=(15, 5))

# plot hist
if hist:
ax1.hist(data, bins=self.n, normed=norm)

# plot mean
if mean:
ax1.axvline(x=mu, ymin=0, ymax=0.75, linewidth=3, color='m', label=r'$\mu=${}'.format(mu))

# plot k-density (estimate the probability density function (PDF) of random variable)
if density:
# subplot 2
# this create the kernel, given an array it will estimate the probability over that values
xker = data
kde = gaussian_kde(xker)
# these are the values over wich your kernel will be evaluated
dist_space = np.linspace( min(xker), max(xker), len(xker) )
# plot the results
ax2.plot( dist_space, kde(dist_space), 'r--', label='pdf estimate')
ax1.plot( dist_space, kde(dist_space), 'r--', label='pdf estimate')

# plot std sections
if std:
for i in range(1, 3):
# right side
sec = mu + (i * sig)
ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g')
# left side
sec = mu + (-i * sig)
ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g')

ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g', label=r'$\sigma=${}'.format(sig))
ax2.axvline(x=mu, ymin=0, ymax=0.75, linewidth=2, color='m', label=r'$\mu=${}'.format(mu))

# plot title and axes labels
if title:
ax2.set_title(r'Distribution density with $\sigma$')

ax2.legend(loc='upper right')

if title:
ax1.set_title("Distribution")
ax1.legend(loc='upper right')
if labels:
ax1.set_ylabel("Frequency of the sample mean")

### Start by generating an arbiterary distribution (from a uniform distribution between xl and xu)¶

In [8]:
ex5 = DudaExcercise5()
xl, xu, n = ex5.get_xl_xu_n()
d = ex5.U(xl, xu, n)

print("Distribution of: U({},{}) (n = {} )".format(xl, xu, n))
Distribution of: U(-62,32) (n = 94 )

In [10]:
ex5.plotme(d)
plt.show()

### Sampling Distribution of the Sample Mean

Lets see what happens when we sample multiple sets (from the distribution above) and compute the mean of each; then plot the Frequency Distribition of the Sample Mean.

In [11]:
nsamples = 10**2      # number of samples
ssize    = 10**3      # size of each sample

# generate nsamples random samples, and compute the average of each sample (sample means)
s_means = ex5.get_sample_means(nsamples, ssize)

# plot the frequencies (aggregate) of the sample means
print("\n{} SAMPLE, EACH OF SIZE: {}".format(nsamples, ssize))
ex5.plotme(s_means, density=1, labels=1, norm=0)
plt.show()

100 SAMPLE, EACH OF SIZE: 1000

## Comparision: Multiple plots of different sample size and number of samples

In [12]:
num_samples = [10**3, 10**3, 10**3]  #  list of number of samples, to generate them randomly use [rrange(1, 100) for _ in range(3)]
sample_sizes = [10**3, 10**4, 10**5] #  list of sample size

for nsamples, ssize in zip(num_samples, sample_sizes):
# get sample means
%time s_means= ex5.get_sample_means(nsamples, ssize)

# plot sample distributions
print("\n{} SAMPLE, EACH OF SIZE: {}".format(nsamples, ssize))
ex5.plotme(s_means, density=1, labels=1, norm=1)
plt.show()
print("-" * 110)
CPU times: user 1.33 s, sys: 14 ms, total: 1.34 s
Wall time: 1.49 s

1000 SAMPLE, EACH OF SIZE: 1000

--------------------------------------------------------------------------------------------------------------
CPU times: user 13.1 s, sys: 96.1 ms, total: 13.2 s
Wall time: 14.6 s

1000 SAMPLE, EACH OF SIZE: 10000

--------------------------------------------------------------------------------------------------------------
CPU times: user 2min 16s, sys: 3.67 s, total: 2min 20s
Wall time: 15min 23s

1000 SAMPLE, EACH OF SIZE: 100000

--------------------------------------------------------------------------------------------------------------

### Observations:

• We see from the plots that the average of large number of independent random variables will eventually approximate to a normal distribution (Gaussian).
• As the sample size gets larger, two things happen:
• The standard deviation $\sigma$ becomes smaller. (consequently, more normal skew and kurtosis with large num_samples)
• The mean $\mu$ of the sampling distribution gets closer to the mean of the original distributioin.

References:

In [20]:
from IPython.core.display import Image
In [21]:
questiont5 = Image("~/Desktop/ex5-question.png")