Exercise 5 (Ch.2 - Duda):

In [17]:
question5
Out[17]:

Concept:

  • The problem deals with the idea of the Central Limit Theorem (CLT) that:
    • "the aggregate effect of a large number of small, independent random disturbances will lead to a Gaussian distribution." Duda, et al- Ch.2 p.17

Solution:

In [3]:
import numpy as np
from random import randrange as rrange
from random import choice as rchoice
import matplotlib.pyplot as plt
from scipy.stats.kde import gaussian_kde

class DudaExcercise5(object):
    """ The functions of this class are tailored to answer Exercise 5 of Ch.2 in Duda, et al. """
    
    def U(self, low=0, hi=1, num_points=3):
        """return an arbitrary distribution of integer values between U(xl,xu)."""
        self.dist = [int(i) for i in np.random.uniform(low, hi, num_points)]
        return self.dist
        
    def get_sample(self, n):
        """select n random numbers from the distribution U(xl,xu)"""
        d = self.dist
        sample = [rchoice(d) for _ in range(n)]
        return sample

    def get_xl_xu_n(self):
        """choose xl and xu randomly, in the range −100 ≤ xl < xu ≤ +100, and n as the absolute difference between xl and xu. """
        a = [rrange(-100, 100) for _ in range(2)]
        xl, xu = min(a), max(a)
        self.n = abs(xu - xl)
        return xl, xu, self.n

    def get_mean(self, sample):
        """compute the mean of a sample"""
        if len(sample) > 1:
            l = [float(i) for i in sample]
            return sum(l) / len(sample)
        
    def get_sample_means(self, num_samples, sample_size):
        """
        select `num_samples` random samples from U(xl, xu), each sample of size `sample_size`.
        return the means of a set of (`num_samples`) random variables """
        sample_means = []
        for i in range(num_samples):
            s1 = self.get_sample(sample_size)
            x1 = self.get_mean(s1)
            sample_means.append(x1)

        return sample_means


    def plotme(self, data, hist=1, norm=0, density=0, labels=0, title=1, mean=1, std=1):
        """ 
        input: 
            an array of a one-dimensional distribution between two boundaries.
        return:
            plot the probability distribution (`the class-conditional probability` or likelihood) with options of histogram, density, and/or kernel plots with options of mean and/or std.
        
        parameters:
        -----------
        all parameters are [optional] boolean to show/hide the corresponding parameter on the subplot.

        hist: generate (in subplot 1) the histogram of the distribution.
        
        norm: normalize the values of y-axis.
        
        density: generate (in subplot 2) the estimate of the pdf curve (estimated using gaussian kernels).
        
        labels: display the label on the axes.
        
        title: display the plot title.
        
        mean: plot the mean of the distribution (in both subplots 1 and 2).
        
        std: plot the sections of the standard deviations (in subplot 2)
        """

        mu = np.mean(data)
        sig = np.std(data)

        fig = plt.figure(figsize=(15, 5))
        ax1 = fig.add_subplot(121)

        # plot hist
        if hist:
            ax1.hist(data, bins=self.n, normed=norm)

        # plot mean
        if mean:
            ax1.axvline(x=mu, ymin=0, ymax=0.75, linewidth=3, color='m', label=r'$\mu=${}'.format(mu))


        # plot k-density (estimate the probability density function (PDF) of random variable)
        if density:
            # subplot 2
            ax2 = fig.add_subplot(122)
            # this create the kernel, given an array it will estimate the probability over that values
            xker = data
            kde = gaussian_kde(xker)
            # these are the values over wich your kernel will be evaluated
            dist_space = np.linspace( min(xker), max(xker), len(xker) )
            # plot the results
            ax2.plot( dist_space, kde(dist_space), 'r--', label='pdf estimate')
            ax1.plot( dist_space, kde(dist_space), 'r--', label='pdf estimate')

            # plot std sections
            if std:
                for i in range(1, 3):
                    # right side
                    sec = mu + (i * sig)
                    ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g')
                    # left side
                    sec = mu + (-i * sig)
                    ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g')

                ax2.axvline(sec, ymin=0, ymax=(0.5/i), linewidth=1, color='g', label=r'$\sigma=${}'.format(sig))
                # add mean to ax2
                ax2.axvline(x=mu, ymin=0, ymax=0.75, linewidth=2, color='m', label=r'$\mu=${}'.format(mu))

            # plot title and axes labels
            if title:
                ax2.set_title(r'Distribution density with $\sigma$')

            ax2.legend(loc='upper right')

        if title:
            ax1.set_title("Distribution")
        ax1.legend(loc='upper right')
        if labels:
            ax1.set_ylabel("Frequency of the sample mean")

Start by generating an arbiterary distribution (from a uniform distribution between xl and xu)

In [8]:
ex5 = DudaExcercise5()
xl, xu, n = ex5.get_xl_xu_n()
d = ex5.U(xl, xu, n)

print("Distribution of: U({},{}) (n = {} )".format(xl, xu, n))
Distribution of: U(-62,32) (n = 94 )

Histogram of the selected distribution

In [10]:
ex5.plotme(d)
plt.show()

Sampling Distribution of the Sample Mean

Lets see what happens when we sample multiple sets (from the distribution above) and compute the mean of each; then plot the Frequency Distribition of the Sample Mean.

In [11]:
nsamples = 10**2      # number of samples
ssize    = 10**3      # size of each sample

# generate `nsamples` random samples, and compute the average of each sample (sample means)
s_means = ex5.get_sample_means(nsamples, ssize)

# plot the frequencies (aggregate) of the sample means
print("\n{} SAMPLE, EACH OF SIZE: {}".format(nsamples, ssize))
ex5.plotme(s_means, density=1, labels=1, norm=0)
plt.show()

100 SAMPLE, EACH OF SIZE: 1000

Comparision: Multiple plots of different sample size and number of samples

In [12]:
num_samples = [10**3, 10**3, 10**3]  #  list of number of samples, to generate them randomly use [rrange(1, 100) for _ in range(3)]
sample_sizes = [10**3, 10**4, 10**5] #  list of sample size

for nsamples, ssize in zip(num_samples, sample_sizes):
    # get sample means
    %time s_means= ex5.get_sample_means(nsamples, ssize)

    # plot sample distributions
    print("\n{} SAMPLE, EACH OF SIZE: {}".format(nsamples, ssize))
    ex5.plotme(s_means, density=1, labels=1, norm=1)
    plt.show()
    print("-" * 110)
CPU times: user 1.33 s, sys: 14 ms, total: 1.34 s
Wall time: 1.49 s

1000 SAMPLE, EACH OF SIZE: 1000

--------------------------------------------------------------------------------------------------------------
CPU times: user 13.1 s, sys: 96.1 ms, total: 13.2 s
Wall time: 14.6 s

1000 SAMPLE, EACH OF SIZE: 10000

--------------------------------------------------------------------------------------------------------------
CPU times: user 2min 16s, sys: 3.67 s, total: 2min 20s
Wall time: 15min 23s

1000 SAMPLE, EACH OF SIZE: 100000

--------------------------------------------------------------------------------------------------------------

Observations:

  • We see from the plots that the average of large number of independent random variables will eventually approximate to a normal distribution (Gaussian).
  • As the sample size gets larger, two things happen:
    • The standard deviation $\sigma$ becomes smaller. (consequently, more normal skew and kurtosis with large num_samples)
    • The mean $\mu$ of the sampling distribution gets closer to the mean of the original distributioin.



References:

In [20]:
from IPython.core.display import Image
In [21]:
questiont5 = Image("~/Desktop/ex5-question.png")