Calculating the Entropy and Information Gain, part of an answer for:

http://csis.pace.edu/~benjamin/teaching/cs827/webfiles/learn/learnhmwk1.html

By: A.Aziz Altowayan


In [1]:
from __future__ import division  # so that "/" performs true division under Python 2.7
from math import log

The function H(pi) returns the entropy of a probability distribution: $$ H(p) = -\sum_{i=1}^{n} p_i \log_2(p_i) $$

Definition: entropy measures the uncertainty of a probability distribution. For a two-class distribution, its value ranges between 0 and 1.
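For example, a perfectly even two-way split carries maximum uncertainty, while a pure split carries none (taking $0 \log_2 0 = 0$): $$ H(\tfrac{1}{2}, \tfrac{1}{2}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \qquad H(1, 0) = -1\log_2 1 - 0\log_2 0 = 0 $$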

In [2]:
def H(pi):
    """Return the entropy of the distribution given by the counts (or probabilities) in pi."""
    total = 0
    for p in pi:
        p = p / sum(pi)          # normalize the count into a probability
        try:
            total += p * log(p, 2)
        except ValueError:       # log(0) is undefined; treat 0 * log2(0) as 0
            total += 0
    total *= -1
    return total

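As a quick sanity check of H (two illustrative calls, not part of the homework data): an even split gives entropy 1, and the 9-versus-5 class split used below gives about 0.940.

print(H([1, 1]))   # 1.0, maximum uncertainty between two classes
print(H([9, 5]))   # ~0.940, entropy of the overall PlayTennis label counts
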
The Information Gain $G(D, A)$ is calculated as: $$ G(D, A) = H(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} \, H(D_i) $$ where $D$ is the dataset, $A$ is an attribute in that dataset, and $D_i$ is the subset of $D$ in which $A$ takes its $i$-th value.

The following function G(d, a) does that:

In [3]:
def G(d, a):
    """Return the information gain of splitting the class counts d on attribute a."""
    total = 0
    for v in a:                    # v holds the class counts for one value of the attribute
        adi = sum(v)               # number of examples with this attribute value
        ad = sum(d)                # total number of examples
        total += adi / ad * H(v)   # weighted entropy of this subset

    gain = H(d) - total
    return gain

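As an illustrative call (not one of the homework attributes), a split that separates the two classes perfectly yields the maximum possible gain:

print(G([2, 2], [[2, 0], [0, 2]]))   # 1.0: H([2, 2]) = 1 and both subsets are pure
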
Now, we apply this to our PlayTennis dataset:

In [4]:
playTennis = [9, 5]  # [yes, no] counts over all 14 examples

outlook = [
    [4, 0],  # overcast
    [2, 3],  # sunny
    [3, 2]   # rain
]

temperature = [
    [2, 2],  # hot
    [3, 1],  # cool
    [4, 2]   # mild
]

humidity = [
    [3, 4],  # high
    [6, 1]   # normal
]

wind = [
    [6, 2],  # weak
    [3, 3]   # strong
]

print("Information Gain: ")
d = playTennis
attrs = outlook, temperature, humidity, wind

for a in attrs:
    print('G({}, {})'.format(d, a))
    print('\t = {}'.format( G(d, a) ))
print "*" * 40
Information Gain: 
G([9, 5], [[4, 0], [2, 3], [3, 2]])
	 = 0.246749819774
G([9, 5], [[2, 2], [3, 1], [4, 2]])
	 = 0.029222565659
G([9, 5], [[3, 4], [6, 1]])
	 = 0.151835501362
G([9, 5], [[6, 2], [3, 3]])
	 = 0.0481270304083
****************************************
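Outlook gives the largest gain, so an ID3-style learner would choose it as the root split. A minimal sketch of that selection (attrNames is just an illustrative list of labels, not part of the data above):

attrNames = ['outlook', 'temperature', 'humidity', 'wind']
best = max(zip(attrNames, attrs), key=lambda pair: G(d, pair[1]))
print('Attribute with the highest gain: {}'.format(best[0]))   # outlook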

The entropy of each attribute's value distributions:

In [5]:
print("Entropy of Outlook distribution: ")
for d in outlook:
    print('D:{}  E: {}'.format(d, H(d)))
Entropy of Outlook distribution: 
D:[4, 0]  E: -0.0
D:[2, 3]  E: 0.970950594455
D:[3, 2]  E: 0.970950594455
In [6]:
print("Entropy of Temperature distribution:")
for d in temperature:
    print('D:{}  E: {}'.format(d, H(d)))
Entropy of Temperature distribution:
D:[2, 2]  E: 1.0
D:[3, 1]  E: 0.811278124459
D:[4, 2]  E: 0.918295834054
In [7]:
print("Entropy of Humidity distribution:")
for d in humidity:
    print('D:{}  E: {}'.format(d, H(d)))
Entropy of Humidity distribution:
D:[3, 4]  E: 0.985228136034
D:[6, 1]  E: 0.591672778582
In [9]:
print("Entropy of Wind distribution:")
for d in wind:
    print('D:{}  E: {}'.format(d, H(d)))
Entropy of Wind distribution:
D:[6, 2]  E: 0.811278124459
D:[3, 3]  E: 1.0