Calculating the Entropy and Information Gain, part of an answer for:

http://csis.pace.edu/~benjamin/teaching/cs827/webfiles/learn/learnhmwk1.html

By: A.Aziz Altowayan

In [1]:
from __future__ import division # for python 2.7
from math import log


The function H(pi) returns the entropy of a probability distribution: $$H(p) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Definition: entropy measures the uncertainty of a probability distribution. For a two-class distribution its value ranges between 0 and 1.

• Low entropy means the distribution is skewed (peaks and valleys), so the outcome is more predictable.
• High entropy means the distribution is close to uniform, so the outcome is least predictable.
In [2]:
In [2]:
def H(pi):
    '''Return the entropy of a distribution given as a list of class counts.'''
    total = 0
    for p in pi:
        p = p / sum(pi)          # normalize counts to probabilities
        try:
            total += p * log(p, 2)
        except ValueError:       # log(0) is undefined; 0 * log2(0) contributes 0
            total += 0
    total *= -1
    return total

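As a quick sanity check of the two bullet points above (the counts here are illustrative, not from the dataset): a uniform two-class split should give entropy 1, and a pure split entropy 0. `H` is repeated inside the cell so it runs on its own:

```python
from math import log

def H(pi):
    '''Entropy of a distribution given as class counts
    (repeated here so this cell is self-contained).'''
    total = 0
    for p in pi:
        p = p / sum(pi)
        if p > 0:                 # skip p == 0: log(0) is undefined
            total += p * log(p, 2)
    return -total

print(H([5, 5]))   # uniform split -> 1.0 (maximum uncertainty)
print(H([10, 0]))  # pure split    -> 0.0 (no uncertainty)
```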

The Information Gain $G(D,A)$ is calculated according to: $$G(D, A) = H(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} \cdot H(D_i)$$ Where, $D$ is the dataset and $A$ is an Attribute in that dataset.

The following function G(d, a) does that:

In [3]:
def G(d, a):
    total = 0
    for v in a:
        # weight each branch's entropy by the fraction of examples it holds
        total += sum(v) / sum(d) * H(v)
    gain = H(d) - total
    return gain


Now, we apply that on our PlayTennis dataset:

In [4]:
playTennis = [9, 5]

outlook = [
[4, 0],  # overcast
[2, 3],  # sunny
[3, 2]   # rain
]

temperature = [
[2, 2],  # hot
[3, 1],  # cool
[4, 2]   # mild
]

humidity = [
[3, 4],  # high
[6, 1]   # normal
]

wind = [
[6, 2],  # weak
[3, 3]   # strong
]

print("Information Gain: ")
d = playTennis
attrs = outlook, temperature, humidity, wind

for a in attrs:
    print('G({}, {})'.format(d, a))
    print('\t = {}'.format(G(d, a)))
print("*" * 40)

Information Gain:
G([9, 5], [[4, 0], [2, 3], [3, 2]])
= 0.246749819774
G([9, 5], [[2, 2], [3, 1], [4, 2]])
= 0.029222565659
G([9, 5], [[3, 4], [6, 1]])
= 0.151835501362
G([9, 5], [[6, 2], [3, 3]])
= 0.0481270304083
****************************************
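The gains above make outlook the best first split. A small, self-contained sketch of that selection step (the attribute names are added here for readability; they do not appear in the code above):

```python
from math import log

def H(pi):
    '''Entropy from class counts (repeated so this cell runs standalone).'''
    total = 0
    for p in pi:
        p = p / sum(pi)
        if p > 0:
            total += p * log(p, 2)
    return -total

def G(d, a):
    '''Information gain: H(d) minus the weighted branch entropies.'''
    remainder = sum(sum(v) / sum(d) * H(v) for v in a)
    return H(d) - remainder

playTennis = [9, 5]
attrs = {
    'outlook':     [[4, 0], [2, 3], [3, 2]],
    'temperature': [[2, 2], [3, 1], [4, 2]],
    'humidity':    [[3, 4], [6, 1]],
    'wind':        [[6, 2], [3, 3]],
}

# ID3 picks the attribute with the highest information gain as the root.
best = max(attrs, key=lambda name: G(playTennis, attrs[name]))
print(best)  # -> outlook
```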


The entropy of the attributes distribution:

In [5]:
print("Entropy of Outlook distribution: ")
for d in outlook:
    print('D:{}  E: {}'.format(d, H(d)))

Entropy of Outlook distribution:
D:[4, 0]  E: -0.0
D:[2, 3]  E: 0.970950594455
D:[3, 2]  E: 0.970950594455

In [6]:
print("Entropy of Temperature distribution:")
for d in temperature:
    print('D:{}  E: {}'.format(d, H(d)))

Entropy of Temperature distribution:
D:[2, 2]  E: 1.0
D:[3, 1]  E: 0.811278124459
D:[4, 2]  E: 0.918295834054

In [7]:
print("Entropy of Humidity distribution:")
for d in humidity:
    print('D:{}  E: {}'.format(d, H(d)))

Entropy of Humidity distribution:
D:[3, 4]  E: 0.985228136034
D:[6, 1]  E: 0.591672778582

In [9]:
print("Entropy of Wind distribution:")
for d in wind:
    print('D:{}  E: {}'.format(d, H(d)))

Entropy of Wind distribution:
D:[6, 2]  E: 0.811278124459
D:[3, 3]  E: 1.0
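These per-branch entropies are exactly what $G$ weights and subtracts. Recomputing the outlook gain from them (self-contained sketch, numbers copied from the dataset above) reproduces the value printed by `G([9, 5], outlook)` earlier:

```python
from math import log

def H(pi):
    '''Entropy from class counts (repeated so this cell runs standalone).'''
    total = 0
    for p in pi:
        p = p / sum(pi)
        if p > 0:
            total += p * log(p, 2)
    return -total

playTennis = [9, 5]
outlook = [[4, 0], [2, 3], [3, 2]]

# Weighted average of the branch entropies printed above.
remainder = sum(sum(v) / sum(playTennis) * H(v) for v in outlook)
gain = H(playTennis) - remainder
print(gain)  # ~0.2467, matching G([9, 5], outlook) above
```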