CS816 Big Data Analytics

HW6, Team 3
Oct 14th, 2015
Answered by: Aziz


Apriori Algorithm

  • Learns the Association Rules between the Frequent Itemsets in a database transactions.
  • Support is the percentage of the itemset in the overall transactions, i.e. $$Support(itemset) = \frac{itemset}{Transactions}$$
  • Minimum Support is the minimum value for considering frequent itemsets.
  • If minimum support is 50%, a Frequent Itemset is the one that appears in at least 50% of the total transactions.

    e.g. itemset A is a frequent itemset if: $$support(A) \geqslant minimum\; support$$

  • Confidence, a measure of the certainty of each rule, is ...

$$Confidence(X \to Y) = \frac{Support(X \wedge Y)}{Support(X)}$$
  • Lift, a measure of how $X$ and $Y$ are really related, is ...
$$Lift(X \to Y) = \frac{Support(X \wedge Y)}{Support(X) \ast Support(Y)}$$
  • Association rules is generated using frequent itemsets and minimum confidence.

Excercise 4, Chapter 5

A local retailer has a database that stores 10,000 transactions of last summer. After analyzing the data, a data science team has identified the following statistics:

• {battery} appears in 6,000 transactions.
• {sunscreen} appears in 5,000 transactions.
• {sandals} appearsin 4,000 transactions.
• {bowls}appears in 2,000 transactions.
• {battery, sunscreen} appearsin 1,500 transactions.
• {battery, sandals} appearsin 1,000 transactions.
• {battery, bowls} appears in 250 transactions.
• {battery, sunscreen, sandals} appears in 600transactions.

Answer the following questions:

a. What are the support values of the preceding itemsets?

b. Assuming the minimum support is 0.05, which itemsets are considered frequent?

c. What are the confidence values of {battery}-> { sunscreen} and {battery, sunscreen}->{ sandals} ?Which ofthe two rules is more interesting?

d. List all the candidate rules that can be formed from the statistics. Which rules are considered interesting at the minimum confidence 0.25? Out of these interesting rules, which rule is con- sidered the most useful (that is, least coincidental)?
In [1]:
import pandas as pd
import numpy as np
In [2]:
transactions = 10000
itemsets = [
        ['battery', 6000],
        ['sunscreen', 5000],
        ['sandals', 4000],
        ['bowls', 2000],
        ['battery, sunscreen', 1500],
        ['battery, sandals', 1000],
        ['battery, bowls', 250],
        ['battery, sunscreen, sandals', 600],
        ]
data = np.array(itemsets)
In [3]:
# database
cols = ('itemset', 'appear_count')
count = pd.DataFrame(data, columns=cols)
count
Out[3]:
itemset appear_count
0 battery 6000
1 sunscreen 5000
2 sandals 4000
3 bowls 2000
4 battery, sunscreen 1500
5 battery, sandals 1000
6 battery, bowls 250
7 battery, sunscreen, sandals 600

A) Support values

$Support(itemset) = \frac{itemset}{Transactions}$

In [4]:
# check support value of each itemset
support = lambda x, num: float(x) / num
count['support'] = [support(n, transactions) for n in count.appear_count]
count
Out[4]:
itemset appear_count support
0 battery 6000 0.600
1 sunscreen 5000 0.500
2 sandals 4000 0.400
3 bowls 2000 0.200
4 battery, sunscreen 1500 0.150
5 battery, sandals 1000 0.100
6 battery, bowls 250 0.025
7 battery, sunscreen, sandals 600 0.060

B) Frequent itemset(s)

An itemset A is frequent if:

$support(A) \geqslant minimum\; support$

In [5]:
# check whether an itemset is frequent or not
min_support = 0.05
frequent = lambda x: 'yes' if support(x, transactions) >= min_support else 'no'
count['frequent?'] = [frequent(n) for n in count.appear_count]
In [6]:
# show result
print('minimum support: {}'.format(min_support))
count
minimum support: 0.05
Out[6]:
itemset appear_count support frequent?
0 battery 6000 0.600 yes
1 sunscreen 5000 0.500 yes
2 sandals 4000 0.400 yes
3 bowls 2000 0.200 yes
4 battery, sunscreen 1500 0.150 yes
5 battery, sandals 1000 0.100 yes
6 battery, bowls 250 0.025 no
7 battery, sunscreen, sandals 600 0.060 yes

C) Confidence value

Rule: $$Confidence(X \to Y) = \frac{support(X\wedge Y)}{support(X)}$$

So:

$Confidence( \{battery\} \to \{sunscreen\} ) = \frac{Support( battery \wedge sunscreen )}{Support(battery)} = \frac{0.15}{0.6} = 0.25$

$Confidence( \{battery, sunscreen\} \to \{sandals\} ) = \frac{Support( \{battery, sunscreen\} \wedge sandals )}{Support(\{battery, sunscreen\})} = \frac{0.06}{0.15} = 0.4$

The rule $\{battery, sunscreen\}\to \{sandals\}$ is more interesting with 40% confidence.

In [7]:
# get support value from dataframe
get_support = lambda item: count.support[count.itemset == item].iloc[0]
# calculate confidence 
confidence = lambda support_X, support_XY: support_XY / support_X

battery = get_support('battery')
battery_sunscreen = get_support('battery, sunscreen')
battery_sunscreen_sandals = get_support('battery, sunscreen, sandals')
In [8]:
print(confidence(battery, battery_sunscreen))
0.25
In [9]:
print(confidence(battery_sunscreen, battery_sunscreen_sandals))
0.4

D) Candidate rules

In [10]:
# remember the table, used for the answer below
del count['frequent?']
count
Out[10]:
itemset appear_count support
0 battery 6000 0.600
1 sunscreen 5000 0.500
2 sandals 4000 0.400
3 bowls 2000 0.200
4 battery, sunscreen 1500 0.150
5 battery, sandals 1000 0.100
6 battery, bowls 250 0.025
7 battery, sunscreen, sandals 600 0.060

$C_k$: Set of candidates k-itemset.
$L_k$: Set of k-itemset that satisfies the minimum support "frequent itmesets".

In our case the minimum support = 0.05.

For simplicity, lets use
B: Battery
S: Sunscreen
D: sanDals
W: boWls

$C_1$: {B, S, D, W}
$L_1$: {B, S, D, W}

$C_2$: {BS, BD, BW, SD, SW, DW}
BW support is 0.025 (< 0.05) so we remove it. SD, SW, and DW supports are unknown, so we remove them as well.
$L_2$: {BS, BD}

$C_3$: {BSD}
BSD supports is 0.06 and it's the only itemset remains (no more combinations possible), so it will be the Frequently Bought Together itemset that we will be using to determine the association ruels.
$L_3$: {BSD}

Generating Association Rules from Frequent Itemsets

Procedure:

  1. For each frequent itemset $I$ in $L_3$(in our case, we have one itemset {BSD}), generate all nonempty subsets of $I$.
  2. For every nonempty subset $s$ of $I$, output the rule $s \to (l-s)$, if $\frac{support(l)}{support(s)}$ >= minimum confidence threshold,

Now, to proceed we list all the subsets of $L_3$:

  • {B}
  • {S}
  • {D}
  • {BS}
  • {BD}
  • {SD}

To find the association rules among these subsets, we get the confidence values between each two subsets, as follows:

All Candidate Rules:

$Confidence(B \to SD) = \frac{ Support(B \wedge SD)}{Support(B)} = \frac{0.06}{0.6} = 0.1 = 10\%$

$Confidence(S \to BD) = \frac{0.06}{0.5} = 0.12 = 12\%$

$Confidence(D \to BS) = \frac{0.06}{0.4} = 0.15 = 15\%$

$Confidence(BD \to S) = \frac{0.06}{0.1} = 0.6 = 60\%$

$Confidence(BS \to D) = \frac{0.06}{0.15} = 0.4 = 40\%$

$Confidence(B \to S) = \frac{0.15}{0.6} = 0.25 = 25\%$

$Confidence(B \to D) = \frac{0.1}{0.6} = 0.17 = 17\%$

$Confidence(S \to B) = \frac{0.15}{0.5} = 0.3 = 30\%$

$Confidence(D \to B) = \frac{0.1}{0.4} = 0.25 = 25\%$


Considering the Minimum Confidence value = 0.25 (25%), we are left with the following interesting rules:

$Confidence(BD \to S) = \frac{0.06}{0.1} = 0.6 = 60\%$

$Confidence(BS \to D) = \frac{0.06}{0.15} = 0.4 = 40\%$

$Confidence(B \to S) = \frac{0.15}{0.6} = 0.25 = 25\%$

$Confidence(S \to B) = \frac{0.15}{0.5} = 0.3 = 30\%$

$Confidence(D \to B) = \frac{0.1}{0.4} = 0.25 = 25\%$

So the rules, in English, are:

  • There is 60% chance that a customer will buy Sunscreen, if the customer buy Battery and Sandals together.
  • There is 40% chance that a customer will buy Sandals, if the customer buy Battery and Sunscreen together.
  • There is 25% chance that a customer will buy Sandals, if the customer buy Battery only.
  • There is 30% chance that a customer will buy Battery, if the customer buy Sunscreen only.
  • There is 25% chance that a customer will buy Battery, if the customer buy Sandals only.

We will use Lift rule, to decide on which rule (of the interesting rules) is not coincidental
$Lift(X \to Y) = \frac{Support(X \wedge Y)}{Support(X) \ast Support(Y)}$

As follows:

$Lift(BD \to S) = \frac{0.06}{0.1 \ast 0.5} = 1.2$

$Lift(BS \to D) = \frac{0.06}{0.15 \ast 0.4} = 1$

$Lift(B \to S) = \frac{0.15}{0.6 \ast 0.5} = 0.5$

$Lift(S \to B) = \frac{0.15}{0.5 \ast 0.6} = 0.5$

$Lift(D \to B) = \frac{0.1}{0.4 \ast 0.6} = 0.42$

In this case, we could say that the first rule ($BD \to S$) is the most useful (least coincidental), as it has highest confidence 60% and Lift 1.2.