HW 6

Big Data Analytics
Team 3

Extra Exercise

Given the following 10 grocery store transactions, use appropriate association rule thresholds to find a few interesting rules both by hand and by using R.

1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
9. coffee, hamburger meat
10. eggs, diapers, beer
In [1]:
dataset = """
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
9. coffee, hamburger meat
10. eggs, diapers, beer
"""

In [2]:
from collections import Counter
import re
import pandas as pd

In [3]:
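# prepare purchases table
# note: '[a-z]+' matches single words, so multi-word items such as
# 'potato chips' are split into separate tokens
tokens = lambda text: re.findall('[a-z]+', text.lower())

# skip blank lines (compare by value with !=, not by identity with `is not`)
trans_lists = [tokens(l) for l in dataset.split('\n') if l != '']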
trans_lists = [','.join(l) for l in trans_lists]
transactions = pd.DataFrame(trans_lists, columns=['purchase'])
transactions

Out[3]:
purchase
0 beer,diapers
1 soda,potato,chips,hamburger,meat,milk,eggs
2 coffee,eggs
4 diapers,beer,potato,chips
5 cheese,ham,beer
8 coffee,hamburger,meat
9 eggs,diapers,beer
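The table above shows that the word-level tokenizer breaks "potato chips" and "hamburger meat" into two tokens each. One possible alternative, given here only as a sketch and not as the notebook's method, is to strip the leading transaction label and split on commas. The names `raw`, `parse_line`, `baskets`, and `item_counts` are hypothetical, and `raw` contains only the eight transactions listed in the exercise text.

```python
import re
from collections import Counter

# Same text as `dataset` in In [1] (only the transactions listed there).
raw = """
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
9. coffee, hamburger meat
10. eggs, diapers, beer
"""

def parse_line(line):
    """Drop the leading 'N.' label, then split on commas so that
    multi-word items stay whole."""
    body = re.sub(r'^\s*\d+\.\s*', '', line)
    return [item.strip().lower() for item in body.split(',') if item.strip()]

# One list of items per non-blank line
baskets = [parse_line(line) for line in raw.splitlines() if line.strip()]
print(baskets[1])
# ['soda', 'potato chips', 'hamburger meat', 'milk', 'eggs']

# Count how many listed transactions mention each item
item_counts = Counter(item for basket in baskets for item in basket)
print(item_counts['potato chips'])  # 2
```

With this parsing, "potato chips" is counted as one item instead of contributing one count each to "potato" and "chips".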
In [4]:
no_transactions = 10

In [5]:
support = lambda x, num: float(x) / num
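
The `support` helper divides a raw count by the number of transactions. The same definition extends from single items to itemsets: count the transactions that contain every item in the set. The following is a minimal sketch under stated assumptions, not part of the original notebook. `itemset_support` and `baskets` are hypothetical names, `baskets` holds only the eight transactions listed in the exercise text, and the denominator stays at 10 to match the stated number of transactions.

```python
# Hypothetical helper: support of an itemset is the number of transactions
# containing every item in the set, divided by the total number of transactions.
def itemset_support(itemset, baskets, num):
    hits = sum(1 for basket in baskets if set(itemset) <= basket)
    return hits / num

# Assumption: only the eight transactions shown in the exercise text.
baskets = [
    {"beer", "diapers"},
    {"soda", "potato chips", "hamburger meat", "milk", "eggs"},
    {"coffee", "eggs"},
    {"diapers", "beer", "potato chips"},
    {"cheese", "ham", "beer"},
    {"ham", "cheese", "bread", "coffee", "milk"},
    {"coffee", "hamburger meat"},
    {"eggs", "diapers", "beer"},
]

print(itemset_support({"diapers"}, baskets, 10))          # 0.3
print(itemset_support({"beer", "diapers"}, baskets, 10))  # 0.3
```

Both values are 0.3 because every listed transaction that contains diapers also contains beer.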

In [6]:
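# count how many times each item appears across all transactions
counter = dict(Counter(tokens(dataset)))
# list(...) keeps the constructor call portable across Python 2 and 3
count = pd.DataFrame(list(counter.items()), columns=['item', 'no_count'])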
count.to_csv('count.csv', index=False)
count

Out[6]:
item no_count
0 cheese 4
1 coffee 3
2 hamburger 2
3 ham 4
4 potato 2
5 eggs 3
6 diapers 3
7 beer 5
8 soda 2
10 chips 2
11 milk 2
12 meat 2
In [7]:
# calculate the Support of each item
count['support'] = [support(n, no_transactions) for n in count.no_count]
count

Out[7]:
item no_count support
0 cheese 4 0.4
1 coffee 3 0.3
2 hamburger 2 0.2
3 ham 4 0.4
4 potato 2 0.2
5 eggs 3 0.3
6 diapers 3 0.3
7 beer 5 0.5
8 soda 2 0.2
10 chips 2 0.2
11 milk 2 0.2
12 meat 2 0.2
With a minimum support of 0.3, we end up with the following frequent items
In [8]:
min_support = 0.3
count[count.support >= min_support]

Out[8]:
item no_count support
0 cheese 4 0.4
1 coffee 3 0.3
3 ham 4 0.4
5 eggs 3 0.3
6 diapers 3 0.3
7 beer 5 0.5
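
A natural next step toward rules is to build candidate pairs from the frequent items only, keep the pairs that meet the same threshold, and then score a rule by confidence and lift. The sketch below is one way this might look and is not the notebook's own continuation. The single-item counts are copied from Out[8]. `baskets` is a hypothetical name and contains only the eight transactions listed in the exercise text, so pair counts are lower bounds on the full-data counts. The {beer, diapers} count is exact, because Out[8] reports diapers in 3 transactions and all three appear in the list.

```python
from itertools import combinations

no_transactions = 10
min_support = 0.3

# Single-item counts copied from Out[8]
item_counts = {"beer": 5, "cheese": 4, "coffee": 3,
               "diapers": 3, "eggs": 3, "ham": 4}

# Assumption: only the eight transactions shown in the exercise text.
baskets = [
    {"beer", "diapers"},
    {"soda", "potato chips", "hamburger meat", "milk", "eggs"},
    {"coffee", "eggs"},
    {"diapers", "beer", "potato chips"},
    {"cheese", "ham", "beer"},
    {"ham", "cheese", "bread", "coffee", "milk"},
    {"coffee", "hamburger meat"},
    {"eggs", "diapers", "beer"},
]

# Candidate pairs use frequent items only: a pair cannot be frequent
# unless both of its items are frequent.
pair_counts = {
    pair: sum(1 for basket in baskets if set(pair) <= basket)
    for pair in combinations(sorted(item_counts), 2)
}
frequent_pairs = {
    pair: n for pair, n in pair_counts.items()
    if n / no_transactions >= min_support
}
print(frequent_pairs)  # {('beer', 'diapers'): 3}

# Rule diapers -> beer
pair_n = frequent_pairs[("beer", "diapers")]
confidence = pair_n / item_counts["diapers"]
lift = pair_n * no_transactions / (item_counts["diapers"] * item_counts["beer"])
print(confidence, lift)  # 1.0 2.0
```

A confidence of 1.0 means every transaction with diapers also contains beer, and a lift of 2.0 means that co-occurrence is twice what independence would predict.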