HW 6

Big Data Analytics
Team 3
Answered by: Aziz

Extra Exercise

Given the following 10 grocery store transactions, use appropriate association rule thresholds to find a few interesting rules both by hand and by using R.

  1. beer, diapers
  2. soda, potato chips, hamburger meat, milk, eggs
  3. coffee, eggs
  4. beer, bread, cheese, ham
  5. diapers, beer, potato chips
  6. cheese, ham, beer
  7. ham, cheese, bread, coffee, milk
  8. soda, cheese, bread, ham
  9. coffee, hamburger meat
  10. eggs, diapers, beer
In [1]:
dataset = """
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
4. beer, bread, cheese, ham
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
8. soda, cheese, bread, ham
9. coffee, hamburger meat
10. eggs, diapers, beer
"""
In [2]:
from collections import Counter
import re
import pandas as pd
In [3]:
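# Prepare the purchases table.
# Note: this regex matches single words, so multi-word items such as
# "potato chips" and "hamburger meat" are split into separate tokens.
tokens = lambda text: re.findall('[a-z]+', text.lower())

# Skip empty lines; compare strings with != rather than the identity test `is not`.
trans_lists = [tokens(l) for l in dataset.split('\n') if l != '']
trans_lists = [','.join(l) for l in trans_lists]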
transactions = pd.DataFrame(trans_lists, columns=['purchase'])
transactions.to_csv('extra_exercise.csv', index=False, header=False)
transactions
Out[3]:
purchase
0 beer,diapers
1 soda,potato,chips,hamburger,meat,milk,eggs
2 coffee,eggs
3 beer,bread,cheese,ham
4 diapers,beer,potato,chips
5 cheese,ham,beer
6 ham,cheese,bread,coffee,milk
7 soda,cheese,bread,ham
8 coffee,hamburger,meat
9 eggs,diapers,beer
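The word-level tokenizer turns "potato chips" and "hamburger meat" into two tokens each, as rows 1, 4 and 8 of Out[3] show. A minimal sketch of an alternative parser that splits on commas and keeps multi-word items intact is given here. The names `raw`, `parse_transactions` and `baskets` are hypothetical and not part of the original notebook.

```python
import re

raw = """
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
4. beer, bread, cheese, ham
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
8. soda, cheese, bread, ham
9. coffee, hamburger meat
10. eggs, diapers, beer
"""

def parse_transactions(text):
    """Return one list of item names per non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the leading "N. " transaction number.
        body = re.sub(r'^\d+\.\s*', '', line)
        # Split on commas only, so "potato chips" stays one item.
        rows.append([item.strip().lower() for item in body.split(',')])
    return rows

baskets = parse_transactions(raw)
print(baskets[1])
```

With this parser the second transaction has five items instead of seven, so "potato chips" and "hamburger meat" are each counted as a single item.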
In [4]:
no_transactions = 10
In [5]:
support = lambda x, num: float(x) / num
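For reference, the `support` helper corresponds to the usual definition of support: the fraction of transactions that contain an itemset. The symbols $X$ (an itemset) and $T$ (the set of transactions) are notation introduced here for illustration. As a worked example, beer appears in 5 of the 10 transactions listed above.

```latex
\[
\operatorname{support}(X)
  = \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert},
\qquad
\operatorname{support}(\{\text{beer}\}) = \frac{5}{10} = 0.5
\]
```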
In [6]:
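# Count how often each item token appears across all transactions
# (no item repeats within a transaction, so this equals the number of
# transactions containing it).
counter = dict(Counter(tokens(dataset)))
# list(...) keeps this working on Python 3, where items() returns a view.
count = pd.DataFrame(list(counter.items()), columns=['item', 'no_count'])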
count.to_csv('count.csv', index=False)
count
Out[6]:
item no_count
0 cheese 4
1 coffee 3
2 hamburger 2
3 ham 4
4 potato 2
5 eggs 3
6 diapers 3
7 beer 5
8 soda 2
9 bread 3
10 chips 2
11 milk 2
12 meat 2
In [7]:
# calculate the Support of each item
count['support'] = [support(n, no_transactions) for n in count.no_count]
count
Out[7]:
item no_count support
0 cheese 4 0.4
1 coffee 3 0.3
2 hamburger 2 0.2
3 ham 4 0.4
4 potato 2 0.2
5 eggs 3 0.3
6 diapers 3 0.3
7 beer 5 0.5
8 soda 2 0.2
9 bread 3 0.3
10 chips 2 0.2
11 milk 2 0.2
12 meat 2 0.2
With a minimum support of 0.3, the following items remain as frequent 1-itemsets:
In [8]:
min_support = 0.3
count[count.support >= min_support]
Out[8]:
item no_count support
0 cheese 4 0.4
1 coffee 3 0.3
3 ham 4 0.4
5 eggs 3 0.3
6 diapers 3 0.3
7 beer 5 0.5
9 bread 3 0.3
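One possible next step, sketched here and not taken from the original notebook, is to build candidate pairs from the frequent items, keep the pairs that meet the same minimum support, and compute the confidence of each rule as support(pair) / support(left-hand side). The sketch assumes Python 3 division and retypes the ten transactions as sets. All names (`baskets`, `pair_counts`, `frequent_pairs`, `confidence`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The ten transactions from the exercise, retyped as sets.
baskets = [
    {"beer", "diapers"},
    {"soda", "potato chips", "hamburger meat", "milk", "eggs"},
    {"coffee", "eggs"},
    {"beer", "bread", "cheese", "ham"},
    {"diapers", "beer", "potato chips"},
    {"cheese", "ham", "beer"},
    {"ham", "cheese", "bread", "coffee", "milk"},
    {"soda", "cheese", "bread", "ham"},
    {"coffee", "hamburger meat"},
    {"eggs", "diapers", "beer"},
]
n = len(baskets)
min_support = 0.3

# Pass 1: frequent single items.
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

# Pass 2: count only pairs whose members are both frequent.
pair_counts = Counter()
for b in baskets:
    kept = sorted(b & frequent_items)  # sorted so each pair has one canonical order
    pair_counts.update(combinations(kept, 2))

frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

# Confidence of lhs -> rhs is count(pair) / count(lhs).
confidence = {}
for (a, b), c in pair_counts.items():
    if c / n < min_support:
        continue
    confidence[(a, b)] = c / item_counts[a]
    confidence[(b, a)] = c / item_counts[b]

for (lhs, rhs), conf in sorted(confidence.items()):
    print(f"{lhs} -> {rhs}: confidence {conf:.2f}")
```

On this data the sketch keeps four pairs: (beer, diapers), (bread, cheese), (bread, ham) and (cheese, ham). The rule diapers -> beer has confidence 1.0, while the reverse rule beer -> diapers has confidence 0.6, which illustrates why the direction of a rule matters.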