Direct Hashing and Pruning (DHP) differs from the Apriori algorithm in that it uses a hashing technique to discard unwanted itemsets before generating the next set of candidate itemsets. In Apriori, determining all the large itemsets among the candidate large itemsets in each pass requires an expensive scan of the database. DHP reduces the number of candidate itemsets and therefore runs much faster than Apriori. However, because several itemsets can hash to the same bucket, DHP generates a number of false positives, and Perfect Hashing and Pruning (PHP) has been suggested as an improvement over it.

In PHP, the first pass creates a hash table whose size equals the number of individual items in the database, and each item is mapped to its own location in the table. After the first pass the hash table therefore holds the exact number of occurrences of each item. The algorithm then removes every transaction that contains no frequent item and trims the infrequent items from the remaining transactions. At the same time, candidate k-itemsets are generated and their occurrences counted. At the end of each pass the algorithm thus produces a pruned database, the occurrences of the candidate k-itemsets, and the set of frequent k-itemsets; it terminates when no new frequent k-itemsets are found. PHP improves on DHP because it produces no false positives, which eliminates the extra time needed to verify the count of each itemset, and it also reduces the size of the database.
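To make the flow concrete, below is a minimal Python sketch of PHP written for this post; the function name php and all identifiers are illustrative, not taken from the original paper. A Python dict stands in for the perfect hash table, giving every candidate its own slot so the counts are exact and no false positives arise. For brevity, the candidate k-itemsets are taken as all k-subsets of each pruned transaction rather than being joined from the frequent (k-1)-itemsets.

from collections import Counter
from itertools import combinations

def php(transactions, min_support):
    db = [frozenset(t) for t in transactions]   # the database, pruned pass by pass
    k, frequent = 1, {}
    while db:
        # Hk: count every candidate k-itemset in the current pruned database.
        # Each candidate gets its own dict slot, so the counts are exact.
        hk = Counter(frozenset(c)
                     for t in db
                     for c in combinations(sorted(t), k))
        # Fk: keep only the candidates that meet the minimum support.
        fk = {s: n for s, n in hk.items() if n >= min_support}
        if not fk:                  # no new frequent k-itemsets: terminate
            break
        frequent[k] = fk
        # Dk: trim items that appear in no frequent k-itemset, then drop
        # transactions that contain no frequent k-itemset at all.
        keep = frozenset().union(*fk)
        db = [t & keep for t in db]
        db = [t for t in db
              if any(frozenset(c) in fk
                     for c in combinations(sorted(t), k))]
        k += 1
    return frequent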
Working Example for Perfect Hashing and Pruning (PHP):
Dk = pruned database after pass k
Hk = occurrences (supports) of the candidate k-itemsets
Fk = set of frequent k-itemsets
Database:
TID    ITEMS
100    XZ
200    MXZ
300    MXYZ
400    MN
Minimum support = 50% of the total number of transactions = 50% of 4 = 2.
Iteration-1:
H1:
LOCATION    ITEMSET    SUPPORT
1           X          3
2           Z          3
3           M          3
4           Y          1
5           N          1
F1:
ITEMSET    SUPPORT
H(1)       3
H(2)       3
H(3)       3
D1:
TID    ITEMS
100    XZ
200    MXZ
300    MXZ
400    M
In Iteration-1, H1 maps each individual item to its own location, as shown in the table, and we find the support of each. In F1 we keep only the items that are large; each entry is referred to as H(n), where n is the item's location in the hash table. From F1 we derive the pruned database D1, which contains only large items. Since Y and N turn out to be small (infrequent), their entries are eliminated from D1.
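The first pass can be reproduced in a few lines of Python (a sketch matching the tables above; the variable names are mine):

from collections import Counter

db = [set("XZ"), set("MXZ"), set("MXYZ"), set("MN")]
min_support = 2

h1 = Counter(i for t in db for i in sorted(t))        # H1: exact per-item counts
f1 = {i for i, n in h1.items() if n >= min_support}   # F1: X, Z and M are large
d1 = [t & f1 for t in db]                             # D1: Y and N trimmed away

print(h1)   # Counter({'X': 3, 'Z': 3, 'M': 3, 'Y': 1, 'N': 1})

Note that TID 400 survives in D1 as {M}, because M by itself is still a large 1-itemset.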
Iteration-2:
H2:
LOCATION    ITEMSET    SUPPORT
1           MX         2
2           XZ         3
3           MZ         2
F2:
ITEMSET    SUPPORT
H(1)       2
H(2)       3
H(3)       2
D2:
TID    ITEMS
100    XZ
200    MXZ
300    MXZ
In Iteration-2, H2 maps each candidate large 2-itemset to its own location, as shown in the table, and we find the support of each. In F2 we keep only the 2-itemsets that are large. From F2 we derive the pruned database D2, which contains only large itemsets. Since TID 400 no longer contains any large 2-itemset, its entry is eliminated from D2.
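In code, the second pass over D1 looks like this (again a sketch; combinations enumerates the candidate pairs counted in H2):

from collections import Counter
from itertools import combinations

d1 = [set("XZ"), set("MXZ"), set("MXZ"), set("M")]    # pruned database D1
h2 = Counter(frozenset(p) for t in d1
             for p in combinations(sorted(t), 2))     # H2: MX=2, XZ=3, MZ=2
f2 = {s: n for s, n in h2.items() if n >= 2}          # F2: all three pairs are large
d2 = [t for t in d1
      if any(frozenset(p) in f2
             for p in combinations(sorted(t), 2))]    # TID 400 ({M}) yields no pair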
Iteration-3:
H3:
LOCATION    ITEMSET    SUPPORT
1           MXZ        2
F3:
ITEMSET    SUPPORT
H(1)       2
D3:
TID    ITEMS
200    MXZ
300    MXZ
In Iteration-3, H3 maps each candidate large 3-itemset to its own location, as shown in the table, and we find the support of each. In F3 we keep only the 3-itemsets that are large. From F3 we derive the pruned database D3, which contains only large itemsets. Since F3 contains only one itemset, no candidate large 4-itemsets can be generated and the algorithm terminates.
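Running the php sketch from the top of this post on the example database reproduces the three result tables:

db = [["X", "Z"], ["M", "X", "Z"], ["M", "X", "Y", "Z"], ["M", "N"]]
for k, fk in php(db, min_support=2).items():
    print(k, {"".join(sorted(s)): n for s, n in fk.items()})
# 1 {'X': 3, 'Z': 3, 'M': 3}
# 2 {'XZ': 3, 'MX': 2, 'MZ': 2}
# 3 {'MXZ': 2}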