Tuesday, September 30, 2008

INVERTED HASHING and PRUNING ALGORITHM

For transactions with many items, i.e.: long transactions, use of inverted hashing and pruning algorithm is suggested. In case of long transactions, scanning the database consumes quite some time and so inverted hashing and pruning algorithm is more appropriate than the Apriori algorithm as it reduces the number of candidate item sets for the second and subsequent passes and thus saves time on scanning the transactions. Inverted hashing and pruning uses hash table to eliminate some of the candidate itemsets.

In Inverted hashing and pruning, for each item the transaction ID (TID) of the transaction containing the item are hashed into a hash table related to the specific item and is called as TID Hash Table (THT) for that item. For every item its THT will contain all the possible hash values for the transactions. Now when a particular item occurs in a transaction, the TID of that transaction is hashed to get a hash value. Then in the corresponding entry of the THT the count is incremented by one. During the first pass of the database as we count the number of occurrences for each item its THT will be generated. Eventually at the end of the first pass we can eliminate the THT of the items which are not in the set of frequent 1-itemsets and the remaining THT can prune some of the candidate k+1 itemsets. For the remaining THT we consider all possible k+1 itemset combinations and accordingly try to find out the support for that particular itemset and if it comes out to be less than minimum support we can eliminate that itemset. The items in each itemset are stored in a lexical order and itemsets are lexically ordered to have effective formation of candidate itemsets. The advantage with Inverse hashing and pruning is that between two consecutive passes a number of TID hash tables are eliminated and most of the memory used for holding TID hash tables in the earlier pass is available to hold the candidate itemset in the subsequent pass. Thus for transactions that have more number of items Inverse hashing and pruning is more advantageous as compared to Apriori algorithm.

No comments: