Tuesday, September 30, 2008

APRIORI, APRIORITID and APRIORI HYBRID

With large databases, it is required to have algorithms that process the data with high speed. For discovering association rules between items in a large database, we have had algorithms like AIS algorithm and SETM algorithm but the speed with which they processed the data in the databases was not fast and also they tend to find a lot of itemsets which were small and thus end up wasting time. Thus in the paper written by Agarwal, R. and Srikant, R., the authors have proposed two new algorithms Apriori and AprioriTid algorithms to overcome these problems. The Apriori algorithm makes multiple passes over the database. In the first pass individual item support is counted and the items that have support greater than or equal to minimum support are considered as large items. For each pass k after that, the large itemsets in the previous pass is grouped in sets of k items and these form candidate itemsets. The support for the different candidate itemsets is counted and if the support is found to be greater than minimum support then that itemset is considered to be large. This process continues until the large itemset in a particular pass comes out to be an empty set.

Similar to the Apriori algorithm, the AprioriTid algorithm also uses the apriori-gen function to determine the candidate itemsets but the difference is that the database is not used for counting support after the first pass. Instead set of candidate itemsets is used for this purpose for k>1. In case a transaction does not have any candidate k-itemset then the set of candidate itemsets would not have any entry for that transaction which will eventually decrease the number of transaction in the set containing the candidate itemsets as compared to the database. As value of k increases each entry will be smaller than the corresponding transactions because the number of candidates in the transactions will decrease. Apriori performs better than AprioriTid in the initial passes but in the later passes AprioriTid has better performance than Apriori. Due to this reason we can use another algorithm called Apriori Hybrid algorithm in which Apriori is used in the initial passes but we switch to AprioriTid in the later passes.

Working Example for Apriori:

Database:
TID
ITEMS
100
XZ
200
MXZ
300
MXYZ
400
MN

Support = 50% of total number of transactions
=50% of 4
= 2.

Iteration-1:

C1 (BAR)

TID
SET OF ITEMSETS
100
{X}, {Z}
200
{M},{X},{Z}
300
{M},{X},{Y},{Z}
400
{M},{N}



L1
ITEMSET
SUPPORT
{X}
3
{M}
3
{Z}
3

As shown in the above tables, the first step is to generate C1 (BAR) which contains all 1-itemsets from the database and are kept with the Tid’s of the generating transactions. The next step is to generate all large 1-itemsets from C1 (BAR). Thus, at the end of iteration 1 we get {X}, {M} and {Z} as the large 1-itemsets.

Iteration-2:
C2

ITEMSET
SUPPORT
{MX}
2
{XZ}
3
{MZ}
2


C2 (BAR)

TID
ITEMS
100
{XZ}
200
{MX},{XZ},{MZ}
300
{MX},{XZ},{MZ}
400
------------

L2

ITEMSET
SUPPORT
{MX}
2
{XZ}
3
{MZ}
2

In the Iteration -2 the first step is to find out C2 which is the set of candidate large 2-itemsets. For this we call apriori-gen function which processes L1 in order to get C2. In C2 we find the support of all the candidate large 2-itemsets. In C2 (BAR) we have now all those large 2-itemsets associated with their transaction Tid’s. We notice that Tid 400 has not got any candidate large 2-itemsets. From C2 (BAR) we process L2 which will have only large 2-itemsets.

Iteration -3:

C3

ITEMSET
SUPPORT
{MXZ}
2

C3 (BAR)

TID
ITEMS
100
------------
200
{MXZ}
300
{MXZ}
400
------------

L3
ITEMSET
SUPPORT
{XMZ}
2




In the Iteration -3, again the first step is to find out C3 which is the set of candidate large 3-itemsets from L2. For this we call apriori-gen function which processes L2 in order to get C3. In C3 we find the support of all the candidate large 3-itemsets. In C3 (BAR) we have now all those large 3-itemsets associated with their transaction Tid’s. We notice that only Tid 200 and 300 have got candidate large 3-itemsets. From C3 (BAR) we process L3 which will have only large 3-itemsets. Thus at the end of iteration-3 we are left with only one large itemset {XMZ} and so we cannot have any further candidate large 4-itemset and thus the algorithm is terminated.

1 comment:

Rishik Mani Tripathi said...

Finally some easy explanation to understand the TID algorithm. Thank you very much. Cheers!!