Tuesday, September 30, 2008

CONCLUSION

Thus it can be concluded that data mining is an ongoing activity that involves recognizing patterns in datasets, and that Association Rule Discovery is a helpful and important data mining technique when used effectively. The Apriori and AprioriTid algorithms remain among the most influential work on association rule mining; many of the other algorithms are modifications of them, and some add hashing and pruning methods. Finally, the paper examined two situations in which the association rule technique has been applied to real-world data. It is worth noting that working with real data, as opposed to synthetically generated data, involves additional tasks such as pre-processing and noise filtering before an association algorithm can be applied effectively.

CASE STUDY 2

This case study explains the use of association rules for outlining the main spots and zones in Belgium that are highly prone to accidents. According to the statistics, Belgium recorded around 50,000 traffic accidents in a year, injuring roughly 70,000 people and killing approximately 1,500. These accidents not only increased insecurity on the roads but also carried considerable economic costs. Traffic safety therefore became a serious concern for the Belgian government.

Initially, statistical methods were used to analyze the registered accidents, but it was soon realized that association rules are a better alternative for finding the relevant variables that help explain the circumstances in which the accidents took place.


Data:

The data under study concerns the traffic accidents that occurred in the region of Flanders (Belgium) in 1999. It was obtained from the National Institute of Statistics (NIS) and contained 34,353 accident records. Each record described the course of the accident along with various traffic, environmental, human, road and geographical conditions, captured in around 45 attributes per accident, including maximum speed, weather, time of the accident, road obstacles, alcohol, location, and so on. A first inspection showed that some attributes have a constant value across all accidents; this has no effect on the discovered association rules, because association rule algorithms simply count the frequency of attribute values in the dataset.


Mining Process:

The mining process was divided into three steps: a pre-processing step, in which the data was put into a form usable by the association algorithm; a mining step, in which the association rules were generated; and a post-processing step, in which the interesting association rules were identified.

These steps in detail are as follows:

a) Pre-processing Step:

The very first pre-processing step assigned a location parameter to each accident so that high-frequency accident locations could be distinguished from low-frequency ones. This location parameter corresponds to a unique geographical location. Accidents that took place on a numbered district or provincial road were assigned the unique road identification number together with the kilometre mark; the remaining accidents were assigned a location using the name of the street and the city in which the accident took place.

The second step selected two different datasets in order to compare associations between accident attributes. The first analysis was carried out on the data for high-frequency accident locations and the second on the data for low-frequency accident locations; comparing the two analyses helps determine which accident variables are prevalent at high-frequency locations. For the first analysis, locations with at least five accidents were regarded as highly accident-prone, which yielded a total of 3,368 traffic accidents. The remaining 30,985 accident records were used in the second analysis, for the low-frequency accident locations.
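To make the split concrete, the following is a minimal sketch (not taken from the study) of how the two datasets could be separated, assuming each accident record carries the derived location parameter and using the five-accidents-per-location criterion described above:

from collections import Counter

def split_by_location_frequency(accidents, threshold=5):
    # Count accidents per derived location parameter.
    counts = Counter(a["location"] for a in accidents)
    high = [a for a in accidents if counts[a["location"]] >= threshold]
    low = [a for a in accidents if counts[a["location"]] < threshold]
    return high, low

# hypothetical records; only the 'location' key matters here
records = [{"location": "N9-km12"}, {"location": "N9-km12"}, {"location": "Main St, Gent"}]
high_freq, low_freq = split_by_location_frequency(records, threshold=2)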

In the next step, the continuous variables in the dataset were discretized so that association rules could be generated in the second step of the mining process. Each continuous attribute was divided into intervals, turning it into a discrete attribute; the intervals were either obtained from expert knowledge or generated with equal frequency binning.

Lastly, the nominal attributes were pre-processed by converting their values into binary form, with one binary item per attribute value.
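As an illustration of these two pre-processing transformations, here is a small sketch (attribute names and values are invented for the example, not taken from the NIS data):

def equal_frequency_bins(values, n_bins):
    # Choose cut points so that each bin receives roughly the same number of values.
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def discretize(value, edges):
    # Map a continuous value to the index of its interval.
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def binarize(record, attribute, categories):
    # One binary item per (attribute, value) pair, as required by the rule miner.
    return {f"{attribute}={c}": int(record[attribute] == c) for c in categories}

speeds = [30, 50, 50, 70, 90, 90, 120, 120]
edges = equal_frequency_bins(speeds, 3)                       # e.g. [50, 90]
print(discretize(70, edges))                                  # interval 1
print(binarize({"weather": "rain"}, "weather", ["dry", "rain", "snow"]))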

b) Association Rules Generation:

Minimum support (minsup) and minimum confidence (minconf) are the foremost requirements for generating association rules. By trial and error, the minimum support for the analysis of the two datasets was set to 5% and the minimum confidence to 30%. This means that only those itemsets that occur in at least 165 accidents are considered frequent; the minimum support was deliberately set high so that only non-trivial rules are generated. Also, only those rules are considered reliable whose consequent occurs at least once out of every three times the antecedent appears.

After applying the association algorithm to both datasets, the results for the high frequency accident locations included "187,829 frequent itemsets of maximum size 4, for which 598,584 association rules could be generated". The analysis of the second dataset, for the low frequency accident locations, yielded "183,730 frequent itemsets of maximum size 4, for which 575,974 association rules could be generated".


c) Post-processing of Association Rules:

The last step in the mining process was post-processing, in which the association rules generated in the previous step were evaluated against interestingness criteria in order to narrow down the large number of rules initially produced. After post-processing with the lift value and eliminating trivial rules, only 14,690 association rules remained for the high frequency accident locations and 77,282 for the low frequency accident locations. The problem of finding factors that discriminate between high and low frequency accident locations still remained, however, so a measure based on the deviation in support of a rule between the two accident datasets was used to evaluate its interestingness. This measure is as follows:

"I = (Sh - Sl) / max{Sl, Sh}" [pdf10]


Here I stands for interest. "The numerator Sh - Sl measures the difference in support for the rule between the high frequency accident dataset (Sh) and the low frequency accident dataset (Sl). The expression max{Sl, Sh} is called the normalizing factor, as it normalizes the interestingness measure onto the scale [-1, 1]".
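The following sketch shows both post-processing measures side by side (variable names are mine; supports are expressed as fractions of the respective datasets):

def lift(support_xy, support_x, support_y):
    # Lift of X => Y: observed co-occurrence relative to what independence would predict.
    return support_xy / (support_x * support_y)

def interestingness(s_high, s_low):
    # I = (Sh - Sl) / max(Sl, Sh); lies in [-1, 1], positive when the rule is
    # stronger in the high frequency accident dataset.
    return (s_high - s_low) / max(s_low, s_high)

print(lift(0.08, 0.20, 0.25))        # 1.6: positive association
print(interestingness(0.12, 0.04))   # ~0.67: rule is more typical of black spots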

Results:

After analyzing the two datasets, the association rules common to both high and low frequency accident locations were further post-processed using the interestingness measure, which resulted in a total of 50 discriminating association rules. The results showed that a rule with a high interestingness value may have a low lift value and vice versa; the two measures are not strongly correlated. Exhibit 3 and Exhibit 4 show that rules with a high lift value but a low interestingness value were common to both high and low frequency accident locations.

Exhibit 4 divides the association rules into three categories: highest interestingness, high lift value and low lift value. In the first section, the 10 rules with the highest interestingness values indicated that geographical conditions are among the most discriminating variables between high and low frequency accident locations; for example, roadways outside the inner city with separated lanes are characteristic of black spots and black zones. The results also indicated that the drivers involved in these accidents were typically between 30 and 45 years of age. The second section contains 10 association rules with high lift values; these rules describe accident characteristics such as the number of people involved and apply more to low frequency than to high frequency accident locations. The 10 association rules in the third section, based on small lift values, show that those accident patterns mainly correspond to human characteristics and discriminate little between high and low frequency accident locations. Observing the support across all three parts of the table, it was clear that support remains stable for the low frequency accident locations, whereas for the high frequency locations it is quite high in the first section and quite low in the second and third. Therefore, an increase in interest value, together with the discriminating character of these rules, points to features contributing to the occurrence of accidents at high frequency locations. In addition, some association rules relate mainly to high frequency accident locations and do not apply to low frequency locations at all; these are called discriminating characteristics between high and low frequency accident locations.

Conclusion:

The case study concluded that the analysis of the two datasets helped identify combinations of attributes that together account for the occurrence of accidents at high frequency accident locations. However, much remains to be investigated, including the extent of the casualties resulting from these accidents, and the study opens up future research on the factors that discriminate between high and low frequency accident locations. For now, the main suggestion was to deploy dedicated traffic police in areas that are highly prone to accidents under the specific circumstances in which those accidents occur; these circumstances were described in the form of rules in the tables.

CASE STUDY 1

Retail organizations generate large volumes of POS (Point of Sale) data through their daily transactions. When collected and analyzed, this data can be used to make important business decisions and to plan strategies for innovation and competitive advantage. Since it is not possible to analyze this data manually, techniques such as data mining are used, together with an appropriate framework, tool or environment. In this case study we discuss the use of association rules to find the associations that exist between products in market basket data.

The association rules technique has been widely used in the retail industry under the name "Market Basket Analysis". When applied well, it can deliver measurable benefits to an organization, such as improved profitability and improved quality of service. Transactional data can, however, be quite challenging for data mining because of its:

Massiveness: the transactional data collected is vast, with thousands of transactions.
Sparseness: a basket contains only a small fraction of the total possible items.
Heterogeneity: purchasing behavior varies across individuals, and the purchasing pattern of a single individual varies over time.

Strategy:
Due to the massive amount of data collected, it is difficult to explore it for analysis; as many as 4 trillion possible rules can be generated from 10,000 transactions, and if transactions are accumulated over a longer period the collected data becomes even harder to analyze. To overcome this, a divide-and-conquer strategy based on problem decomposition, solution and aggregation can be powerful: data is collected over smaller time intervals and analyzed interval by interval, and the outputs from these small intervals are then combined to solve the entire problem. This is done by creating a separate database to store the results collected over the smaller intervals and then retrieving the interesting rules from it. A rule is considered interesting if it remains stable and satisfies a minimum threshold within a specified number of intervals. This strategy allows interesting rules with good support and confidence to be extracted on a daily basis, and it allows the raw data to be discarded after analysis, reducing unwanted data in the database.
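A minimal sketch of this aggregation step, assuming the daily mining itself is done by any Apriori-style miner that yields (rule, support, confidence) triples, could look like this:

from collections import defaultdict

def accumulate_daily_rules(daily_rule_sets, minsup, minconf, min_intervals):
    # rule -> list of (support, confidence) pairs, one per interval it qualified in
    seen = defaultdict(list)
    for rules in daily_rule_sets:                 # one iterable of rules per day
        for rule, support, confidence in rules:
            if support >= minsup and confidence >= minconf:
                seen[rule].append((support, confidence))
    # A rule is "interesting" if it stays above the thresholds in enough intervals.
    return {rule: stats for rule, stats in seen.items() if len(stats) >= min_intervals}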

This strategy was applied to two supermarket stores in Porto Alegre, Brazil. Data corresponding to the purchases made by consumers over a period of one hundred and twenty days were used.

Mining Process:
The transactional data was recorded and associations were generated on a daily basis. Certain criteria were defined to reduce the number of rules discovered. The aggregation level was chosen so that products were treated mainly according to their function, ignoring brand and most physical characteristics. This reduced the number of transactions to 4,500 and thus the total number of possible rules to 10 million, but as a trade-off it required each product to be transformed prior to mining. The approach considerably reduced the time spent on analysis, although it added extra time to get the data ready. During this conversion, all purely operational transactions were also removed, leaving behind only the information useful for data mining.

In order to exclude products that are not important to the stores' total sales, the daily extracted rules were limited by setting the minimum support parameter to 1%. Products below this limit were excluded and not considered for rule formation; this also excluded product groups that did not account for 1% of the total number of transactions on a given day. Products on special offer were excluded from the rule base for the number of days they remained on offer and were analyzed in a separate database so that they would not affect the final association results.
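A sketch of this daily filtering step (the 1% limit is applied to that day's transactions; the data layout is assumed, not taken from the study):

from collections import Counter

def filter_rare_products(transactions, min_fraction=0.01):
    # Drop products appearing in fewer than min_fraction of the day's baskets.
    threshold = min_fraction * len(transactions)
    counts = Counter(item for basket in transactions for item in set(basket))
    keep = {item for item, c in counts.items() if c >= threshold}
    return [[item for item in basket if item in keep] for basket in transactions]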

Results:
The resulting rule base stored more than 6,000 extracted rules, which differed in the frequency of their occurrence. The rules for the two stores were compared: the first store was used as a reference for selecting associations, and their presence was then checked in the second store. Many associations were common to both rule bases, but for reasons of space only six were shown. The associations were classified as "Usual" and "Non-usual": usual products share a use or application that justifies the consumers' choice, whereas non-usual products have no direct link in terms of application or use between the associated products.

The Usual and Non-usual products showed different usage patterns and intensities in the two stores. For the usual products, stability in both stores was around 100%, but the average confidence differed between the stores, indicating that the associations, though stable in both, had different intensities. For the non-usual products, both stability and confidence fluctuated considerably: in store A the stability of these associations was greater than 95%, whereas in store B it was well below the 95% suggested for a stable association. This indicates that non-usual associations are peculiar to the store in which they are found and may not appear in another store, whereas usual associations can repeat themselves in stores serving a similar culture. In other words, associations between usual products tend to repeat themselves across groups of consumers with a similar profile, while associations between non-usual products are specific to a store, even for consumers with the same characteristics. The results also showed much larger fluctuations in the occurrence of both usual and non-usual associations in store B than in store A, but no significant differences in confidence between the two stores, which means that confidence cannot be used as a classification criterion.
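The two comparison measures used above can be made explicit with a short sketch (it assumes each association is recorded once per day it was extracted, together with that day's confidence):

def stability(days_present, total_days):
    # Percentage of intervals in which the association was extracted.
    return 100.0 * days_present / total_days

def average_confidence(daily_confidences):
    return sum(daily_confidences) / len(daily_confidences)

print(stability(114, 120))                     # ~95%: a stable ("Usual") association
print(average_confidence([0.42, 0.39, 0.45]))  # the association's intensity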

Conclusion:
From the above case study it was concluded that association rules enable retail organizations to analyze the data collected from their transactions and make intelligent decisions. The technique can reveal interesting associations in transaction data, and it can support cross-selling of associated products, careful planning of promotions, and setting the profit margins of associated products. It remains difficult, however, to classify customers into specific classes and to assign a commercial value to them, and the technique does not provide fully precise information because the quantities involved are not addressed; including quantity in the analysis could lead to better promotions and even higher profits.

PERFECT HASHING AND PRUNING ALGORITHM

Direct Hashing and Pruning (DHP) differs from the Apriori algorithm in that it uses a hashing technique to remove unpromising itemsets before generating the next set of candidate itemsets. In the Apriori algorithm, determining all the large itemsets from the set of candidate large itemsets in a given pass by scanning the database becomes very expensive; in DHP the set of candidate itemsets is reduced first, so it performs much faster than Apriori. However, DHP generates a number of false positives, and Perfect Hashing and Pruning (PHP) has been suggested as an improvement over it. In PHP, during the first pass a hash table is created whose size equals the number of distinct items in the database, and each individual item is mapped to its own location in the table, so after the first pass the hash table holds the exact number of occurrences of each item. The algorithm then removes every transaction that contains no frequent item and trims the infrequent items from the remaining transactions. In the same way, candidate k-itemsets are generated and their occurrences counted, so at the end of each pass the algorithm yields a pruned database, the occurrences of the candidate k-itemsets and the set of frequent k-itemsets; it terminates when no new frequent k-itemsets are found. PHP thus improves on DHP by producing no false positives, eliminating the extra time needed to recount itemset frequencies, and it also reduces the size of the database.
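A compact sketch of this flow is given below (not the authors' code): a Python dictionary stands in for the perfect hash table, and for brevity the candidate k-itemsets are enumerated directly from the pruned transactions rather than joined from the previous frequent set.

from collections import defaultdict
from itertools import combinations

def php(transactions, minsup):
    db = [set(t) for t in transactions]
    frequent, k = {}, 1
    while True:
        hk = defaultdict(int)                          # "perfect hash": one counter per k-itemset
        for t in db:
            for cand in combinations(sorted(t), k):
                hk[cand] += 1
        fk = {c: n for c, n in hk.items() if n >= minsup}
        if not fk:
            break
        frequent.update(fk)
        keep_items = {item for c in fk for item in c}
        db = [t & keep_items for t in db]              # keep only items in some frequent k-itemset
        db = [t for t in db if any(set(c) <= t for c in fk)]   # drop transactions with no frequent k-itemset
        k += 1
    return frequent

print(php([{"X", "Z"}, {"M", "X", "Z"}, {"M", "X", "Y", "Z"}, {"M", "N"}], 2))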

Working Example for Perfect Hashing and Pruning (PHP):

Dk = pruned database after pass k
Hk = occurrences of the candidate k-itemsets
Fk = set of frequent k-itemsets





Database:

TID    ITEMS
100    XZ
200    MXZ
300    MXYZ
400    MN

Support = 50% of the total number of transactions = 50% of 4 = 2.

Iteration 1:

H1:

LOCATION    ITEMSET    SUPPORT
1           X          3
2           Z          3
3           M          3
4           Y          1
5           N          1

F1:

ITEMSET    SUPPORT
H(1)       3
H(2)       3
H(3)       3

D1:

TID    ITEMS
100    XZ
200    MXZ
300    MXZ
400    M

In Iteration 1, H1 maps each individual item to its own location, as shown in the table, and the support of each item is counted. F1 keeps only those items that are large; each itemset in F1 is referred to as H(n), where n is the item's location in the hash table. From F1 the pruned database D1 is derived, containing only large items. Since Y and N turn out to be small itemsets, they are removed from the transactions in D1.

Iteration 2:

H2:

LOCATION    ITEMSET    SUPPORT
1           MX         2
2           XZ         3
3           MZ         2

F2:

ITEMSET    SUPPORT
H(1)       2
H(2)       3
H(3)       2

D2:

TID    ITEMS
100    XZ
200    MXZ
300    MXZ

In Iteration 2, H2 maps each candidate large 2-itemset to its own location, as shown in the table, and the support of each is counted. F2 keeps only those itemsets that are large. From F2 the pruned database D2 is derived, containing only large itemsets. Since TID 400 does not contain any large itemset, its entry is eliminated from D2.

Iteration 3:

H3:

LOCATION    ITEMSET    SUPPORT
1           MXZ        2

F3:

ITEMSET    SUPPORT
H(1)       2

D3:

TID    ITEMS
200    MXZ
300    MXZ
In Iteration 3, H3 maps each candidate large 3-itemset to its own location, as shown in the table, and the support of each is counted. F3 keeps only those itemsets that are large. From F3 the pruned database D3 is derived, containing only large itemsets. Since F3 contains only one itemset, no candidate large 4-itemsets can be formed and the algorithm terminates.

INVERTED HASHING AND PRUNING ALGORITHM

For transactions with many items, i.e. long transactions, the Inverted Hashing and Pruning (IHP) algorithm is suggested. With long transactions, scanning the database consumes considerable time, so IHP is more appropriate than the Apriori algorithm: it reduces the number of candidate itemsets for the second and subsequent passes and thus saves time scanning the transactions. It uses hash tables to eliminate some of the candidate itemsets.

In Inverted Hashing and Pruning, for each item the transaction IDs (TIDs) of the transactions containing that item are hashed into a hash table associated with that specific item, called the TID Hash Table (THT) of the item. Each item's THT covers all the possible hash values for the transactions: whenever the item occurs in a transaction, the TID of that transaction is hashed and the count in the corresponding THT entry is incremented by one. The THTs are therefore built during the first pass of the database, while the occurrences of each item are being counted. At the end of the first pass, the THTs of items that are not in the set of frequent 1-itemsets can be discarded, and the remaining THTs can be used to prune some of the candidate (k+1)-itemsets: for each possible (k+1)-itemset combination, the support derivable from the THTs of its items is examined, and if it comes out below the minimum support the candidate is eliminated. The items within each itemset are stored in lexical order, and the itemsets themselves are lexically ordered, to make candidate formation efficient. A further advantage of IHP is that between two consecutive passes many TID hash tables are eliminated, so most of the memory that held THTs in the earlier pass becomes available for the candidate itemsets of the next pass. For transactions with many items, IHP is therefore more advantageous than the Apriori algorithm.
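The pruning step can be sketched as follows; the hash-table size is an arbitrary choice for the example, and the support estimate obtained from the THTs is an upper bound, which is why candidates falling below minsup can be discarded safely.

from collections import Counter
from itertools import combinations

TABLE_SIZE = 8   # assumed for the example; real sizes depend on available memory

def build_tht(transactions):
    # item -> per-bucket counts of the transactions containing it
    tht = {}
    for tid, items in transactions:
        bucket = hash(tid) % TABLE_SIZE
        for item in items:
            tht.setdefault(item, Counter())[bucket] += 1
    return tht

def support_upper_bound(itemset, tht):
    # A transaction holding the whole itemset falls in the same bucket of every
    # member's table, so summing the per-bucket minima bounds its support.
    return sum(min(tht[i][b] for i in itemset) for b in range(TABLE_SIZE))

def prune_candidates(candidates, tht, minsup):
    return [c for c in candidates if support_upper_bound(c, tht) >= minsup]

tht = build_tht([(100, {"X", "Z"}), (200, {"M", "X", "Z"}),
                 (300, {"M", "X", "Y", "Z"}), (400, {"M", "N"})])
print(prune_candidates([("M", "X"), ("M", "N")], tht, 2))   # ("M", "N") is pruned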

MiRABIT ALGORITHM

The Apriori algorithm is the most basic algorithm for finding association rules in large databases, but for databases with a low average number of items per transaction it is not optimal. The MiRABIT algorithm was designed for such databases. In Apriori, the first pass counts the support of each individual item and keeps the items with at least minimum support as large items; in each later pass, candidate itemsets are generated at the beginning of the pass. Candidate generation, however, becomes the critical cost when transactions contain few items, so MiRABIT generates candidate itemsets within each transaction rather than at the beginning of each pass. Like Apriori, MiRABIT makes multiple passes over the database and counts individual item support in the first pass. In the second pass, the items of each transaction are grouped into sets of two, and the support of a two-item set is counted only if its subsets are large. The subsequent passes proceed in the same way, and at the end of each pass all candidate sets with at least minimum support become large itemsets. The process continues until the set of large itemsets has only one element or is empty. The advantage of MiRABIT is that it generates fewer candidate itemsets than Apriori, so for databases in which the average number of items per transaction is small its performance is better than that of the Apriori algorithm.
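A minimal sketch of this per-transaction candidate generation (names and structure are mine, not the original authors'):

from collections import Counter
from itertools import combinations

def mirabit_pass(transactions, previous_large, k, minsup):
    counts = Counter()
    for t in transactions:
        # Candidates are formed inside each transaction; a set is counted only
        # if all of its (k-1)-subsets were large in the previous pass.
        for cand in combinations(sorted(t), k):
            if all(sub in previous_large for sub in combinations(cand, k - 1)):
                counts[cand] += 1
    return {c for c, n in counts.items() if n >= minsup}

db = [{"X", "Z"}, {"M", "X", "Z"}, {"M", "X", "Y", "Z"}, {"M", "N"}]
l1 = mirabit_pass(db, {()}, 1, 2)   # every 0-subset is trivially "large"
l2 = mirabit_pass(db, l1, 2, 2)     # {('M','X'), ('M','Z'), ('X','Z')}
l3 = mirabit_pass(db, l2, 3, 2)     # {('M','X','Z')}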


Working Example for MiRABIT:

Database:

TID    ITEMS
100    XZ
200    MXZ
300    MXYZ
400    MN

Support = 50% of the total number of transactions = 50% of 4 = 2.


Iteration 1:

Individual Item Support

ITEMSET    SUPPORT
{X}        3
{Z}        3
{M}        3
{Y}        1
{N}        1

In the MiRABIT algorithm, Iteration 1 first counts the support of each individual item, as shown in the table above.



Iteration 2:

ITEMSET    SUPPORT
{XZ}       3
{MX}       2
{XY}       -
{MY}       -
{YZ}       -
{MZ}       2
{MN}       -
In Iteration 2 all candidate large 2-itemsets are formed, but support needs to be counted only for those candidate 2-itemsets whose subsets are large. In this example {Y} and {N} are not large itemsets because each has a support of only 1, so support is counted only for {XZ}, {MX} and {MZ}, and not for the candidates involving {Y} or {N}. Thus {XZ}, {MX} and {MZ} are the large 2-itemsets.


Iteration 3:

ITEMSET    SUPPORT
{MXZ}      2

In Iteration 3 the candidate large 3-itemsets are formed from the large 2-itemsets. Since this yields only one large itemset, {MXZ}, no further candidates can be generated and the algorithm terminates.

APRIORI, APRIORITID and APRIORI HYBRID

With large databases, algorithms are needed that process the data at high speed. Earlier algorithms for discovering association rules in large databases, such as AIS and SETM, were slow and tended to generate many itemsets that turned out to be small, wasting time. In their paper, Agrawal, R. and Srikant, R. proposed two new algorithms, Apriori and AprioriTid, to overcome these problems. The Apriori algorithm makes multiple passes over the database. In the first pass, individual item support is counted and the items whose support is greater than or equal to the minimum support are considered large items. In each subsequent pass k, the large itemsets of the previous pass are combined into candidate itemsets of k items; the support of each candidate itemset is counted, and a candidate whose support is at least the minimum support is considered large. This process continues until the set of large itemsets found in a pass is empty.
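A condensed sketch of these passes is shown below; apriori_gen here performs a simplified join over the items of the previous large itemsets followed by the usual subset-based prune.

from collections import Counter
from itertools import combinations

def apriori_gen(prev_large, k):
    items = sorted({i for itemset in prev_large for i in itemset})
    return [c for c in combinations(items, k)
            if all(s in prev_large for s in combinations(c, k - 1))]

def apriori(transactions, minsup):
    db = [set(t) for t in transactions]
    counts = Counter(i for t in db for i in t)
    large = {(i,) for i, n in counts.items() if n >= minsup}
    all_large, k = set(large), 2
    while large:
        candidates = apriori_gen(large, k)
        counts = Counter(c for t in db for c in candidates if set(c) <= t)
        large = {c for c in candidates if counts[c] >= minsup}
        all_large |= large
        k += 1
    return all_large

print(apriori([{"X", "Z"}, {"M", "X", "Z"}, {"M", "X", "Y", "Z"}, {"M", "N"}], 2))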

Like Apriori, the AprioriTid algorithm uses the apriori-gen function to determine the candidate itemsets, but with the difference that the database itself is not used for counting support after the first pass: for k > 1, a set of candidate itemsets associated with each transaction is used instead. If a transaction does not contain any candidate k-itemset, the set of candidate itemsets has no entry for that transaction, so the number of entries shrinks compared with the number of transactions in the database. As k increases, each entry also becomes smaller than the corresponding transaction, because the number of candidates contained in the transaction decreases. Apriori performs better than AprioriTid in the early passes, but AprioriTid outperforms Apriori in the later passes. For this reason, a third algorithm, Apriori Hybrid, uses Apriori in the initial passes and switches to AprioriTid in the later passes.
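The bookkeeping that lets AprioriTid avoid rescanning the database can be sketched as follows; `cbar` is my name for the Ck (BAR) structure used in the worked example below, mapping each TID to the candidate itemsets it contains.

from collections import Counter
from itertools import combinations

def aprioritid_pass(cbar, candidates, k, minsup):
    counts, new_cbar = Counter(), {}
    for tid, present in cbar.items():
        # A candidate is contained in the transaction exactly when all of its
        # (k-1)-subsets were recorded for this TID in the previous pass.
        found = [c for c in candidates
                 if all(s in present for s in combinations(c, k - 1))]
        if found:                        # TIDs with no candidates simply drop out
            new_cbar[tid] = set(found)
            counts.update(found)
    large = {c for c in candidates if counts[c] >= minsup}
    return large, new_cbar

# Starting from C1 (BAR) as built during the first (and only) database scan:
cbar1 = {100: {("X",), ("Z",)}, 200: {("M",), ("X",), ("Z",)},
         300: {("M",), ("X",), ("Y",), ("Z",)}, 400: {("M",), ("N",)}}
l2, cbar2 = aprioritid_pass(cbar1, [("M", "X"), ("M", "Z"), ("X", "Z")], 2, 2)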

Working Example for Apriori:

Database:

TID    ITEMS
100    XZ
200    MXZ
300    MXYZ
400    MN

Support = 50% of the total number of transactions = 50% of 4 = 2.

Iteration 1:

C1 (BAR):

TID    SET OF ITEMSETS
100    {X}, {Z}
200    {M}, {X}, {Z}
300    {M}, {X}, {Y}, {Z}
400    {M}, {N}

L1:

ITEMSET    SUPPORT
{X}        3
{M}        3
{Z}        3

As shown in the tables above, the first step is to generate C1 (BAR), which contains all the 1-itemsets in the database, kept together with the TIDs of the transactions that generate them. The next step is to generate all large 1-itemsets from C1 (BAR). Thus, at the end of Iteration 1 we get {X}, {M} and {Z} as the large 1-itemsets.

Iteration 2:

C2:

ITEMSET    SUPPORT
{MX}       2
{XZ}       3
{MZ}       2

C2 (BAR):

TID    ITEMS
100    {XZ}
200    {MX}, {XZ}, {MZ}
300    {MX}, {XZ}, {MZ}
400    ------------

L2:

ITEMSET    SUPPORT
{MX}       2
{XZ}       3
{MZ}       2

In Iteration 2, the first step is to find C2, the set of candidate large 2-itemsets, by calling the apriori-gen function on L1. In C2 the support of each candidate large 2-itemset is counted. C2 (BAR) then holds, for each TID, the candidate 2-itemsets contained in that transaction; note that TID 400 contains no candidate large 2-itemset. From C2 (BAR) we derive L2, which contains only the large 2-itemsets.

Iteration 3:

C3:

ITEMSET    SUPPORT
{MXZ}      2

C3 (BAR):

TID    ITEMS
100    ------------
200    {MXZ}
300    {MXZ}
400    ------------

L3:

ITEMSET    SUPPORT
{MXZ}      2




In Iteration 3, the first step is again to find C3, the set of candidate large 3-itemsets, by calling the apriori-gen function on L2. In C3 the support of each candidate large 3-itemset is counted. C3 (BAR) holds the candidate 3-itemsets of each transaction, associated with its TID; only TIDs 200 and 300 contain a candidate large 3-itemset. From C3 (BAR) we derive L3, which contains only the large 3-itemsets. At the end of Iteration 3 we are left with the single large itemset {MXZ}, so no candidate large 4-itemsets can be formed and the algorithm terminates.

ASSOCIATION RULE TECHNIQUE

Let us first understand the basic methodology of association rules. Let I be a set of literals called items, and let D be a set of transactions, where each transaction T is a set of items such that T is a subset of I. If X is a set of items in I and X is a subset of T, we say that T contains X. An association rule is an implication of the form X => Y, where X and Y are both subsets of I and X∩Y = Φ. The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y, and it has support s if s% of the transactions in D contain X U Y. Itemsets whose support is at least the minimum support are called large itemsets, and those with lower support are referred to as small itemsets. For a given set of transactions D, the problem of mining association rules is to generate all association rules whose confidence and support exceed the user-specified minimum confidence and minimum support.
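A small sketch makes these definitions concrete (the transactions reuse the toy database from the worked examples above):

def support(itemset, D):
    # Fraction of transactions in D that contain every item of the itemset.
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y, D):
    # Fraction of transactions containing X that also contain Y.
    return support(X | Y, D) / support(X, D)

D = [{"X", "Z"}, {"M", "X", "Z"}, {"M", "X", "Y", "Z"}, {"M", "N"}]
print(support({"X", "Z"}, D))          # 0.75
print(confidence({"X"}, {"Z"}, D))     # 1.0: every transaction with X also has Z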

BACKGROUND

Association Rule Discovery, a data mining technique, was formally introduced by Rakesh Agrawal, Ramakrishnan Srikant and Imielinski in 1993. The research on the technique was carried out under the "Quest Project at IBM Almaden Research Center" and at the "University of Helsinki". Association rules work on the principle of finding relationships between items across the transactions in a database: since each transaction contains a set of items, the goal is to find an item x whose occurrence implies the occurrence of another item, say y. The association rule x => y holds if its confidence and support exceed the user-specified minimum confidence (minconf) and minimum support (minsup). Support is the percentage of all transactions in the database that contain the itemset; confidence is the percentage of transactions containing the antecedent that also contain the whole itemset. There are essentially two steps in finding association rules in transactional databases: first, generate the frequent itemsets; second, generate from them the association rules that satisfy the minimum support and minimum confidence.
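The second step can be sketched as follows, assuming the supports of all frequent itemsets were collected in a dictionary during the first step (itemset tuples are kept in a fixed item order):

from itertools import combinations

def rules_from_frequent(frequent_support, minconf):
    rules = []
    for itemset, sup in frequent_support.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                conf = sup / frequent_support[antecedent]
                if conf >= minconf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, sup, conf))
    return rules

freq = {("M",): 0.75, ("X",): 0.75, ("Z",): 0.75, ("M", "X"): 0.5,
        ("M", "Z"): 0.5, ("X", "Z"): 0.75, ("M", "X", "Z"): 0.5}
print(rules_from_frequent(freq, 0.6))   # e.g. ('X',) => ('Z',) with confidence 1.0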

Association Rule Discovery is one of the most widely used and widely accepted data mining strategies, focusing on detecting interesting associations between items in large databases. It is commonly used for market basket analysis, to determine which combinations of items are likely to appeal to a consumer based on prior records, and it has also been used to create predictive association rules for classification problems. The information collected with the technique also helps companies make decisions, forecast sales, detect fraud, and so on.

The paper below summarizes the basic methodology of association rules and, in the third section, the association mining algorithms. These include the basic Apriori algorithm along with AprioriTid, AprioriHybrid, MiRABIT, Inverted Hashing and Pruning, and Perfect Hashing and Pruning. The fourth section presents two case studies that demonstrate the usefulness of the association rule technique in the real world. The first case study concerns the continuous analysis of market basket data in retail organizations, helping management take decisions based on the associations revealed between products. The second addresses the crucial problem of traffic safety in Belgium and explains the use of association rules for outlining the main spots and zones that are highly prone to accidents; it also uses association rules to discover the characteristics that discriminate between the high frequency and low frequency accident locations. The fifth and final section contains the conclusion of the paper.

INTRODUCTION: ASSOCIATION RULE DISCOVERY

T. S. Eliot, the famous poet, once said, "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" In every sphere of life, important decisions have to be taken in order to overcome problems or to make profits in business. The information required for such decisions comes from studying the data stored in large databases, and without rigorous analysis of these databases it is not possible to extract the important information that lies within the data. The rapid growth in the size of databases has created the need for tools and techniques that can extract useful knowledge automatically by analyzing large transactional databases efficiently. Discovering patterns and behavior in these large databases is called knowledge discovery in databases, or data mining. Data mining has been described as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data". It is used to find real-life trends that may have gone unnoticed but can be highly significant once discovered.

The term "data mining" is often confused with "statistics", as both aim at exploring the structure of data. While the groundwork was laid in the tradition of the statistician's penchant for exhaustive collections of data, data mining techniques differ fundamentally from conventional methods of analysis; central to this difference is the emphasis placed on the algorithm rather than on the model. The various data mining techniques used in industry can be broadly classified into classical techniques and next-generation techniques; they include clustering, decision trees, Bayesian networks, classification rules, data summarization, neural networks and many others. Association Rule Discovery is one such dimension of data mining, belonging to the next-generation techniques.