Association Rule Discovery: CASE STUDY 2

This case study explains how association rules were used to identify the spots and zones in Belgium that are highly prone to accidents. According to the statistics, Belgium recorded as many as 50,000 traffic accidents, which injured around 70,000 people and killed approximately 1,500. These high accident rates not only made the roads less safe but also carried substantial economic costs. Traffic safety therefore became a serious concern for the Belgian government.

Initially, statistical methods were used to analyze the registered accidents, but it was soon realized that association rules are a better alternative for finding the relevant variables, which help in understanding the circumstances in which the accidents took place.

Data:

The data under study covers the traffic accidents that occurred in the region of Flanders (Belgium) in 1999. It was obtained from the National Institute of Statistics (NIS) and contains 34,353 accident records. Each record stores information about the course of the accident along with various traffic, environmental, human, road and geographical conditions; on average, 45 attributes are available per accident. Maximum speed, weather, time of accident, road obstacles, alcohol and location are some of those attributes. When the data was first analyzed, it became clear that some attributes have a constant value for all accidents. This has no effect on the association rules discovered, because association rule discovery algorithms count the frequency of the attribute values in the dataset.

Mining Process:

The mining process was divided into three steps: a pre-processing step, in which the data was prepared in a form that the association algorithm could use; a mining step, in which all the association rules were generated; and a post-processing step, in which the interesting association rules were identified.

These steps in detail are as follows:

a) Pre-processing Step:

The first step in pre-processing assigned a location parameter to each accident so that high frequency accident locations could be distinguished from low frequency accident locations. This location parameter corresponds to a unique geographical location. Accidents which took place either in a district or in a province were assigned the unique road identification number and the kilometer mark. The other accidents were assigned a location using the name of the street and the city in which the accident took place.
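The location assignment described above can be sketched in Python. This is only an illustration: the record layout (a dict) and the field names `road_id`, `km_mark`, `street` and `city` are assumptions, not names taken from the NIS data.

```python
def location_key(accident):
    """Return a hashable location key for one accident record (a dict).

    Field names are hypothetical; the real data set may use other names.
    """
    road_id = accident.get("road_id")
    km_mark = accident.get("km_mark")
    if road_id is not None and km_mark is not None:
        # Numbered road: identify the spot by road number and kilometer mark.
        return ("road", road_id, km_mark)
    # Otherwise fall back to city and street name, normalized for comparison.
    return ("street",
            accident["city"].strip().lower(),
            accident["street"].strip().lower())


on_road = {"road_id": "N2", "km_mark": 14.3}
in_town = {"city": "Hasselt ", "street": "Kempische Steenweg"}
print(location_key(on_road))
print(location_key(in_town))
```

Two accidents then belong to the same location exactly when their keys are equal, which is all the later frequency count needs.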

The second step selected two different datasets in order to determine associations between the accident attributes. The first analysis was done on the data related to high frequency accident locations and the second on the data corresponding to low frequency accident locations. Comparing the analyses of these two datasets helped in determining the accident variables prevalent in the high frequency accident locations. For the first analysis, a criterion of five accidents per location was set for identifying locations that are highly prone to accidents, which resulted in a total of 3,368 traffic accidents. The remaining 30,985 accident records were included in the second analysis for low frequency accident locations.

In the next step, all the continuous variables in the dataset were discretized so that association rules could be generated in the mining step. Each continuous attribute was divided into intervals, turning it into a discrete attribute. These intervals were either obtained from expert knowledge or generated by a method called Equal Frequency Binning.

Lastly, the nominal attributes were pre-processed by converting each of their values into an equivalent binary attribute.
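A minimal sketch of this split, assuming that "five accidents per location" means at least five and that each record already carries a location key (the `loc` field below is hypothetical):

```python
from collections import Counter


def split_by_location_frequency(accidents, key, min_accidents=5):
    """Split accident records into (high, low) frequency location groups.

    `key` maps a record to its location; a location counts as high
    frequency when it has at least `min_accidents` records (assumption).
    """
    counts = Counter(key(a) for a in accidents)
    high = [a for a in accidents if counts[key(a)] >= min_accidents]
    low = [a for a in accidents if counts[key(a)] < min_accidents]
    return high, low


records = [{"loc": "A"}] * 5 + [{"loc": "B"}] * 2
high, low = split_by_location_frequency(records, key=lambda a: a["loc"])
print(len(high), len(low))  # 5 2
```

Every record lands in exactly one of the two groups, which matches the figures in the text: 3,368 + 30,985 = 34,353.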
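Equal frequency binning chooses cut points so that each interval holds roughly the same number of observations. The sketch below is a generic version of the technique, not the study's own code; the function names and sample values are illustrative.

```python
from bisect import bisect_right


def equal_frequency_cut_points(values, n_bins):
    """Return n_bins - 1 cut points that split `values` into bins of
    roughly equal size (ties can make the bins uneven)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bin(value, cut_points):
    """Return the 0-based index of the bin that `value` falls into."""
    return bisect_right(cut_points, value)


speeds = [30, 50, 50, 70, 70, 90, 90, 120, 120]
cuts = equal_frequency_cut_points(speeds, 3)
print(cuts)
print([assign_bin(s, cuts) for s in speeds])
```

Each bin index can then be treated as one discrete value of the attribute, for example "speed=bin1".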
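One common way to do this conversion is to create one binary column per attribute and value pair. The sketch below assumes records are dicts; the attribute names are made up for illustration.

```python
def binarize(records):
    """Turn nominal records into a 0/1 table.

    Returns (columns, rows): `columns` lists every "attribute=value"
    pair seen in the data, and each row holds 1 where the record has
    that value and 0 otherwise.
    """
    columns = sorted({f"{attr}={val}" for rec in records
                      for attr, val in rec.items()})
    rows = []
    for rec in records:
        present = {f"{attr}={val}" for attr, val in rec.items()}
        rows.append([1 if col in present else 0 for col in columns])
    return columns, rows


columns, rows = binarize([
    {"weather": "rain", "alcohol": "no"},
    {"weather": "dry", "alcohol": "no"},
])
print(columns)
print(rows)
```

Each "attribute=value" column then plays the role of an item in the association rule mining step.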

b) Association Rules Generation:

Minimum support (minsup) and minimum confidence (minconf) are the foremost requirements for generating association rules. Using trial and error, it was decided that the minimum support for the analysis of the two datasets would be 5% and the minimum confidence 30%. This means that only those itemsets which occur at least 165 times in the accidents are considered frequent. The minimum support was chosen with care: a value that is too high yields only trivial rules, while a value that is too low makes the number of frequent itemsets explode. Also, only those rules are considered reliable whose consequent occurs in roughly one out of every three accidents in which the antecedent appears.

After applying the association algorithm to both data sets, the results for the high frequency accident locations included “187,829 frequent item sets of maximum size 4 for which 598,584 association rules could be generated”. The results for the second dataset, covering low frequency accident locations, included “183,730 frequent item sets of maximum size 4 for which 575,974 association rules could be generated”.
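Support and confidence can be computed directly from their definitions. The sketch below assumes each accident is represented as a Python set of items; the item names and the helper `is_reliable_rule` are illustrative, and only the 5% and 30% thresholds come from the text.

```python
def support(itemset, transactions):
    """Fraction of transactions (sets of items) containing `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    a = frozenset(antecedent)
    sup_a = support(a, transactions)
    if sup_a == 0:
        return 0.0
    return support(a | frozenset(consequent), transactions) / sup_a


def is_reliable_rule(antecedent, consequent, transactions,
                     minsup=0.05, minconf=0.30):
    """Check a rule against the two thresholds named in the text."""
    both = frozenset(antecedent) | frozenset(consequent)
    return (support(both, transactions) >= minsup
            and confidence(antecedent, consequent, transactions) >= minconf)


accidents = [{"rain", "night"}, {"rain"}, {"dry"}, {"rain", "night"}]
print(support({"rain"}, accidents))                # 0.75
print(is_reliable_rule({"rain"}, {"night"}, accidents))  # True
```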
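The text does not name the algorithm that was used. As an illustration only, a level-wise (Apriori-style) search for frequent item sets up to size 4 could look like the sketch below, which assumes each accident is a Python set of items.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup, max_size=4):
    """Return {itemset: count} for all item sets with support >= minsup,
    up to `max_size` items, using a level-wise Apriori-style search."""
    min_count = minsup * len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        # Join step: merge frequent sets that differ in one item.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
    return frequent


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(data, minsup=0.5)
print(len(result))  # 6
```

Capping the size at 4 mirrors the "maximum size 4" in the quoted results; rules are then formed by splitting each frequent item set into an antecedent and a consequent.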

c) Post-processing of Association Rules:

The last step in the mining process was post-processing, in which the association rules generated in the previous step were evaluated against interestingness criteria in order to narrow down their number. After filtering on the lift value and eliminating the trivial rules, only 14,690 association rules were left for the high frequency accident locations and 77,282 for the low frequency accident locations. However, the problem of finding the discriminating factors between high and low frequency accident locations remained. So, a measure based on the deviation in support of a rule between the two kinds of accident locations was used to evaluate its interestingness. This measure is as follows:

“I = (Sh - Sl) / max{Sl, Sh}” [pdf10]

Here, I stands for interest. “The numerator Sh-Sl measures the difference in support for the rules in the high frequency accident data set (Sh) and the low frequency accident data set (Sl). The expression max {Sl, Sh} is called the normalizing factor as it normalizes the interestingness measure onto the scale [-1, 1]”.
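Both post-processing measures are simple to compute. The sketch below uses the standard definition of lift and the interest formula quoted above; the function names are illustrative, accidents are assumed to be Python sets of items, and no lift threshold is hard-coded because the text does not state the one that was used.

```python
def lift(antecedent, consequent, transactions):
    """Lift of the rule antecedent -> consequent.

    A value of 1 means the two sides occur together exactly as often as
    expected if they were independent.
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    if count_a == 0 or count_c == 0:
        return 0.0
    # sup(A and C) / (sup(A) * sup(C)), written with raw counts.
    return (count_ac * n) / (count_a * count_c)


def interestingness(s_high, s_low):
    """I = (Sh - Sl) / max{Sl, Sh}, a value in [-1, 1]."""
    denom = max(s_high, s_low)
    if denom == 0:
        return 0.0
    return (s_high - s_low) / denom


print(interestingness(0.10, 0.05))  # 0.5
print(interestingness(0.05, 0.10))  # -0.5
```

A positive I means the rule has more support in the high frequency data set, a negative I means more support in the low frequency data set, and a value near 0 means the rule does not discriminate between the two.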

Results:

After the analysis of the two datasets, the association rules common to both high and low frequency accident locations were further post-processed on the basis of the interestingness measure, which resulted in a total of 50 discriminating association rules. The results showed that a rule with a high interestingness value may have a low lift value and vice versa, so the two measures are not correlated. Exhibit 3 and Exhibit 4 showed that rules with a high lift value and a low interestingness value were common to both high and low frequency accident locations.

Exhibit 4 divided the association rules into three sections: highest interestingness, high lift value and low lift value. In the first section, the 10 rules with the highest interestingness values indicated that geographical conditions were among the most discriminating variables between high and low frequency accident locations. For example, roadways outside the inner city with separated lanes are a characteristic of black spots and black zones. The rules also indicated that the drivers involved in these accidents were between 30 and 45 years of age. The second section includes 10 association rules with a high lift value. These rules describe accident characteristics such as the number of people involved in the accident, and they apply more to low frequency accident locations than to high frequency ones. The 10 association rules in the third section, with a low lift value, showed accident patterns that mainly correspond to human characteristics and discriminate only a little between high and low frequency accident locations.

Looking at the support values across all three sections of the table, support remains stable for low frequency accident locations, whereas for high frequency accident locations it is quite high in the first section and quite low in the second and third. The characteristics described by the rules with a high interestingness value are therefore the ones that contribute to the occurrence of accidents at high frequency accident locations. Apart from this, some association rules relate mainly to high frequency accident locations and do not apply to low frequency ones; they are therefore called discriminating characteristics between high and low frequency accident locations.

Conclusion:

The case study concluded that the analysis of the datasets helped in finding the combinations of attributes that are together responsible for the occurrence of accidents at high frequency accident locations. However, a lot is still left to be discovered, including the extent of casualties from these accidents. The study also left scope for future research on the discriminating factors between high and low frequency accident locations. For now, the only suggestion was to appoint special traffic police in areas that are highly prone to accidents, under the specific circumstances in which these accidents take place. These circumstances were described in the form of rules in tables.

Tuesday, September 30, 2008
