The Pareto Principle in Datamining : an Above-Average Fencing Algorithm

This paper formulates a new datamining problem: which subset of the input space has the relatively highest output, where the minimal size of this subset is given. This can be useful where usual datamining methods fail because of error distribution asymmetry. The paper provides a novel algorithm for this datamining problem and compares it with clustering of above-average individuals.


Introduction
In some cases, usual methods of supervised learning are not able to provide satisfactory results. This may occur in data with asymmetrically distributed error, which is typical in insurance. In order to manage the asymmetry in the sense of the law of large numbers, this paper offers a new algorithm, which constructs a predictor not for points, but for sets. We will show an algorithm for finding sets of units with above-average outputs.
Let X be a set, and let μ, ν be measures over it. The Pareto principle arises if there is a set P ⊂ X where

p(P, X) ≡ (ν(P)/μ(P)) / (ν(X)/μ(X)) > 1 and r(P) ≡ μ(P)/μ(X) > 0.    (1)

Let μ stand for volume, ν for production, p for productivity, and r for proportion. Typically, the Pareto principle is considered as a rule that 20 % of elements "produce" 80 % or more of the output. (This principle was discovered by Vilfredo Pareto while assessing the welfare distribution in the UK at the end of the 19th century. His ideas were systematically described, applied and extended by Max Lorenz [7].) In this case, r(P) = 0.2 and p(P, X) ≥ 4. Managerial science often works with the Pareto diagram [1].
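As a minimal illustration (hypothetical data, counting measure as volume), the relative productivity p(P, X) and proportion r(P) of definition (1) can be computed as follows:

```python
def p(subset, universe, production):
    """Relative productivity p(P, X): average production of P over that of X."""
    avg = lambda s: sum(production[i] for i in s) / len(s)
    return avg(subset) / avg(universe)

def r(subset, universe):
    """Proportion r(P): share of elements (counting measure as volume)."""
    return len(subset) / len(universe)

# Hypothetical 80/20 data: 2 of 10 elements produce 80 of 100 units.
production = {0: 40.0, 1: 40.0}
production.update({i: 2.5 for i in range(2, 10)})
P, X = {0, 1}, set(range(10))
print(r(P, X))              # 0.2
print(p(P, X, production))  # 4.0
```

The printed values reproduce the classical 80/20 case: a 20 % subset with four times the average productivity.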
Its elements x₁, x₂, …, x_N are ordered by their production ν(x_i), and the production is drawn in a chart. In statistics, the Pareto principle is represented by the continuous Pareto distribution with distribution function F(x) = 1 − (x_m/x)^k for all x ≥ x_m, where x_m is the (necessarily positive) minimum possible value of X, and k is a positive parameter. The Pareto distribution has positive skewness 2(1+k)/(k−3) · √((k−2)/k) for k > 3, which means that the below-average subset is bigger than the above-average complement. This occurs in many real life situations: the median citizen has a below-average salary, the median driver causes below-average claims, etc.
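This below-average majority can be checked numerically by inverse-CDF sampling of the Pareto distribution; the parameter k = 1.16 (commonly cited as giving roughly the 80/20 split) is an illustrative assumption:

```python
import random

def pareto_sample(k, x_m, n, seed=0):
    """Sample the Pareto distribution via the inverse CDF of F(x) = 1 - (x_m/x)**k."""
    rng = random.Random(seed)
    return [x_m * (1.0 - rng.random()) ** (-1.0 / k) for _ in range(n)]

xs = pareto_sample(k=1.16, x_m=1.0, n=100_000)
mean = sum(xs) / len(xs)
below = sum(x < mean for x in xs) / len(xs)
print(below)  # well above 0.5: the below-average subset dominates
```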
Let x : X → Rⁿ be attributes of X. (Attributes can be considered as columns in a data table, i.e. the n-tuple from the i-th row. Mapping x can also involve preprocessing. If X is R^k for some k ∈ N, the x mapping may also be an identity.) The problem of prediction consists in constructing a mapping ŷ such that

ŷ(x(i)) ≈ y(i).    (2)

However, the construction of the mapping ŷ may be difficult if the Pareto principle arises. A small subset of high productivity (called outliers) corrupts usual assumptions. Usual datamining techniques propose removing this set and working only with the rest. However, in the case of the Pareto principle, the small set is very interesting. It is not adequate to speak about outliers, because such data is relevant and obvious. Therefore, we are dealing with a more humble result, i.e. with finding the set P defined in (1).
The formulated problem, i.e. finding

arg max_P p(P, X) subject to r(P) ≥ r₀ for a given minimal proportion r₀,    (3)

is new, and has not been found in the current literature. However, many other topics are related to it. First, clustering methods [12] can be employed. Creating clusters of above-average individuals, the set P will be defined as these clusters. It is important to define the border of the clusters somehow. If these clusters are well found, they can be employed for a more precise approximation [6]. Another approach is to attempt to find a prediction mapping ŷ, where the set P is afterwards defined at some level of this mapping. RBF neural networks [5] provide an example where the approach of rough sets is employed. Finally, an effective P can also be detected by Data Envelopment Analysis [2]. However, none of these methods (as the research in the bibliography shows) has been applied explicitly to the problem of above-average subsets.
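On a toy scale, problem (3) can be solved directly by brute force (clearly intractable for real data, which motivates a heuristic); the values below are hypothetical:

```python
from itertools import combinations

def best_subset(values, r0):
    """Brute-force arg max of p(P, X) over subsets with r(P) >= r0 (toy scale only)."""
    n = len(values)
    size = max(1, round(r0 * n))  # with positive values, the smallest admissible size wins
    best = max(combinations(range(n), size),
               key=lambda P: sum(values[i] for i in P) / len(P))
    return set(best)

values = [1, 50, 2, 40, 3, 1, 2, 1, 90, 2]
print(best_subset(values, 0.2))  # {1, 8}: the two most productive elements
```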

The Fencing algorithm
The following algorithm is the first attempt to solve problem (3). It offers the construction of P̂ ⊂ X with above-average production, i.e. with high p. The space Rⁿ will be considered as X.
The set P is constructed as a union of intervals; other ways to construct such subsets may be considered. The following algorithm uses fencing. Fencing is a heuristic approach which anticipates that areas of higher average production are located between mutually close points with high production. The algorithm attempts to build a rectangular fence around the area of above-average production, as shown in Fig. 1.

Measuring and data preprocessing
An x mapping is necessary. This mapping involves measuring and data preprocessing. The simplest way is to transform binary attributes into real attributes by 0-1 coding. Categorical attributes are transformed into several binary attributes. It is very useful to reduce the input vector dimension, e.g. by Principal Components Analysis [11]. Let us define x_max,j ≡ max_i x_j(i) for all j. The vector x_max is used for the complement coding of x.
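A sketch of such preprocessing (attribute names and values hypothetical): 0-1 coding of a categorical attribute, followed by complement coding against the coordinatewise maximum x_max:

```python
def one_hot(value, categories):
    """0-1 coding: one binary attribute per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def complement_code(x, x_max):
    """Complement coding cc(x) = (x, x_max - x), doubling the dimension."""
    return x + [m - v for v, m in zip(x, x_max)]

x = one_hot("urban", ["urban", "rural", "highway"]) + [0.4]  # categorical + numeric
x_max = [1.0, 1.0, 1.0, 1.0]
print(complement_code(x, x_max))  # [1.0, 0.0, 0.0, 0.4, 0.0, 1.0, 1.0, 0.6]
```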

Data splitting
The data set D is divided randomly into three subsets: base subset B, training subset T and validation subset V. Sets B and T are used for constructing the predictor, whereby their sizes satisfy |B| << |T|, say 10 |B| = |T|. The size of V is chosen with respect to cross validation [8].
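A minimal sketch of the split, assuming a 1 : 10 : 3 ratio for B : T : V (the B : T ratio is from the text; the share of V is an assumption):

```python
import random

def split_data(data, seed=0):
    """Randomly split D into base B, training T and validation V with |T| = 10 |B|."""
    rng = random.Random(seed)
    d = list(data)
    rng.shuffle(d)
    n_b = len(d) // 14          # 14 parts: 1 for B, 10 for T, 3 for V
    n_t = 10 * n_b
    return d[:n_b], d[n_b:n_b + n_t], d[n_b + n_t:]

B, T, V = split_data(range(1400))
print(len(B), len(T), len(V))  # 100 1000 300
```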

Starting set of intervals
The starting set of intervals is defined as the set of point intervals [cc(x(i)), cc(x(i))] for i ∈ B, where cc is the complement coding cc : Rⁿ → R²ⁿ, cc(x) = (x, x_max − x).
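In code, assuming the representation above, each complement-coded base point becomes a degenerate hyperbox:

```python
def starting_intervals(base, x_max):
    """Each complement-coded base point becomes a degenerate interval [cc(x), cc(x)]."""
    intervals = []
    for x in base:
        cc = x + [m - v for v, m in zip(x, x_max)]  # complement coding
        intervals.append((cc[:], cc[:]))            # (lower corner, upper corner)
    return intervals

ivs = starting_intervals([[0.2, 0.5], [0.7, 0.1]], x_max=[1.0, 1.0])
print(ivs[0])  # ([0.2, 0.5, 0.8, 0.5], [0.2, 0.5, 0.8, 0.5])
```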

Interval expansion
Two intervals r₁, r₂ can be expanded into the smallest interval r containing both of them. The suitability of a pair is based on a metric d. The first version of the algorithm worked with the Hamming distance [6], but other metrics can also be applied. In each step, a pair of intervals is tested for expansion. The pair is selected partly randomly as follows: the pair with highest suitability (probability 0.8), the pair with lowest suitability (0.1), or a random pair (0.1). If a pair is expanded, the suitability of the new interval is recalculated with respect to all other intervals.
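A sketch of one expansion step and of the randomized selection rule (the suitability function is left abstract, since the paper does not fix it here):

```python
import random

def expand(r1, r2):
    """Smallest hyperbox containing both intervals: the new 'fence'."""
    lo = [min(a, b) for a, b in zip(r1[0], r2[0])]
    hi = [max(a, b) for a, b in zip(r1[1], r2[1])]
    return (lo, hi)

def pick_pair(pairs, suitability, rng):
    """Best pair with probability 0.8, worst with 0.1, a random one with 0.1."""
    u = rng.random()
    ranked = sorted(pairs, key=suitability)
    if u < 0.8:
        return ranked[-1]   # highest suitability
    if u < 0.9:
        return ranked[0]    # lowest suitability
    return rng.choice(pairs)

r = expand(([0.1, 0.2], [0.3, 0.4]), ([0.2, 0.0], [0.5, 0.3]))
print(r)  # ([0.1, 0.0], [0.5, 0.4])
```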
Fig. 1: Fencing: expanding the square, a new fence (dashed) is recommended towards the rectangle that is close and has high productivity

Termination
The algorithm terminates after all pairs of intervals have been tested and none can be expanded. Afterwards, unexpanded intervals (i.e. points) are deleted. Because intervals may overlap, their union may have a lower p than the average p of all intervals. Therefore, only the intervals with the highest p are considered as results, so that their union has a high enough p (e.g. higher than a given threshold).
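A sketch of this final selection, with the interval productivities and the threshold as hypothetical inputs:

```python
def select_intervals(intervals, p_of, threshold):
    """Keep only intervals whose productivity exceeds the threshold, best first."""
    return [iv for iv in sorted(intervals, key=p_of, reverse=True)
            if p_of(iv) > threshold]

p_values = {"a": 1.4, "b": 1.1, "c": 1.25}  # hypothetical interval productivities
print(select_intervals(list(p_values), p_values.get, threshold=1.15))  # ['a', 'c']
```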

Validation
Finally, the results are validated with respect to the validation set V: r(P ∩ V) and p(P ∩ V, V) are calculated. Such values can be considered as the quality of the algorithm.

Results
The Fencing algorithm has been applied successfully to data on 18 177 insurance claims related to traffic accidents in the Czech Republic in 2003-2005. Categorical and numerical attributes were transformed into binary attributes; there was a total of 135 binary attributes. The considered attributes and their transformation are summarized in Table 1. The Fencing algorithm has been implemented in Matlab as a set of simple scripts. It should be mentioned that this particular data set inspired the author to invent the Fencing algorithm, after attempts to build some regression model failed. Generalized Linear Models [4], which are typical in insurance mathematics, and multilayer perceptrons [8] did not provide sufficient results, as shown in Table 2. The data set was split into 10 subsets and one subset was always tested. The logarithm of total costs was taken as the output variable. However, the mean absolute error remains very high (the prediction and reality differ over twentyfold on average!).
First experiments showed that the 135-dimensional space is too sparse and that there are many further unexpandable intervals. Therefore, the dimension was reduced by selecting 36 attributes describing the region of the claimant, road type, and cause of the accident. After further unsuccessful experiments with the thresholds p = 1.4 and p = 1.2, it was necessary to set p = 1.15. Then 9 intervals were found. However, their union had p = 1.19 only.

Comparison
The problem (3) formulated here is novel, and the Fencing algorithm is the only solution so far. However, for a simple comparison, a clustering-based method was involved that can be described briefly as follows:

1. Building above-average clusters from the training data: the best 20 % of records were extracted and clustered via the k-means algorithm. For each cluster, the diameter was calculated as the maximum distance between the center and a record belonging to it.

2. Finding above-average records in the testing data: for each record, we test whether there is a cluster whose center is closer to the record than c times the diameter of the cluster. The parameter c is set so that the level of r is satisfied. Thus P is defined and r ensured.

3. Calculation of p: from the resulting P, p is calculated.
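The steps above can be sketched as follows (toy 2-D data, plain Lloyd's k-means; all names and values are illustrative):

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

def diameters(centers, groups):
    """Cluster 'diameter': maximum distance from the center to a member record."""
    return [max(math.dist(c, p) for p in g) if g else 0.0
            for c, g in zip(centers, groups)]

def in_P(record, centers, diams, c=1.0):
    """A record belongs to P if some center is within c times the cluster diameter."""
    return any(math.dist(record, ctr) <= c * d for ctr, d in zip(centers, diams))

# Hypothetical top-20% records clustering around (0, 0) and (5, 5).
top = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1), (5.0, 5.1), (4.9, 5.0), (5.1, 4.9)]
centers, groups = kmeans(top, k=2)
diams = diameters(centers, groups)
print(in_P((0.05, 0.05), centers, diams, c=2.0))  # True
print(in_P((2.5, 2.5), centers, diams, c=2.0))    # False
```

Tuning c trades off r against p, mirroring step 2 above.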
Table 3 shows the results achieved by this method and compares them with the Fencing algorithm: the alternative method, based on a known algorithm, provides less narrow results. However, the goal of this paper was not to test the proposed Fencing algorithm, but to show that this algorithm is able to solve the problem (3) formulated above. More experiments with the k-means based approach might provide better results.

Discussion and further work
The Fencing algorithm can be modified so that the suitability v_{a,b} is calculated in another way. There should be an increase in p in both intervals and a decrease in the distance between r(a) and r(b). The randomized selection rule can also be modified.
If a pair of intervals is tested, the whole training set T is gone through. This is probably the Achilles heel, because the size of T is usually very large. Therefore, a more detailed examination of complexity and the design of more suitable data structures are desirable.
The basic idea of constructing an above-average subset can be evolved in many ways. The subset need not be a union of intervals; it may be a union of simplexes. The set need not be crisp; it may be fuzzy. Or the subset can be given in an algebraic form and detected by genetic programming or other optimization methods, such as Ant Colony Optimization [10].
The Fencing algorithm will be compared with these other approaches in terms of complexity and effectiveness on more data sets. A systematic examination of relevant preprocessing methods is also desirable. Finally, the algorithm could be modified to work not with data, but with an estimated probability function, e.g. in the form of copulas [9], which are more appropriate for asymmetric distributions.

Conclusion
The Fencing algorithm is a novel heuristic method for finding a subset of the input space with above-average production. The main idea of the algorithm is to join intervals with high production and small mutual distance. The Fencing algorithm has been successfully applied to insurance data. Further work has been discussed above.

…and k is a parameter. The definition of B_O ensures that p(B, B_O) > k.
The set P is represented by a union of intervals (from points to hyperboxes) represented by means of complement coding. (Complement coding is a concept applied in ART and ARTMAP neural networks, e.g. [3]. However, the objective of the Fencing algorithm is different from the iterative operation of ART and ARTMAP.)

Table 1 :
Transformation of observed values into input and output variables

Table 2 :
Mean absolute error of machine learning for different validation sets

Table 3 :
Comparison of the Fencing algorithm with the k-means based approach