Multivariate Discretization of Continuous Variables for Set Mining

Stephen D. Bay
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697, USA
sbay@ics.uci.edu

Many algorithms in data mining can be formulated as set mining problems, where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set mining techniques have largely been designed for categorical or discrete data, where variables can take on only a fixed number of values. However, many data sets also contain continuous variables, and a common way of dealing with these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with the class variable). We argue that this is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in data. We further argue that discretization should consider the effects on all variables in the analysis, and that two regions X and Y should be in the same cell after discretization only if the instances in those regions have similar multivariate distributions ($F_x \sim F_y$) across all variables and combinations of variables. We present a bottom-up merging algorithm to discretize continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it does not destroy hidden patterns, and that it generates meaningful intervals.

Comments: The journal version of this paper has a much more detailed description of my approach for multivariate discretization, and it includes many graphs to help visualize the results.
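The bottom-up merging rule described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say how similarity of multivariate distributions is tested, so the sketch assumes a simple stand-in, namely the largest difference in support over all conjunctions of up to `max_size` attribute values. The names `itemset_supports`, `max_support_gap`, `merge_intervals`, and the parameters `n_basic`, `delta`, and `max_size` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(rows, max_size=2):
    """Support (fraction of rows) of every conjunction of up to max_size terms.

    Each row is a tuple of discrete values; a term is an (index, value) pair.
    """
    counts = Counter()
    for row in rows:
        items = list(enumerate(row))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(rows)
    return {combo: c / n for combo, c in counts.items()}


def max_support_gap(rows_a, rows_b, max_size=2):
    """Assumed similarity measure: largest support difference over all conjunctions."""
    sa = itemset_supports(rows_a, max_size)
    sb = itemset_supports(rows_b, max_size)
    return max(
        (abs(sa.get(k, 0.0) - sb.get(k, 0.0)) for k in sa.keys() | sb.keys()),
        default=0.0,
    )


def merge_intervals(values, context, n_basic=8, delta=0.2, max_size=2):
    """Discretize `values` by merging adjacent intervals bottom-up.

    values  : list of floats, the continuous variable to discretize
    context : list of tuples holding the other (discrete) variables per row
    Returns a list of (low, high) bounds, one per final interval.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = max(1, len(order) // n_basic)
    # Start from fine-grained, roughly equal-frequency basic intervals.
    cells = [order[i:i + size] for i in range(0, len(order), size)]
    while len(cells) > 1:
        gaps = [
            max_support_gap(
                [context[i] for i in cells[j]],
                [context[i] for i in cells[j + 1]],
                max_size,
            )
            for j in range(len(cells) - 1)
        ]
        j = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[j] > delta:
            break  # even the most similar neighbours differ too much
        cells[j:j + 2] = [cells[j] + cells[j + 1]]
    return [(values[c[0]], values[c[-1]]) for c in cells]
```

With `max_size=1` the sketch compares only single-variable distributions of the context and can merge across a boundary that is visible only in combinations of variables; with `max_size=2` it also compares pairs, which is the kind of hidden pattern the abstract warns that univariate methods can destroy.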