MFS 1.0 ------- MFS is a program that implements a multiple nearest neighbor classification system. It creates an ensemble of classifiers by selecting the features used for the component NN classifiers at random. The predictions from the individual NN classifiers are combined with simple voting. The features used for the component NN classifiers are determined by randomly selecting X features from the data set. The value of X is determined by using cross-validation on the training set (i.e. try ten values from 0 to |F|). The feature sampling function is done either with or without replacement. Since the individual NN classifiers require no training, the features used are picked randomly at runtime. DISCLAIMER ---------- MFS Copyright 1997, 1998, 1999, 2000. Stephen D. Bay sbay@ics.uci.edu THIS SOFTWARE IS PROVIDED "AS IS" WITH NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES AS TO THE PERFORMANCE, FITNESS, AND MECHANTABILITY OF THE SOFTWARE FOR A PARTICULAR PURPOSE. THE ENTIRE RISK OF USING THE SOFTWARE IS WITH THE USER. THE SOFTWARE IS PROVIDED WITHOUT ANY SUPPORT OR OBLIGATION TO ASSIST WITH ITS USE. THIS SOFTWARE MAY BE USED FOR EDUCATION OR NON-COMMERCIAL RESEARCH PURPOSES ONLY. USAGE ----- Usage: mfs train_file test_file class_feature [options] There are three required parameters: train_file - contains a training set of data test_file - contains a testing set of data class_feature - the column number of the class feature (starting count at 0) MFS runs by default in ensemble mode with 100 classifiers and cross-validation selection of the subset size (number of features selected). However it can also run in the following modes: mode: -knn k nearest neighbour -fss 1nn with forward subset selection -bss 1nn with backward subset selection -mfs (default) multiple feature subsets The evaluation function for forward and backword feature selection is estimated accuracy with leave one out cross-validation. Input The training and test data files should be formated as follows: * one example per line * comma separated values * values can be continuous or categorical Note: continuous values should be normalized before use. Output MFS produces a file mfs.out which contains the class predictions for the test set of data. Parameters There are a number of optional parameters that can be used with MFS. Note that the parameters used depend on the mode. For example, -k and -kcv are only used when MFS is running as the k-nn classifier (-knn). optional parameters: -c number of classifiers (default 100) -k number of neighbours (default 1) -kcv determine k by cross-validation (-knn only) -seed random number seed (default based on time) -ss subset size (force a specific size) -sscv determine subset size through cv (default for mfs mode) -sscv_spread (10) set the spread for the '-sscv' option -swor subsets without replacement (default with replacement) -tprint print intermediate results -p print intermediate results to get help: mfs -h (-help or --help) Examples: mfs train.set test.set 9 mfs train.set test.set 9 -knn -k 1 (kNN mode with k = 1) mfs train.set test.set 9 -knn -k 5 (kNN mode with k = 5) mfs train.set test.set 9 -knn -kcv (kNN mode with k set by cross-validation) mfs train.set test.set 9 -fss (1NN with forward subset selection of features) mfs train.set test.set 9 -bss (1NN with backword subset selection of features) mfs train.set test.set 9 -ss 4 (mfs with each component NN classifier using 4 features) SCRIPTS ------- The /bin directory contains a number of scripts to make using and evaluating MFS easier. These scripts need perl5 to run. cveval - estimate performance using cross-validation cvsplit - partition a data set for cross-validation errorstat - summarize errors from cv folds scale - scales data set from (approximately) [0,1] errors - calculates errors from mfs predictions cex - extract a column from a data set (this is a helper function) max - find the maximum values in the data set min - find the minimum values in the data set trash - overwrite columns in a data set (to hide class values) To estimate the performance of MFS type: cveval data_set class_feature where data_set is the name of the data file and class_feature is the column number of the class variable (the first column starting with 0). CVEVAL automatically scales continuous variables from (approximately) [0,1]. If you would like to evaluate the other classifiers such as 1NN, kNN, FSS, or BSS simply modify the cveval script. Example: cveval glass.data 9 DISTANCE METRIC --------------- MFS uses unweighted euclidean distance for continuous features and hamming distance for symbolic features. Missing values are treated as informative and are considered to be a specific symbolic value. In the case of continuous features, a missing value is considered to have a distance of 1 to all non-missing values. Note: MFS implicitly assumes that continuous features have been scaled to (approximately) [0,1]. The scale program will do this. Not scaling the data typically results in much poorer performance. FUTURE WORK ----------- Faster Parameter Selection Currently MFS uses leave one out cross-validation to select the best parameter values. This is overkill and even using 10 fold cross-validation can speed things up considerably. Other methods that would be faster include racing and just using a tuning set (instead of full cv). Spatial Indexing Trees MFS uses a subset of features which is often much smaller than the total number of features. In experiments on databases in the UCI repository (Bay, 1999) many times MFS selected less than 10 features for classification. This suggests that spatial indexing trees which can work very efficiently for low dimensional spaces (but break down as the dimensionality increases) may be ideal here. Online Learning and Adaptation One advantage of MFS is that it preserves many of the lazy properties of the NN classifier in the ensemble. So for example, it would be trivial to modify MFS to handle incremental additions or deletions of training data at runtime. Additionally, because the NN classifier requires no learning time, we can use as many classifiers as we need at runtime, thus we can set the number of classifiers used adaptively according to the difficulty of the test pattern. REFERENCES ---------- Bay, S. D. (1999). Nearest Neighbor Classification from Multiple Feature Subsets. Intelligent Data Analysis. 3(3):191-209. Bay, S. D. (1998). Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. In Proceedings of the Fifteenth International Conference on Machine Learning. Madison, WI.