k-nearest neighbors is remarkably powerful right out of the box, with one caveat: it can be quite sensitive to how features are weighted. The first line of defense is to normalize all of the features to have a variance of one. This helps ensure that a unit change in each feature is comparable to a unit change in any other, a desirable characteristic when computing a distance metric in a space composed of all those features. However, this doesn't account for the fact that some features can be much more important than others.
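As a minimal sketch, unit-variance normalization is just a per-column division. The helper name here is my own, not anything from the text:

```python
import numpy as np

def normalize_features(X):
    """Scale each column of X to have unit variance.

    X is an (n_samples, n_features) array. Columns with zero
    variance are left unchanged to avoid division by zero.
    """
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return X / std
```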
Not everything about a cat distinguishes it from a dog
Take, for example, a k-NN classifier that categorizes animals as cats or dogs based on a collection of characteristics. Color is not likely to distinguish them well. Presence of fur gives no help at all. But retractable claws would be highly discriminative, a far more important feature than the other two. The ability to stare you in the face while disobeying you may be a strong feature too, but that is still under active investigation.
A feature can be assigned additional weight by scaling it up, that is, by multiplying all instances of it by a constant. Small differences in that feature's value will then be exaggerated, and nearest neighbors will be much more likely to be similar in that feature than in the others.
On the other hand, features can be made less important by scaling them down, by dividing by a constant. Then even large changes in that feature's value will have only a modest impact on distance. It might tip the balance between two otherwise identical data points, but all by itself it's not going to determine which data points fall within the nearest neighborhood of another.
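A toy illustration (with made-up values) shows how a feature weight scales that feature's contribution to Euclidean distance:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 3.0])          # differs from a only in feature 1

weights = np.array([1.0, 2.0])    # up-weight feature 1 by a factor of 2
d_plain = np.linalg.norm(a - b)
d_weighted = np.linalg.norm(weights * (a - b))
# Doubling the weight on feature 1 doubles its contribution to distance:
# d_plain is 3.0, d_weighted is 6.0
```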
The problem of assigning feature weights is that of choosing one scalar multiplier per feature. It is an N-dimensional optimization problem, where N is the number of features. This family of problems has a broad toolbox of solutions. Here I describe one that is simplistic, but has proven to be robust and well-suited to the quirks of the k-nearest neighbor feature weighting problem.
- Start with features that have been normalized to have a unit variance.
- Initialize weights for all features to be one.
- Iteratively try out new sets of feature weights.
- If the new set gives a better performance, keep it. Otherwise reject it and try another.
New candidate sets of weights are generated by

- taking the best set so far,
- randomly choosing one of its weights, and
- deciding by a virtual coin flip whether to multiply or divide it by a factor, `weight_scale_step`.
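The loop above can be sketched in a few lines of Python. Here `score` is a hypothetical stand-in for however you evaluate k-NN performance with a given set of weights (higher is better); the function and argument names are my own:

```python
import random

def tune_weights(score, n_features, n_iterations=1000, weight_scale_step=2.0):
    best_weights = [1.0] * n_features       # start with unit weights
    best_score = score(best_weights)
    for _ in range(n_iterations):
        candidate = list(best_weights)      # take the best set so far
        i = random.randrange(n_features)    # pick one weight at random
        if random.random() < 0.5:           # virtual coin flip
            candidate[i] *= weight_scale_step
        else:
            candidate[i] /= weight_scale_step
        candidate_score = score(candidate)
        if candidate_score > best_score:    # keep only improvements
            best_weights, best_score = candidate, candidate_score
    return best_weights
```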
In practice, although the feature weights can have a powerful influence on the performance of k-nearest neighbors, they tend to be rather insensitive to small changes. The `weight_scale_step` factor for multiplying or dividing a feature weight at each iteration to get a new candidate can therefore be large. I've had good success with `weight_scale_step = 2`. This is a hyperparameter for the method, and one that can be experimented with.
The method doesn't have any intrinsic termination condition but, like most optimization processes, it can be terminated either when it ceases to improve the results after some number of iterations, or when you run out of patience.
This method is well-suited to an incremental or online version of k-nearest neighbors. The feature weighting method has to be modified, since in that case there is no separate testing data set to work with. Instead, the performance of the candidate feature weights can be determined by a leave-one-out cross-validation (LOO-CV). In LOO-CV the entire set of previously seen data points serves as the training data set, except for one point that is held out. The predicted value of that point is compared to its actual value, and a score obtained. This is repeated for every data point in the set, and the aggregate score represents the overall LOO-CV performance. If the new feature weights give a better LOO-CV score they are retained; if not, the previous set of weights is kept.
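A sketch of LOO-CV scoring for a candidate weight set, using a 1-nearest-neighbor classifier for concreteness. The weighting and scoring scheme here is an illustrative assumption, not a fixed API:

```python
import numpy as np

def loo_cv_score(X, y, weights):
    """Fraction of points whose nearest other neighbor shares their label."""
    Xw = X * weights                  # apply the candidate feature weights
    correct = 0
    for i in range(len(Xw)):
        d = np.linalg.norm(Xw - Xw[i], axis=1)
        d[i] = np.inf                 # leave the point itself out
        nearest = np.argmin(d)
        correct += (y[nearest] == y[i])
    return correct / len(Xw)
```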
This method introduces two new parameters. The first, `weight_adjustment_delay`, is a waiting period: how many data points to collect before first trying to adjust feature weights. Starting with too few data points could lead to degenerate optimization runs and poor feature weights. The other, `weight_adjustment_interval`, is how often to try a new set of feature weights. In practice, on a data set with over 40,000 points, I have found that `weight_adjustment_delay = 1000` and `weight_adjustment_interval = 10` work well, but my experimentation suggests that performance is not very sensitive to either of those numbers.
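As a sketch of how these two parameters might schedule adjustments in the online loop (the adjustment step itself is elided, and the function name is my own):

```python
def adjustment_points(stream, weight_adjustment_delay=1000,
                      weight_adjustment_interval=10):
    """Yield the data-point counts at which a weight adjustment is attempted."""
    seen = []
    for point in stream:
        seen.append(point)
        past_delay = len(seen) >= weight_adjustment_delay
        on_interval = len(seen) % weight_adjustment_interval == 0
        if past_delay and on_interval:
            yield len(seen)   # a new candidate weight set would be tried here
```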
It is possible that this method has been described already. I haven't performed a literature search to check. My purpose in documenting it here is to have a record of what I did and why, and to share the method in case it's useful to anyone else. If you're aware of similar or identical work, please let me know and I will happily add the citation here.
This method has some similarities to Powell's method, which performs a form of gradient descent one dimension at a time. There are some differences: here the step size is fixed and geometric, and the dimensions are not handled in a fixed cyclical order, but at random. However, this doesn't change the fact that the method is subject to local minima, just like Powell's method. It's possible that the optimal combination of feature weights lies in a global minimum well that can't be reached by local descent. My conjecture is that, for all but the most pathological data sets, the optimization landscape for this problem is broad and shallow, and that local optimization is likely to find the global optimum. However, I have nothing to back this up other than intuition. It remains a subject for future research.
The strength of this method is that it requires very little twiddling. While there are hyperparameters to choose (`weight_scale_step`, the termination criterion, and, for the online implementation, `weight_adjustment_interval`), experience suggests that the results are insensitive to their exact values, and that reasonable defaults provide results every bit as good as carefully chosen ones. The ability to get high-quality answers without extensive fiddling makes this method a rarity.