The thing that makes k-nearest neighbors challenging to use with large data sets is the time it takes to calculate the distance between each new data point and all the previously observed data points. One strategy for dealing with this is to be selective about which previously seen data points to keep. A straightforward approach is to select a random subset.
This performs surprisingly well, but it’s tempting to try to improve upon it. Efficient data reduction for KNN is an open research problem. Here’s one method that performs better than random subsetting. It's shown implemented in an online fashion, where each data point is processed one at a time.
Early stopping
The simplest approach is to simply stop collecting points after a certain count, n_examples. Each point with index i_example < n_examples is added to the collection of retained points, keepers. After that, for i_example >= n_examples, all new data points are discarded.
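Here's a minimal sketch of that rule in Python, assuming examples arrive one at a time from an iterable. The function name keep_early_stopping is just for illustration; n_examples and keepers match the description above.

```python
def keep_early_stopping(stream, n_examples):
    """Retain only the first n_examples points from a data stream."""
    keepers = []
    for i_example, example in enumerate(stream):
        if i_example < n_examples:
            keepers.append(example)
        # Every example after the first n_examples is discarded.
    return keepers
```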
Early stopping with replacement
Early stopping is a surprisingly effective way to get an efficient k-nearest neighbors model, as long as the first n_examples are representative of the entire data set. This is likely to happen when the data set has been shuffled beforehand, but otherwise it’s not at all guaranteed.

To account for this, the early stopping method can be augmented with a replacement step. It starts out the same way that early stopping does, keeping the first n_examples. After that, each new example (with index i_example) is added to the set of keepers with probability n_examples / i_example. When a new data point is added, it replaces a randomly selected data point from the set of keepers. Using this scheme, each data point has roughly the same probability of being included in the set of keepers, regardless of its position in the original data set. This gets us an incremental version of random subsampling.
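A sketch of the replacement scheme, which is essentially reservoir sampling, might look like the code below. It assumes zero-based indexing and uses the n_examples / i_example acceptance probability described above; the function name is illustrative.

```python
import random


def keep_with_replacement(stream, n_examples):
    """Maintain a roughly uniform random subset of a data stream."""
    keepers = []
    for i_example, example in enumerate(stream):
        if i_example < n_examples:
            # Keep the first n_examples points outright.
            keepers.append(example)
        elif random.random() < n_examples / i_example:
            # Later points are kept with probability n_examples / i_example.
            # A kept point replaces a randomly chosen existing keeper.
            i_replace = random.randrange(n_examples)
            keepers[i_replace] = example
    return keepers
```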
Keep misclassified or wrongly predicted points
Random subsampling performs surprisingly well, but it’s tempting to try to improve on it. How best to do this is still an open research question, but one method that has shown some promise is to add mispredicted points to the set of keepers. In the case of classification, all misclassified examples are added to the set of keepers. In the case of regression, examples whose value falls outside the range of their nearest neighbors' values (higher than the highest or lower than the lowest) are added to the set of keepers. Whenever a point is selected for inclusion, it replaces a randomly selected point from the set of its k nearest neighbors.
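Here's a sketch of this rule for the regression case, assuming examples arrive as (features, value) pairs with numeric feature vectors and that Euclidean distance is used. The names and the brute-force neighbor search are illustrative, not taken from the post's implementation.

```python
import random
import numpy as np


def keep_if_mispredicted(stream, n_examples, k=5):
    """Retain hard-to-predict points from a stream of (features, value) pairs.

    Starts like early stopping, then keeps a later point only when its value
    falls outside the range of its k nearest neighbors' values. The kept point
    replaces a randomly chosen one of those k neighbors.
    """
    keepers = []  # list of (features, value) pairs
    for i_example, (features, value) in enumerate(stream):
        features = np.asarray(features, dtype=float)
        if i_example < n_examples:
            keepers.append((features, value))
            continue

        # Find the k nearest neighbors among the current keepers.
        kept_features = np.array([f for f, _ in keepers])
        distances = np.linalg.norm(kept_features - features, axis=1)
        nearest = np.argsort(distances)[:k]
        neighbor_values = [keepers[i][1] for i in nearest]

        # Keep the point only if its value is outside its neighbors' range.
        if value > max(neighbor_values) or value < min(neighbor_values):
            i_replace = random.choice(list(nearest))
            keepers[i_replace] = (features, value)
    return keepers
```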
This method of data reduction focuses on the difficult-to-predict examples in a data set. It overrepresents outliers and anomalies. The flip side of this is that it underrepresents easy-to-predict examples. This is exactly what we want. It focuses the resources of our k-nearest neighbors model on the trickiest regions of the feature space.
The parts of the feature space with many similar examples get thinned out a little bit. Having a large cluster of points with similar labels adds a computational burden to the model without adding much information. Because k-nearest neighbors is local, including anomalies doesn’t hurt its performance, even if they are errors.
My limited experimentation with this method shows that it improves on random subsampling a little bit, but not much. That result may differ from one data set to the next.
For a detailed Python implementation and case study on a data set of diamond prices, check out this repository. To my knowledge this is new, but I haven’t done a serious literature search. If you want to cite it you can reference this post. I don’t have anything more formal published on it just yet.