The thing that makes k-nearest neighbors challenging to use with large data sets is the time it takes to calculate the distance between each new data point and all the previously observed data points. One strategy for dealing with this is to be selective about which previously seen data points to keep. A straightforward approach is to select a random subset.

This performs surprisingly well, but it’s tempting to try and improve upon it. Efficient data reduction for KNN is an open research problem. Here’s one method that performs better than random subsetting. It's shown implemented in an online fashion, where each data point is processed at one at a time.

## Early stopping

The simplest approach is to simply stop collecting points after a certain point, `n_examples`. For all points with index `i_example < n_examples`, add it to the collection or retained points, `keepers`. After that, for `i_example >= n_examples`, discard all new data points.

## Early stopping with replacement

Early stopping is a surprisingly effective way to get an efficient k-nearest neighbors model, as long as the first `n_examples` are representative of the entire data set. This is likely to happen when the data set has been shuffled beforehand, but otherwise it’s not at all guaranteed. To account for this, the early stopping method can be augmented with a replacement step.

It starts out the same way that early stopping does, keeping the first `n_examples`. After that, each new example (with index `i_example`) is added to the set of `keepers` with probability `n_examples / i_example`. When a new data point is added, it replaces a randomly selected data point from the set of `keepers`. Using this scheme each data point has roughly the same probability of being included in the set of `keepers`, regardless of its position in the original data set. This gets us an incremental version of random subsampling.

## Keep misclassified or wrongly predicted points

Random subsampling performs surprisingly well, but it’s tempting to try to improve on it. How to best do this is still an open research question, but one method that has shown some promise is to add mispredicted points to the set of `keepers`.

In the case of classification, all misclassified examples are added to the set of `keepers`.

In the case of regression, examples whose value is outside the range of all of its nearest neighbors (higher than the highest or lower than the lowest) are added to the set of `keepers`.

Whenever a point is selected for inclusion, it replaces a randomly selected point from the set of its k nearest neighbors.

This method of data reduction focuses on the difficult to predict examples in a data set. It overrepresents outliers and anomalies. The flipside of this is that it underrepresents easy to predict examples. This is exactly what we want. It focuses the resources of our k-nearest neighbors model in the trickiest regions of the feature space.

The parts of the feature space with many similar examples get thinned out a little bit. Having a large cluster of points with similar labels adds a computational burden to the model without adding much information. Because k-nearest neighbors is local, including anomalies doesn’t hurt its performance, even if they are errors.

My limited experimentation with this method shows that it improves on random subselection a little bit, but not much. That result may be different with each data set.

For a detailed Python implementation and case study on a data set of diamond prices, check out this repository. To my knowledge this is new, but I haven’t done a serious literature search. If you want to cite it you can reference this post. I don’t have anything more formal published on it just yet.