The thing that makes *k-*nearest neighbors challenging to use
with large data sets is the time it takes to calculate the distance
between each new data point and all the previously observed
data points. One strategy for dealing with this is to be selective
about which previously seen data points to keep.
A straightforward approach is to select a random subset.

This performs surprisingly well, but it’s tempting to try to improve on it. Efficient data reduction for KNN is an open research problem. Here’s one method that performs better than random subsetting. It’s implemented here in an online fashion, where each data point is processed one at a time.

## Early stopping

The simplest approach is to simply stop collecting points after
a certain number, `n_examples`. For all points
with index `i_example < n_examples`, add each one to the
collection of retained points, `keepers`. After that,
for `i_example >= n_examples`, discard all new data
points.
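In code, the rule above is a few lines. This is a minimal sketch, assuming the data arrives as an iterable of points; the function and variable names are illustrative, not from any particular library:

```python
def early_stopping_filter(stream, n_examples):
    """Keep only the first n_examples points from a stream; discard the rest."""
    keepers = []
    for i_example, point in enumerate(stream):
        if i_example < n_examples:
            keepers.append(point)
        # Points with i_example >= n_examples are silently discarded.
    return keepers
```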

## Early stopping with replacement

Early stopping is a surprisingly effective way to get
an efficient *k-*nearest neighbors model, as long as the first
`n_examples` are representative of the entire data set.
This is likely to happen when the data set has been shuffled
beforehand, but otherwise it’s not at all guaranteed.
To account for this, the early stopping method can be augmented
with a replacement step.

It starts out the same way that
early stopping does, keeping the first `n_examples`.
After that, each new example (with index `i_example`)
is added to the set of `keepers` with probability
`n_examples / i_example`. When a new data point
is added, it replaces a randomly selected data point from the
set of `keepers`. Using this scheme, each data point
has roughly the same probability of being included in the set
of `keepers`, regardless of its position in the
original data set. This gets us an incremental version of
random subsampling.
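This replacement scheme is the classic reservoir-sampling idea (Algorithm R). A minimal sketch in Python, with 1-based indices so the `n_examples / i_example` probability makes each point's inclusion exactly uniform; the `seed` parameter is an addition for reproducibility:

```python
import random

def reservoir_filter(stream, n_examples, seed=None):
    """Keep a uniform random subset of n_examples points from a stream."""
    rng = random.Random(seed)
    keepers = []
    for i_example, point in enumerate(stream, start=1):
        if i_example <= n_examples:
            keepers.append(point)
        elif rng.random() < n_examples / i_example:
            # Replace a randomly chosen keeper with the new point.
            keepers[rng.randrange(n_examples)] = point
    return keepers
```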

## Keep misclassified or wrongly predicted points

Random subsampling performs surprisingly well, but it’s tempting
to try to improve on it. How to best do this is still an open
research question, but one method that has shown some promise
is to add mispredicted points to the set of `keepers`.

In the case of classification, all misclassified
examples are added to the set of `keepers`.

In the case of regression,
examples whose value is outside the range of all of their nearest
neighbors (higher than the highest or lower than the lowest)
are added to the set of `keepers`.
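The regression criterion reduces to a one-line check. A sketch, where `neighbor_ys` (an assumed name) holds the target values of the point's *k* nearest keepers:

```python
def is_outside_neighbor_range(y, neighbor_ys):
    """True when the target sits above all neighbor values or below all of them."""
    return y > max(neighbor_ys) or y < min(neighbor_ys)
```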

Whenever a point is selected for inclusion, it replaces a randomly
selected point from the set of its *k* nearest neighbors.
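Putting the pieces together for the classification case, a sketch of the online loop might look like this. Seeding the model with the first `n_seed` points, and all function and parameter names, are my assumptions for illustration, not details from the post:

```python
import random
import numpy as np

def knn_predict(keepers_X, keepers_y, x, k):
    """Majority vote among the k nearest keepers; returns (label, neighbor indices)."""
    dists = np.linalg.norm(np.array(keepers_X) - np.array(x), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [keepers_y[i] for i in nearest]
    return max(set(votes), key=votes.count), nearest

def keep_misclassified(stream, k, n_seed, seed=None):
    """Online data reduction: retain only points the current model misclassifies."""
    rng = random.Random(seed)
    keepers_X, keepers_y = [], []
    for x, y in stream:
        if len(keepers_X) < n_seed:
            # Seed the model with the first n_seed points, kept unconditionally.
            keepers_X.append(x)
            keepers_y.append(y)
            continue
        pred, nearest = knn_predict(keepers_X, keepers_y, x, k)
        if pred != y:
            # Misclassified: the new point replaces a randomly selected
            # member of its k nearest neighbors.
            i_replace = rng.choice(list(nearest))
            keepers_X[i_replace] = x
            keepers_y[i_replace] = y
    return keepers_X, keepers_y
```

Because every inclusion replaces one of the new point's neighbors, the size of `keepers` stays fixed after seeding.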

This method of data reduction focuses on the difficult-to-predict
examples in a data set. It overrepresents outliers and anomalies.
The flip side is that it underrepresents easy-to-predict
examples. This is exactly what we want: it focuses the resources
of our *k-*nearest neighbors model on the trickiest regions of
the feature space.

The parts of the feature space with many similar examples get
thinned out a little bit. Having a large cluster of points with
similar labels adds a computational burden to the model
without adding much information. Because *k-*nearest neighbors
is local, including anomalies doesn’t hurt its performance,
even if they are errors.

My limited experimentation with this method shows that it improves on random subsampling a little bit, but not much. That result may be different with each data set.

For a detailed Python implementation and case study on a data set of diamond prices, check out this repository. To my knowledge this is new, but I haven’t done a serious literature search. If you want to cite it you can reference this post. I don’t have anything more formal published on it just yet.