MNIST, 99% accuracy
- 1,398 parameters. Three-layer Sharpened Cosine Similarity with paired depthwise and pointwise operations. (code)
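All of the SCS entries on this leaderboard are built from the same core operation: the cosine similarity between each input patch and each kernel, with the magnitude raised to a learned exponent. Below is a minimal PyTorch sketch of a single sharpened cosine similarity layer, assuming a per-channel exponent `p` and a shared norm floor `q`; the class name, parameter names, and initial values are illustrative, not the exact code linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharpCosSim2d(nn.Module):
    """Minimal sharpened cosine similarity layer (a sketch, not the exact
    leaderboard code). Computes cosine similarity between each input patch
    and each kernel, then raises the magnitude to a learned exponent p,
    with a small learned floor q added to the norms."""

    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, padding=0, eps=1e-6):
        super().__init__()
        self.stride = stride
        self.padding = padding
        self.eps = eps
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size))
        nn.init.kaiming_normal_(self.weight)
        # Learned sharpening exponent (per output channel) and norm floor
        # (shared); these choices are assumptions for the sketch.
        self.p = nn.Parameter(torch.full((out_channels,), 2.0))
        self.q = nn.Parameter(torch.full((1,), 0.1))

    def forward(self, x):
        # Dot product of each kernel with each input patch.
        dot = F.conv2d(x, self.weight, stride=self.stride, padding=self.padding)
        # Norm of each input patch, via a ones-kernel convolution over x**2.
        ones = torch.ones_like(self.weight[:1])
        patch_norm = torch.sqrt(
            F.conv2d(x * x, ones, stride=self.stride, padding=self.padding)
            + self.eps) + self.q.abs()
        # Norm of each kernel.
        kernel_norm = self.weight.flatten(1).norm(dim=1) + self.q.abs()
        cos = dot / (patch_norm * kernel_norm.view(1, -1, 1, 1))
        # Sharpen: keep the sign, raise the magnitude to the power p.
        return cos.sign() * cos.abs().pow(self.p.abs().view(1, -1, 1, 1))
```

The MNIST entry above pairs depthwise and pointwise versions of this operation; the other SCS entries below stack full kernels of the sizes listed, with pooling and a small classifier head whose exact details vary by entry.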
Fashion MNIST, 90% accuracy
- 7,659 parameters. Three-layer Sharpened Cosine Similarity with 12 5x5 kernels in each layer. (code)
CIFAR-10, 80% accuracy
- 47,643 parameters. Three-layer Sharpened Cosine Similarity with 30 5x5 kernels in each layer. (code)
CIFAR-10, 90% accuracy
- 103,000 parameters. ConvMixer-256/8, from "Patches Are All You Need?", achieved 91.26%; see the sketch after this list. (paper, code)
- 639,702 parameters. kEffNet-B0, an EfficientNet with paired pointwise convolutions, achieved 91.64%. (paper)
- 1.2M parameters. SCS-based network achieved 91.3%. (code)
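For comparison with the SCS entries, the ConvMixer entry above follows the minimal formulation from "Patches Are All You Need?": a patch-embedding convolution followed by repeated depthwise-plus-pointwise blocks. The sketch below uses that structure with illustrative kernel and patch sizes; ConvMixer-256/8 means a hidden dimension of 256 and a depth of 8, and the exact kernel/patch sizes behind the 91.26% result are not restated here.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around an inner module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim, depth, kernel_size=5, patch_size=2, n_classes=10):
    """ConvMixer sketch: patch-embedding conv, then `depth` blocks of
    depthwise conv (with residual) plus pointwise conv, each followed by
    GELU and BatchNorm. kernel_size/patch_size defaults are illustrative
    guesses, not the leaderboard configuration."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )

# ConvMixer-256/8: hidden dimension 256, depth 8.
model = conv_mixer(256, 8)
```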
ImageNet top-1, 80% accuracy
Why parameter efficiency?
A model's performance has many dimensions, and parameter efficiency is one that tends to get overlooked. If two models reach similar accuracy but one has fewer parameters, it will probably be cheaper to store, run, distribute, and maintain. Some model families are inherently more parameter efficient than others, but those differences don't show up on accuracy leaderboards. This is a chance for parameter-efficient architectures to get their time in the spotlight.
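Parameter counts like the ones above are usually taken as the number of trainable parameters. Assuming a PyTorch model, that count can be read off directly:

```python
def count_parameters(model):
    """Count the trainable parameters of a PyTorch nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_parameters(conv_mixer(256, 8)) for the ConvMixer sketch above
```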
Isn't this just a cherry-picked metric that sharpened cosine similarity does well on?
Yes.