Reservoir Sampling, Or How To Sample Sets Of Unknown Size

Selecting a random sample from a set is simple. But what about selecting a fair random sample from a set of unknown or indeterminate size? That’s where reservoir sampling comes in, and [Sam Rose] has a beautifully-illustrated, interactive guide to how reservoir sampling works. As far as methods go, it’s as elegant as it is simple, and particularly suited to fairly sampling dynamic datasets like sipping from a firehose of log events.

While reservoir sampling is simple in principle it’s not entirely intuitive to everyone. That’s what makes [Sam]’s interactive essay so helpful; he first articulates the problem before presenting the solution in a way that makes it almost self-evident.

[Sam] uses an imaginary deck of cards to illustrate the problem. If one is being dealt cards one at a time from a deck of unknown size (there could be ten cards, or a million), how can one choose a single card in a way that gives each an equal chance of having been selected? Without collecting them all first?

In a nutshell, the solution is to make a decision every time a new card arrives: hold onto the current card, or replace it with the new one. Each new card is given a 1/n chance of becoming held, where n is the number of cards we’ve seen so far. That’s all it takes. No matter when the dealer stops dealing, each card that has been seen will have had an equal chance of ending up the one selected.

There are a few variations which [Sam] also covers, and practical ways of applying it to log collection, so check it out for yourself.

If [Sam]’s knack for illustrating concepts in an interactive way is your jam, we have one more to point out. Our own Al Williams wrote a piece on Turing machines; the original “universal machine” being a theoretical device with a read/write head and infinite paper tape. A wonderful companion to that article is [Sam]’s piece illustrating exactly how such a Turing machines would work in an interactive way.

Intuitive Explanation Of Arithmetic, Geometric & Harmonic Mean

The simple definition of a mean is that of a numeric quantity which represents the center of a collection of numbers. Here the trick lies in defining the exact type of numeric collection, as beyond the arithmetic mean (AM for short, the sum of all values divided by their number) there are many more, with the other two classical Pythagorean means being the geometric mean (GM) and harmonic mean (HM).

The question that many start off with, is what the GM and AM are and why you’d want to use them, which is why [W.D.] wrote a blog post on that topic that they figure should be somewhat intuitive relative to digging through search results, or consulting the Wikipedia entries.

Compared to the AM, the GM uses the product of the values rather than the sum, which makes it a good fit for e.g. changes in a percentage data set. One thing that [W.D] argues for is to use logarithms to grasp the GM, as this makes it more obvious and closer to taking the AM. Finally, the HM is useful for something like the average speed across multiple trips, and is perhaps the easiest to grasp.

Ultimately, the Pythagorean means and their non-Pythagorean brethren are useful for things like data analysis and statistics, where using the right mean can reveal interesting data, much like how other types using something like the median can make a lot more sense. The latter obviously mostly in the hazy field of statistics.

No matter what approach works for you to make these concepts ‘click’, they’re all very useful things to comprehend, as much of every day life revolves around them, including concepts like ‘mean time to failure’ for parts.


Top image: Cycles of sunspots for the last 400 years as an example data set to apply statistical interpretations to. (Credit: Robert A. Rohde, CC BY-SA 3.0)

A Second OctoPrint Plugin Has Been Falsifying Stats

The ongoing story of bogus analytical data being submitted to the public OctoPrint usage statistics has taken a surprising turn with the news that a second plugin was being artificially pushed up the charts. At least this time, the developer of the plugin has admitted to doing the deed personally.

Just to recap, last week OctoPrint creator [Gina Häußge] found that somebody had been generating fictitious OctoPrint usage stats since 2022 in an effort to make the OctoEverywhere plugin appear to be more popular than it actually was. It was a clever attempt, and if it wasn’t for the fact that the fake data was reporting itself to be from a significantly out of date build of OctoPrint, there’s no telling how long it would have continued. When the developers of the plugin were confronted, they claimed it was an overzealous user operating under their own initiative, and denied any knowledge that the stats were being manipulated in their favor.