-
Enhancement
-
Resolution: Unresolved
-
P4
-
None
-
None
Add APIs for selecting a specific number of random samples, without replacement, from an input consisting of a known or unknown number of elements.
If the input size is known, one can use Algorithm S from Knuth, TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling. See also https://stackoverflow.com/a/28655112/1441122 . This could be implemented with a method in the Collections class that takes a Collection, the sample count, and optionally a source of randomness.
If the input size is not known, one can use a "reservoir sampling" technique. See https://en.wikipedia.org/wiki/Reservoir_sampling . A reservoir sampling method could be implemented as a Gatherer for a Stream. It would also take a sample count and optionally a source of randomness.
It might be possible for the Gatherer-based API to be adaptive, depending upon whether the stream is SIZED. However, it doesn't appear that the Gatherer has access to the stream's size even if the stream is SIZED.
An edge case is when the requested sample size is smaller than the actual population. Options are to return a short Stream or List, or to throw an exception.
If the input size is known, one can use Algorithm S from Knuth, TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling. See also https://stackoverflow.com/a/28655112/1441122 . This could be implemented with a method in the Collections class that takes a Collection, the sample count, and optionally a source of randomness.
If the input size is not known, one can use a "reservoir sampling" technique. See https://en.wikipedia.org/wiki/Reservoir_sampling . A reservoir sampling method could be implemented as a Gatherer for a Stream. It would also take a sample count and optionally a source of randomness.
It might be possible for the Gatherer-based API to be adaptive, depending upon whether the stream is SIZED. However, it doesn't appear that the Gatherer has access to the stream's size even if the stream is SIZED.
An edge case is when the requested sample size is smaller than the actual population. Options are to return a short Stream or List, or to throw an exception.