Saturday, January 15, 2022

[vgkevtpx] sampling large binomial with random-fu

here are some results of single samples from the binomial distribution using Data.Random.Distribution.Binomial in the random-fu Haskell package, version 0.2.7.0 .

N=320000000000000000000000000000000 p=0.5 sample=160000000000000017763568049748834

N=330000000000000000000000000000000 p=0.5 sample=165000000000000000000000000000000

N=320000000000000000000000000000000 p=6.25e-2 sample=20000000000000004440892003541839

N=330000000000000000000000000000000 p=6.25e-2 sample=20625000000000000000000000000000

Haskell source code.

in the first line, our one sample from the binomial distribution is equivalent to simulating flipping a fair coin 3.2*10^32 times and counting the number of heads.  (however, all these results are calculated instantly.)  things seem to work correctly: about half heads, with statistical noise of order sqrt(N) as would be expected from a random sample.

in the second line, we flip 3.3*10^32 times.  things have gone wrong: the number of heads is precisely half with no statistical noise.

the following two examples use an unfair coin whose probability of success is 1/16.  (we use probabilities expressible exactly in binary to eliminate decimal-to-binary conversion as a possible source of noise).  things go wrong similarly: the threshold does not depend on the probability.

the threshold seems to be around N=2^108, probably related to 53 bits of mantissa in double precision (108 / 2 = 54).  (incidentally, 2^108 ~= 3.2*10^32.  the digits of the decimal mantissa coincide with the exponent.  previously similar.)

this issue is mentioned in comments in the source code of integralBinomial random-fu, but not in the Haddock documentation.  (that Hackage makes it easy to browse source code makes it much better than similar library documentation for other programming languages.  one wonders if the Log4j/Log4shell vulnerability would have been discovered sooner if Java documentation had made it easier to view implementation source code.)

on one hand, of course it's unfortunate that things go wrong for large N.  on the other hand, the failure mode is not too bad.  (there might not be much incentive to fix the bug.)  if you're going to sample the binomial, getting the mode (the most likely result) as an answer might be fine for many applications.

inspired by wanting to draw lots and lots of stars.  if one always gets the mode, there will not be pixels containing a statistically unusual large number of stars.

sampling the binomial distribution efficiently for large N is a non-trivial but seemingly well studied topic.  the comment in random-fu's source code cites Knuth's TAOCP.  wikipedia cites:

Devroye, Luc (1986) Non-Uniform Random Variate Generation, New York: Springer-Verlag.

and links to an online chapter of the book.

No comments :