Saturday, February 22, 2014

[rzzllapd] Nonrandomness of the first random sample

The first random sample after a call to mkStdGen seed is extremely poorly distributed:

module Main where {
import System.Random (StdGen, mkStdGen, randomR);
import Data.Array.IArray (Array, accumArray, assocs);
rollDice :: StdGen -> (Int, StdGen);
rollDice = randomR (1,6);

get_first :: Int -> Int;
get_first seed = fst $ rollDice $ mkStdGen seed;

histogram :: [Int] -> Array Int Int;
histogram l = accumArray (+) 0 (1,6) $ do { x <- l; return (x,1) };
main :: IO ();
main = mapM_ print $ assocs $ histogram $ map get_first [0..53667];
}

Output, with GHC 7.4.2:
(1,0)
(2,0)
(3,0)
(4,0)
(5,0)
(6,53668)

All seeds from 0 to 53667 roll an initial 6. The second sample seems not bad (maybe even too good!), suggesting a fix of discarding the first sample:
better_mkStdGen seed = snd $ rollDice $ mkStdGen seed;

(1,8947)
(2,8942)
(3,8943)
(4,8948)
(5,8944)
(6,8944)

Here is a comparable C++ program which demonstrates that glibc does a good job for the first sample, which is not too surprising given the extensive set up in the internal procedure __srandom_r().

#include <cstdlib>
#include <cstdio>
int rollDice(){
  return rand() / (RAND_MAX + 1.0) * 6 + 1;
}
int get_first(unsigned int seed){
  srand(seed);
  return rollDice();
}
int main(){
  //We only need 1 through 6, quick hack around 0-indexing.
  int histogram[7] = {0};
  for (unsigned int seed=0; seed<=53667; ++seed) {
    histogram[get_first(seed)]++;
  }
  for (int i=1;i<=6;++i){
    printf("%d %d\n", i, histogram[i]);
  }
}

Output with GCC 4.7.2:
1 8984
2 9112
3 8929
4 8885
5 8788
6 8970

My previous commentary on GHC Random.

No comments :