## Monday, August 13, 2018

### [byhnlnng] Commonly occurring substrings in pi

We calculate the most frequently occurring N-digit sequences in the first 10^N digits of pi past the decimal point.

N=1: The most frequently occurring single digit in the first 10^1=10 digits of pi past the decimal point is 5.  It occurs 3 times: 1415926535.

N=2: The most frequently occurring 2-digit sequence in the first 10^2=100 digits of pi past the decimal point is 62.  It occurs 4 times: 1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.

Below is a table extended up to 10-digit sequences in the first 10^10 digits of pi.  When there are ties, all sequences are listed.

(N, number of occurrences, [the N-digit sequences that occur most frequently])
(1,3,)
(2,4,)
(3,5,[019,105,171,446,609])
(4,6,[2019,2916,7669])
(5,8,[17663,91465])
(6,9,)
(7,9,[0123880,1296878,2009579,2399789,2572735,3707493,4515163,4900343,5707393,8730991,9219118,9435538])
(8,11,[12058610,77114204])
(9,12,)
(10,12,[2063323454,2612299524,2907884134,4880022211,5078602578,5421790974,7646964010,7711937174,8193684176,9522257435,9725290655])

We stopped at N=10 because, for a computation involving pi in base 10, stopping at 10^10 digits seemed appropriate.

It is curious that the middle column, number of occurrences, is non-decreasing.  We don't expect this trend to always be true.

Assuming the digits of pi are random, we expect a typical N-digit sequence to occur once in 10^N digits.  This isn't exactly correct because digit sequences overlap so are not independent.  I don't know the probability distribution of the number of occurrences of the most common sequence.

Here is Haskell code used in the computation.  We used strict state threads (Control.Monad.ST) to avoid lazy thunks from piling up, a frequent occurrence in long computationally intensive tasks which produce output only at the end.  We used `seq` to get rid of one giant space leak at the end when finding the largest item (and its index) of the array discussed below.

We used the unboxed `STUArray s Integer Word8` to compactly store the counts for each possible digit string.  We furthermore packed 2 entries into each Word8, one in the upper 4 bits and one in the lower, at the cost of limiting maximum counts to 15 (which turned out to be enough: the maximum seen was 12).  For N=10, we were therefore expecting 5 GB memory usage for the 10^10 digit run, but memory use was 9848268k maxresident (9.8 GB) according to /usr/bin/time, so there is still a space leak somewhere.  CPU time was 6000 seconds.