Friday, November 15, 2019

[qqpkuwpt] Approximately sampling Zipf

Here is Pari/GP code that generates 1000 samples from a distribution approximating the (exponent=1) Zipf distribution over 100 items. The approximation replaces the harmonic sums of the true Zipf cumulative distribution with log and exp, so sampling reduces to exponentiating a uniform random number.

for(i=1, 1000, print(floor(exp(log(101) * random(1.0)))))
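The same inverse-CDF sampler can be sketched in Python (a hypothetical port, not part of the original Pari/GP code; `approx_zipf` is a name chosen here for illustration):

```python
import math
import random

def approx_zipf(n):
    """Sample from an approximate exponent-1 Zipf distribution over 1..n.

    Draw u uniform in [0, 1) and return floor((n+1)**u).  Since
    (n+1)**u ranges over [1, n+1), the result lies in 1..n, with
    small values much more likely than large ones.
    """
    u = random.random()
    return math.floor(math.exp(math.log(n + 1) * u))

samples = [approx_zipf(100) for _ in range(1000)]
```

As in the Pari/GP one-liner, no table of probabilities is built; a single uniform draw is transformed directly into a sample.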

Here is sample output:

19 59 20 5 88 11 1 36 19 20 19 8 10 34 4 3 2 32 4 35 4 12 6 1 2 3 33 17 31 1 2 16 16 48 58 61 4 16 2 66 61 7 43 3 11 18 9 1 1 1 22 6 4 32 33 24 4 18 43 18 5 26 2 62 82 3 7 1 3 35 1 6 30 1 2 8 1 4 14 58 6 17 1 1 9 99 4 2 2 1 1 1 51 2 16 6 19 3 24 69 28 23 12 12 15 1 3 2 5 49 85 2 7 4 4 28 1 5 36 55 32 1 2 2 29 15 17 4 4 31 1 2 1 1 19 49 12 4 3 7 4 71 6 98 1 83 8 8 53 1 1 1 93 9 5 18 4 16 90 2 12 1 7 31 16 6 7 4 17 27 4 64 63 8 7 10 35 29 88 1 2 1 66 12 65 1 3 3 70 70 9 5 17 4 4 1 1 7 3 6 3 15 71 10 26 52 1 14 1 21 1 20 2 4 36 43 4 3 1 1 18 28 25 9 45 20 99 1 72 58 1 38 9 10 1 1 1 8 71 55 3 15 2 9 4 68 1 42 11 92 1 2 4 21 2 4 28 17 31 1 62 4 2 3 1 8 2 4 65 5 3 4 3 25 3 21 55 9 6 6 4 30 44 6 1 32 2 80 1 6 33 1 77 67 16 2 6 13 10 8 1 14 7 2 6 3 4 16 3 20 7 56 20 3 7 4 75 54 2 31 3 13 4 20 1 8 8 55 11 16 1 67 8 2 40 5 4 13 24 1 12 12 8 21 1 1 70 13 32 12 3 4 5 8 1 12 1 5 2 2 13 1 49 19 19 28 3 53 1 10 8 77 4 71 10 47 38 55 1 1 2 3 5 2 11 2 27 90 4 2 14 80 2 67 4 14 3 12 37 1 36 12 17 4 1 7 75 1 48 6 1 62 29 97 32 8 23 9 29 19 1 13 5 5 4 64 1 36 8 12 39 7 56 2 20 30 46 16 9 40 14 1 1 56 1 10 61 90 61 1 1 2 1 58 2 4 34 1 17 1 41 3 6 3 6 3 50 8 1 32 5 26 78 1 100 37 1 11 2 56 4 12 2 27 12 21 2 4 2 11 2 29 74 58 1 43 3 29 3 2 8 1 2 7 14 38 91 6 3 10 3 21 57 4 1 2 29 4 38 7 2 2 45 18 8 1 49 4 1 7 38 2 16 46 1 7 1 22 2 94 25 12 5 25 89 32 46 3 1 27 28 4 14 5 1 5 7 3 1 5 2 12 16 5 16 46 12 56 9 5 1 1 1 30 45 34 2 7 69 1 64 9 7 1 3 36 11 58 2 43 5 83 23 17 23 48 50 2 1 84 3 5 2 4 1 27 55 8 9 1 74 4 90 1 5 68 23 1 31 14 4 70 25 2 93 21 9 46 3 9 27 2 8 4 28 75 2 96 70 21 53 10 1 4 72 24 46 83 1 97 70 10 4 6 72 22 4 55 11 33 35 6 2 8 3 41 2 5 2 52 2 17 6 7 37 14 10 19 94 3 22 70 54 53 35 3 40 8 99 3 8 14 56 2 13 1 22 9 60 1 12 5 3 52 1 18 46 3 7 2 33 1 12 29 14 18 2 80 71 1 40 16 6 17 7 28 4 8 2 2 6 1 2 83 8 20 47 49 7 1 1 14 78 9 30 4 47 25 1 14 32 98 5 11 9 17 3 34 55 11 3 100 8 99 1 2 2 30 13 11 1 50 8 1 9 4 12 95 14 1 26 7 33 2 6 38 2 50 1 24 5 9 43 95 15 2 2 5 100 3 4 40 33 4 
34 76 1 2 10 5 2 25 50 11 1 18 2 95 31 79 65 35 8 1 12 15 20 6 80 11 4 8 2 3 16 5 1 3 12 3 5 2 2 1 1 77 17 1 1 1 36 7 1 9 14 45 2 5 20 1 2 37 93 13 60 5 45 35 20 54 2 67 30 4 6 6 61 1 7 32 45 98 99 1 15 2 6 10 93 5 1 14 6 7 1 10 21 39 20 11 2 88 16 7 1 3 1 14 51 2 29 7 1 1 81 77 15 1 2 6 6 5 1 10 1 52 9 22 19 10 5 17 97 6 8 3 3 26 5 10 2 18 12 6 25 14 6 47 22 15 1 1 12 43 6 14 2 2 13 2 19 10 2 4 1 3 4 32 1 65 48 14 27 2 10 3 1 18 4 1 3 1 25 1 75 2 51 1 1 28 13 37 1 8 9 29 4 58 1 5

Here is Perl code which computes the exact probability of each number: floor(exp(log(101)*u)) equals i exactly when log(i)/log(101) <= u < log(i+1)/log(101), an interval of width (log(i+1)-log(i))/log(101).

perl -we 'for $i ( 1 .. 100 ){ printf "( %d, %.3f%% )\n", $i, 100 * (log($i+1)-log($i)) / log(101) }'

Note that the multiplication by 100 simply converts the probability into a percentage and has nothing to do with sampling among 100 items.
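The same computation in Python (a sketch mirroring the Perl one-liner) also confirms that the probabilities telescope to exactly 1, since the sum of log(i+1)-log(i) over i=1..100 is log(101):

```python
import math

n = 100
# P(i) = (log(i+1) - log(i)) / log(n+1), the width of the interval of
# uniform draws u that floor((n+1)**u) maps to i.
probs = {i: (math.log(i + 1) - math.log(i)) / math.log(n + 1)
         for i in range(1, n + 1)}

total = sum(probs.values())  # telescopes to log(n+1)/log(n+1) = 1
print(f"P(1) = {100 * probs[1]:.3f}%, P(100) = {100 * probs[100]:.3f}%")
# -> P(1) = 15.019%, P(100) = 0.216%
```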

The distribution is as follows:

( 1, 15.019% ) ( 2, 8.786% ) ( 3, 6.233% ) ( 4, 4.835% ) ( 5, 3.951% ) ( 6, 3.340% ) ( 7, 2.893% ) ( 8, 2.552% ) ( 9, 2.283% ) ( 10, 2.065% ) ( 11, 1.885% ) ( 12, 1.734% ) ( 13, 1.606% ) ( 14, 1.495% ) ( 15, 1.398% ) ( 16, 1.314% ) ( 17, 1.239% ) ( 18, 1.172% ) ( 19, 1.111% ) ( 20, 1.057% ) ( 21, 1.008% ) ( 22, 0.963% ) ( 23, 0.922% ) ( 24, 0.885% ) ( 25, 0.850% ) ( 26, 0.818% ) ( 27, 0.788% ) ( 28, 0.760% ) ( 29, 0.735% ) ( 30, 0.710% ) ( 31, 0.688% ) ( 32, 0.667% ) ( 33, 0.647% ) ( 34, 0.628% ) ( 35, 0.610% ) ( 36, 0.594% ) ( 37, 0.578% ) ( 38, 0.563% ) ( 39, 0.549% ) ( 40, 0.535% ) ( 41, 0.522% ) ( 42, 0.510% ) ( 43, 0.498% ) ( 44, 0.487% ) ( 45, 0.476% ) ( 46, 0.466% ) ( 47, 0.456% ) ( 48, 0.447% ) ( 49, 0.438% ) ( 50, 0.429% ) ( 51, 0.421% ) ( 52, 0.413% ) ( 53, 0.405% ) ( 54, 0.398% ) ( 55, 0.390% ) ( 56, 0.384% ) ( 57, 0.377% ) ( 58, 0.370% ) ( 59, 0.364% ) ( 60, 0.358% ) ( 61, 0.352% ) ( 62, 0.347% ) ( 63, 0.341% ) ( 64, 0.336% ) ( 65, 0.331% ) ( 66, 0.326% ) ( 67, 0.321% ) ( 68, 0.316% ) ( 69, 0.312% ) ( 70, 0.307% ) ( 71, 0.303% ) ( 72, 0.299% ) ( 73, 0.295% ) ( 74, 0.291% ) ( 75, 0.287% ) ( 76, 0.283% ) ( 77, 0.280% ) ( 78, 0.276% ) ( 79, 0.273% ) ( 80, 0.269% ) ( 81, 0.266% ) ( 82, 0.263% ) ( 83, 0.259% ) ( 84, 0.256% ) ( 85, 0.253% ) ( 86, 0.250% ) ( 87, 0.248% ) ( 88, 0.245% ) ( 89, 0.242% ) ( 90, 0.239% ) ( 91, 0.237% ) ( 92, 0.234% ) ( 93, 0.232% ) ( 94, 0.229% ) ( 95, 0.227% ) ( 96, 0.225% ) ( 97, 0.222% ) ( 98, 0.220% ) ( 99, 0.218% ) ( 100, 0.216% )

It's a bit more heavy-tailed than the true Zipf distribution.  In the true Zipf distribution, the probability of "1" is 100 times the probability of "100"; here the ratio is only about 70.  However, heavy-tailed might be desirable anyway, because the point of choosing Zipf (e.g., for game design) might be to avoid distributions like the geometric distribution, whose tail thins out very quickly.
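The tail comparison can be quantified with a short sketch ("true Zipf" here meaning p(i) proportional to 1/i, normalized by the harmonic number):

```python
import math

n = 100
# True exponent-1 Zipf over 1..n: p(i) = (1/i) / H(n)
H = sum(1 / i for i in range(1, n + 1))
true_ratio = ((1 / 1) / H) / ((1 / n) / H)  # = n = 100

# Log-based approximation: p(i) proportional to log(i+1) - log(i)
approx_ratio = math.log(2) / (math.log(n + 1) - math.log(n))
print(approx_ratio)  # roughly 70: the last item is relatively more likely
```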

Previously, we considered the exponent-2 Zipf distribution, for which the infinite series converges, allowing sampling among an infinite number of items.
