Tuesday, October 20, 2020

[bfxtuwae] Zipf 24 entropy

Consider a Zipf distribution over 24 items. In particular, consider the exponent-1 Zipf distribution over taoiswcbphfmdrelnguvyjkq, the 24 most common initial letters in decreasing order of frequency, dropping z and x, which are rare as initial letters.  Note that we are throwing away the empirical frequencies and keeping only their ranks, then assigning Zipf probabilities to the ranked letters.

This Zipf distribution has entropy of 3.84 bits per letter.

s=1/sum(k=1 , 24 , 1/k)
0-sum(k=1 , 24 , s/k * log(s/k) / log(2))
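
For concreteness, here is a Perl sketch (not part of the original calculation) that ties the Zipf probabilities to the ranked letters and reports the same entropy:

use strict;
use warnings;

my @letters = split //, "taoiswcbphfmdrelnguvyjkq";   # ranks 1..24
my $h = 0;
$h += 1/$_ for 1 .. 24;                               # normalizing constant H_24
my $entropy = 0;
for my $i (1 .. 24) {
    my $p = 1 / ($i * $h);                            # Zipf(1) probability of the rank-$i letter
    printf "%s %.4f\n", $letters[$i - 1], $p;
    $entropy -= $p * log($p) / log(2);
}
printf "%.2f bits per letter\n", $entropy;            # prints 3.84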

It is slightly easier to sample from a different distribution that approximates Zipf: give rank i probability (log(i+1) - log(i)) / log(25), so the cumulative distribution function is log(i+1) / log(25) and can be inverted in closed form.  This distribution achieves entropy of 4.02 bits per letter because it has a heavier tail.

perl -lwe '$e=0 ; for $i (1..24) { $p = (log($i+1)-log($i)) / log(25) ; $e += $p*log($p)/log(2) } print 0-$e'

(The empirical distribution, limited to 24 letters, has entropy of 4.09 bits per letter.  Uniform sampling over 24 letters has entropy of 4.58 bits per letter.  Uniform over 26 letters has 4.70 bits per letter.)

Here is a random sampling of 64 letters from this easier-to-implement modified Zipf distribution, representing about 257 bits of entropy.

o b a o t w w n f g b l u t i i t w i o t o a w f i i w t a c w o o w o w t a v q w r w s e t e s a m i t p m w p t t d n i t d
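
One way such a sample could have been drawn, as a sketch: invert the cumulative distribution function from above, mapping a uniform variate u to rank floor(25^u).  Perl's rand is used only for illustration; actual key material would need a cryptographically strong source.

use strict;
use warnings;

my @letters = split //, "taoiswcbphfmdrelnguvyjkq";
my @sample;
for (1 .. 64) {
    # u uniform in [0,1) => floor(25**u) equals i with probability (log(i+1)-log(i))/log(25)
    my $rank = int(25 ** rand());
    push @sample, $letters[$rank - 1];
}
print "@sample\n";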

Consider composing memorable prose (or poetry) for a given sequence of initial letters.  64 letters might be useful for encoding a 256-bit Elliptic Curve Cryptography public key, e.g., Curve25519.  To translate given data into a sample from a Zipf distribution, use arithmetic coding, in particular the decoding direction.  (Previously, on doing arithmetic coding in fixed-precision binary, though we don't have to do that if the data is small.)  The output will be Zipf-distributed only if the input data resembles uniform randomness; if it does not, perhaps compress the input first.  (Note that we can do better for creating memorable private keys, but remember to do key stretching if you are deriving a private key from a short password.)
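
Here is a sketch of that decoding step in Perl, using exact rational arithmetic (Math::BigRat) rather than fixed precision, which is fine for small inputs.  It decodes against the exact Zipf(1) probabilities, whose cumulative sums are rational, rather than the log-based approximation, so it emits roughly 67 letters for 256 bits.  The hex input is a made-up placeholder, and the stopping rule glosses over the end-of-stream handling a complete arithmetic coder would need.

use strict;
use warnings;
use Math::BigInt;
use Math::BigRat;

my @letters = split //, "taoiswcbphfmdrelnguvyjkq";

# exact Zipf(1) probabilities and cumulative sums, all rational
my $h = Math::BigRat->new(0);
$h += Math::BigRat->new("1/$_") for 1 .. 24;
my @p = map { Math::BigRat->new("1/$_") / $h } 1 .. 24;
my @c = (Math::BigRat->new(0));
push @c, $c[-1] + $_ for @p;

# interpret the input as a fraction x = N / 2^256 in [0,1)
my $bits = 256;
my $hex  = "0123456789abcdef" x 4;                    # placeholder key material (64 hex digits = 256 bits)
my $n    = Math::BigInt->from_hex($hex);
my $den  = Math::BigInt->new(2)->bpow($bits);
my $x    = Math::BigRat->new($n->bstr() . "/" . $den->bstr());

my $width = Math::BigRat->new(1);                     # width of the interval pinned down so far
my $cell  = Math::BigRat->new("1/" . $den->bstr());
my @out;
while ($width > $cell) {                              # stop once 256 bits are (roughly) determined
    my $i = 0;
    $i++ while $x >= $c[$i + 1];                      # which letter's interval contains x?
    push @out, $letters[$i];
    $x     = ($x - $c[$i]) / $p[$i];                  # zoom in: rescale x to [0,1)
    $width = $width * $p[$i];
}
print scalar(@out), " letters: @out\n";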

Consider arranging the 64 letters into an 8x8 grid (other shapes are possible, e.g., 4x4x4), then reading the letters off in several directions for redundancy and composing prose for each direction.  Avoid using the same word for a letter in the same grid position among the different texts.  This will be tricky for common letters like A.
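
A sketch of that arrangement, using the sample from above and reading it back in two of the possible orders (rows and columns):

use strict;
use warnings;

# the 64 sampled letters from above, row-major, eight per row
my @s = split //, "obaotwwn" . "fgblutii" . "twiotoaw" . "fiiwtacw"
                . "oowowtav" . "qwrwsete" . "samitpmw" . "pttdnitd";
my @grid;
push @grid, [ splice @s, 0, 8 ] while @s;             # build the 8x8 grid

print "rows:    ", join(" ", map { @$_ } @grid), "\n";
print "columns: ", join(" ", map { my $c = $_; map { $grid[$_][$c] } 0 .. 7 } 0 .. 7), "\n";
# boustrophedon rows, diagonals, spirals, etc. can be read off the same grid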

(How many nice orderings are there of letters in a grid?)

Given a random string with a certain amount of entropy, adding letters does not decrease the entropy and could make the string easier to memorize.  Adding letters is acceptable when the string represents random entropy, not coded data.  In that case, it is better to sample letters from a uniform distribution rather than from Zipf, though probably still omitting x and z: 4.58 bits per letter.
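
For that case, a one-liner in the same spirit as the one above (illustration only: Perl's rand is not a cryptographically strong source):

perl -lwe '@l = ("a" .. "w", "y"); print join " ", map { $l[int rand 24] } 1 .. 64'   # alphabet is a..w plus y (omits x and z)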
