Wednesday, October 03, 2018

[xbbyusrf] Entropy of initial letters

Frequencies of initial letters of English words, extracted from Peter Norvig's analysis of Google N-grams:

t 118849337945
a 86893227065
o 56762601060
i 54256321754
s 49734339142
w 40889321958
c 38962746440
b 32978416403
p 32124438693
h 31238837884
f 29952197540
m 28460312383
d 23610994165
r 21020892617
e 20821104413
l 17967162608
n 16990703679
g 12211759680
u 8801470813
v 6129410533
y 5677068230
j 3801073430
k 3390029632
q 1647734497
z 336016172
x 335403585

sum 743842922321

Entropy (calculated as -sum p log2 p): 4.1009 bits per letter.  Omitting Z and X lowers it to 4.0933.  Compare with the uniform distribution: log2 26=4.7004, log2 24=4.5850.
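
A minimal Python sketch of that calculation, using the counts listed above (the printed values should match the figures quoted above):

import math

# Initial-letter counts from Norvig's Google N-grams analysis, as listed above.
counts = {
    't': 118849337945, 'a': 86893227065, 'o': 56762601060, 'i': 54256321754,
    's': 49734339142, 'w': 40889321958, 'c': 38962746440, 'b': 32978416403,
    'p': 32124438693, 'h': 31238837884, 'f': 29952197540, 'm': 28460312383,
    'd': 23610994165, 'r': 21020892617, 'e': 20821104413, 'l': 17967162608,
    'n': 16990703679, 'g': 12211759680, 'u': 8801470813, 'v': 6129410533,
    'y': 5677068230, 'j': 3801073430, 'k': 3390029632, 'q': 1647734497,
    'z': 336016172, 'x': 335403585,
}

def entropy_bits(freqs):
    """Shannon entropy, -sum p log2 p, of an unnormalized count distribution."""
    freqs = list(freqs)
    total = sum(freqs)
    return -sum(c / total * math.log2(c / total) for c in freqs)

print(entropy_bits(counts.values()))                                # ~4.1009
print(entropy_bits(v for k, v in counts.items() if k not in 'zx'))  # ~4.0933
print(math.log2(26), math.log2(24))                                 # 4.7004..., 4.5850...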

The distribution above is from a corpus: if the same word occurs multiple times, it is counted multiple times.  T occurs so frequently because "the" occurs so frequently.  This is not the distribution you would get from a word list with duplicates eliminated; that approach would instead tell you which section of a dictionary is thickest, or which volumes of an encyclopedia are thickest.

Create a memorable mnemonic sentence from randomly sampled letters.  Does having the letter distribution match real English make it easier to create mnemonic sentences?
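
A quick sketch of the sampling step, reusing the counts dict from the sketch above (random.choices weights the draw by the raw counts):

import random

letters = list(counts)             # counts: the initial-letter dict defined earlier
weights = list(counts.values())

def sample_initials(n):
    """Draw n initial letters i.i.d. with probability proportional to the counts."""
    return ''.join(random.choices(letters, weights=weights, k=n))

# e.g. 'tsaowhib'; now invent a sentence whose words start with these letters
print(sample_initials(8))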

Are some of the high-frequency initial letters dominated by a few very frequent words?  Sampling independently will yield many Ts in a row, but the common word can't simply be repeated ("the the" doesn't work), though we could keep inserting extra words in between.  Maybe we want a first-order Markov chain over initial letters.  It is not easy to get one directly from the Google N-grams, because they treat the comma as a word; then again, we would freely allow adding commas to a sentence anyway.
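
As a sketch of what estimating such a chain might look like, assume a hypothetical iterator of (word1, word2, count) bigram records (not the actual Google N-gram file format, which would need cleanup, e.g. deciding what to do with the comma "words"):

from collections import defaultdict

def initial_letter_transitions(bigrams):
    """Estimate P(next word's initial letter | current word's initial letter)
    from (word1, word2, count) records.  The input format here is hypothetical."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for w1, w2, c in bigrams:
        a, b = w1[0].lower(), w2[0].lower()
        if a.isalpha() and b.isalpha():   # skip comma and other punctuation "words"
            pair_counts[a][b] += c
    P = {}
    for a, row in pair_counts.items():
        total = sum(row.values())
        P[a] = {b: c / total for b, c in row.items()}
    return P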

Calculating the entropy rate of a Markov chain is a bit more involved but well studied.
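
For a chain with transition matrix P and stationary distribution pi, the entropy rate is -sum_i pi_i sum_j P_ij log2 P_ij.  A sketch, assuming P is a dict-of-dicts like the one returned above and that every reachable letter has its own row:

import math

def entropy_rate(P, iterations=1000):
    """Entropy rate -sum_i pi_i sum_j P[i][j] log2 P[i][j] of a Markov chain,
    with the stationary distribution pi found by power iteration."""
    states = list(P)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = dict.fromkeys(states, 0.0)
        for i in states:
            for j, p in P[i].items():
                nxt[j] += pi[i] * p      # pi_{t+1}(j) = sum_i pi_t(i) P(i -> j)
        pi = nxt
    return -sum(pi[i] * p * math.log2(p)
                for i in states for p in P[i].values() if p > 0)

Conditioning can only lower entropy, so this comes out at most as large as the entropy of the chain's stationary distribution; the gap measures how much one initial letter constrains the next.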
