Friday, June 06, 2014

[qplzrxby] Letter and polygram frequencies

Part 1

In order of decreasing frequency, here are the common English polygrams (single letters, bigrams, trigrams, etc., also called n-grams) gathered from a private corpus.  We arbitrarily stop at the least frequent single letter, namely "z". There are 326 polygrams including the 26 single letters.

e t a o i n s r h l d u c th m g y p he w f in b the re er an on at ha v ou it en ng or k to es st ing te ar is ti al nd se nt ed as ve le me ll ea hi ne co hat of li de be tha no ri ca ic ot ro ly and ho ut om us that so io ra el et ma wh ce pe ta wa ch la fo ion ur si ent ec di il do ge ee pr her thi yo ac ns ow for ul ke all un ere we ss id ct em rs lo tio wi tion you rt ay ad mo ld po ai tr bu su mi wo hin oo pa pl ts ter gh av sh os nc ol ig ther ey ab am op ir na ni im ie if ate j thin not sa x ver fi bl are go bo one ev rea out ave tu ati oul was ry uld ould hey ome vi ba hav they here iv res ci tt ith ck ag con fe but ers have ia lly eve ap rd wit fr ally da pro sta mp his tin with ill ny ex hou ess ty cl com ht nk up sp ye ei can ear ght int est nce ki ff ep ted use ug ons od igh men ov ting som some pp ive atio our ore ation hing eo ect ak uc sti cou ple any ue ide ust ju ik pre gr cr der ef gi tho rr um ink ga ls there ike wha fa act han sc what ant thing ment q ell ua cu ight art ew bi whe oi oun ugh this nte per ica ble ist wou lik au like ru ove ob woul would rn z

Part 2

To avoid the redundancy between polygrams like "woul" and "would" in Part 1, we choose a subset of them by iteratively finding the length of the longest polygrams more frequent than the original frequency of "z", then greedily selecting the most frequent polygram among those with that length.  We cut out that polygram from the corpus, then reanalyze and repeat.  (This was a long computation.)  The polygrams got eliminated in the following order.  This list of 126 polygrams has far less redundancy.

5 letters: ation there thing would
4 letters: that they have tion ally with ting some ther what this ight like thin
3 letters: the ing and ent for you ter ver one are not ate out was con all but pro res ere ill com rea ear use can cou any nce ple ust ive hou ers ell ess ant ble
2 letters: in to it re is on ed er es of en ar or as al an ic be st se ly ch ow at id wh le ac ay me un ad ro su ge if th mo go pe so lo do de us po ab he ir im am ld tr ur ne ap gh il we ve ts ke no ma ex ul ta te co pl

The remaining single letters in decreasing order of frequency were
s t i a m e d o p c h l u f y r b n w g k v j q z

The letter "x" dropped below the "z" frequency threshold after elimination of the "ex" bigram.

Part 3.1

Part 2 cut out occurrences of a polygram out of a word, so, for example, the word "others" became two words "o" and "s" in the corpus after elimination of "ther". In this part, we instead replace a common polygram with a single composite character, so the 5-letter word "other" becomes the 3-letter word "o(the)r", where (the) counts as only one letter. We choose composite characters greedily, maximizing at each step the scoring function N*(L-1), where N is the number of occurrences and L is the length of the polygram. This scoring function is equivalent to the total number of keystrokes that would be saved to type the entire corpus if the polygram could be typed with one keystroke. After substituting highest scoring composite polygram throughout the corpus, we reanalyze and repeat. (This was a long computation.) We arbitrarily stop when the best score drops below the frequency of "z", finding 126 polygrams:

the in re an that on er ou it st en (in)g or al at to is ar le as ed th es of ic (an)d ly have om be se wh no (ou)ld i(on) but ac ight peop(le) (en)t id ll w(it)h like f(or) ch y(ou) ay ad pro me ab(ou)t we su s(om)e ge he lo ve if a(re) do v(er) un (th)(in)k go (ou)gh de (no)t po w(as) (on)e mo (the)y ju(st) fr(om) am so ir us im te ne c(on) ce il tr t(er) ap how (al)(ly) (be)cau(se) ct c(om) ab (at)e (al)l i(ll) o(the)r ur (at)(i(on)) ak pe ig (an)y ts p(re) ol sh (the)(re) k(no)w la (th)((in)g) d(if)fe(re)n ex (ou)nd (th)(is) (wh)(at) (wh)(ic)h c(an) (ac)t li ri ro w((ou)ld) m((en)t)

Part 3.2

Here are the 218 composite polygrams selected in greedy order when using the scoring function N*L instead of N*(L-1). We now need to explicitly forbid creating useless one-letter composite characters.

th in (th)e re an on at er ou st en (in)g it or al to is ar le as ed es of ic (an)d ly om (th)(at) be se wh no ve i(on) ac (en)t ha id ll f(or) y(ou) ay ch (ou)ld me ut ro li we su gh ad ge lo he ab wi ke pe do a(re) v(er) if un go de (no)t ri s(om)e po (ou)t w(as) (on)e mo ((th)e)y so am ir us ne c(on) b(ut) ce ct (pe)op(le) (ha)(ve) te (al)(ly) t(er) (th)(in)k il (wi)(th) ho c(om) im p(ro) ju(st) (gh)t fr(om) (at)e (al)l (at)(i(on)) (an)y ((th)e)r ur fe p(re) la ma tr ((th)e)(re) (th)((in)g) ap ex ts co (th)(is) (wh)(at) c(an) (ac)t (er)s (ou)n w((ou)ld) m((en)t) ss ta (li)(ke) (no)w ye qu (be)cau(se) (ar)t (an)t up sh d(on) (ab)((ou)t) ul ig ag bo my (en)d (ge)t (ar)d pl sa i(st) (ou)r by (wh)(ic)h w(or) em (ho)w (wh)o (mo)(re) ol k((in)g) ti h(as) (ou)(gh) i(ll) um d(er) i(ve) p(er) (es)s w(ay) (in)d od (it)y (re)d v(en) ok (th)(an) gu (in)e ra (re)n o(((th)e)r) (do)(es) (ac)k sp ((th)e)m publ(ic) (or)t (id)e fo (wh)e(re) (ic)e d(id) ty s(ay) mu(ch) (wi)(ll) (we)(ll) (as)s k((no)w) pr t((in)g) op ud d(ay) cl (wh)(en) (ou)s (an)s ck a(st) ((ou)n)d b(le) ci gr wn (ab)(le) (v(er))y (of)f (in)t

Part 3.3

Here are the 62 composite polygrams selected in greedy order when using the scoring function N*(L-2).

the that ing tion and ould ent have for all people think with you about some ter though what ver ight this like not one are ate because was just from con but which pro res more ere ill com rea ear actu(all)y can than any does nce time ive well know (the)re ers when ess ound ant able th(ing) out part

Source code

Here is the Haskell source code.  The main optimization was to preprocess the corpus into a list of words with word counts with sort | uniq -c.

The Google Books n-grams corpus is another choice, but I don't like how it includes as common words words like "iii" which is probably an artifact of OCR scanning Roman numeral page numbers.

Previous vaguely similar attempt, in Perl.. Previous thoughts invoking Braille contractions. I have heard anecdotes that Braille contractions were derived from frequencies in the Bible.

No comments :