Thursday, November 23, 2017

[ccyzhjzt] Words colliding on merged letters

What if two letters in English were written with the same symbol? Which pairs of letters would create little ambiguity, with only one of the two possible readings forming a valid word? Which pairs would create a lot of ambiguity, with both readings forming valid words?

We investigate this using the /usr/share/dict/american-english-huge word list from the wamerican-huge package version 7.1-1 in Ubuntu 14.04. For each pair of letters we calculate a score based on the number and size of collisions that occur if both letters were replaced by the same symbol.

For a collision set of size N, we assess a quadratic penalty of N*(N-1)/2. This choice of penalty formula was somewhat arbitrary. The largest set was of size 6, when D and S are merged to the same glyph: dodded dossed dosses sodded sossed sosses. This contributed a penalty of 15. The sum of penalties for all the collisions merging D and S was 9379. We give the scores for all 325 pairs of letters below.
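To make the scoring concrete, here is a minimal Python sketch of the penalty calculation. It is not the Perl code linked below; the function name merge_score is made up for illustration, and it only handles lowercase words. It groups words by what they look like after the merge, then sums N*(N-1)/2 over the groups.

```python
from collections import defaultdict

def merge_score(words, a, b):
    """Sum N*(N-1)/2 over every set of N words that become identical
    when letters a and b are written with the same symbol."""
    # Write every b as a; the choice of representative does not matter.
    table = str.maketrans(b, a)
    groups = defaultdict(set)
    for w in words:
        groups[w.translate(table)].add(w)
    # Groups of size 1 contribute 0, so non-colliding words cost nothing.
    return sum(len(g) * (len(g) - 1) // 2 for g in groups.values())

# The largest collision set from the D/S merge: 6 words, penalty 6*5/2.
words = ["dodded", "dossed", "dosses", "sodded", "sossed", "sosses"]
print(merge_score(words, "d", "s"))  # 15
```

Running this over the whole word list for each of the 325 pairs would give a table like the one at the end of this post, assuming the word list is read into words one entry per line.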

Merging D and S had the worst score of all. Many verbs ending in E have both a 3rd-person singular form ending in -ES and a past tense form ending in -ED, and these two forms collide when D and S are merged.

We took some care with capital letters. The assumption was that, when merging two letters, the two lowercase letters would be merged into one glyph and the two corresponding uppercase letters into a second glyph, distinct from the merged lowercase one.
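The case handling described above could look something like the following Python sketch (again an illustration, not the original Perl; merge_key is a hypothetical name). Lowercase and uppercase are translated separately, so a capitalized word never collides with its lowercase twin just because of the merge.

```python
def merge_key(word, a, b):
    """Return word as it would appear if a and b shared a glyph,
    with separate merged glyphs for lowercase and uppercase."""
    # b -> a and B -> A; lowercase never maps onto uppercase or vice versa.
    table = str.maketrans(b.lower() + b.upper(), a.lower() + a.upper())
    return word.translate(table)

print(merge_key("San", "d", "s") == merge_key("Dan", "d", "s"))  # True
print(merge_key("dan", "d", "s") == merge_key("Dan", "d", "s"))  # False
```

Two words collide exactly when their keys are equal, so this key can be used to group the word list before applying the penalty.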

The lowest-scoring pairs JQ, QU, and QZ all have the same form. For JQ, for example, the single-letter "words" j and q collide, as do J and Q, and the lowercase pluralizations js and qs. The pair QY has the same 3 kinds of collision plus coreq and corey. I speculate coreq is short for co-requisite, and corey is core + y, meaning core-like.

Future work:

Incorporate word frequencies into the score, for example from Google Books n-grams (with n=1).

Design a keyboard layout in which pairs of letters that score poorly (and so yield typos that a spell checker cannot catch) are far apart. Having D and S next to each other is terrible on the QWERTY keyboard layout. Having I and O adjacent is also very bad if word frequency is considered: if/of in/on.

Considering pairs of letters gives us the best way to reduce the alphabet from 26 to 25 letters for a Polybius square (tap code). This could be taken further. What is the best way to reduce the alphabet to 24 letters while minimizing the ambiguity created? Maybe merge 2 pairs, or maybe merge 3 letters into 1.
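As a starting point for the 24-letter question, here is a hedged Python sketch that scores an arbitrary grouping of letters and enumerates the two kinds of candidate mentioned above. The names partition_score and candidates_24 are invented for this example, and it reuses the same quadratic penalty and case handling purely as an assumption; a different penalty might be more appropriate for larger merges.

```python
from collections import defaultdict
from itertools import combinations
from string import ascii_lowercase

def partition_score(words, groups):
    """Quadratic penalty when every letter in each group shares one glyph.
    groups: disjoint lowercase strings, e.g. ("ds",) or ("ai", "bs")."""
    src = "".join(groups)
    # Each letter is rewritten as the first letter of its group.
    dst = "".join(g[0] * len(g) for g in groups)
    table = str.maketrans(src + src.upper(), dst + dst.upper())
    buckets = defaultdict(int)
    for w in set(words):
        buckets[w.translate(table)] += 1
    return sum(n * (n - 1) // 2 for n in buckets.values())

def candidates_24():
    """Every way to get from 26 letters to 24: one triple, or two disjoint pairs."""
    for t in combinations(ascii_lowercase, 3):
        yield ("".join(t),)
    for p, q in combinations(combinations(ascii_lowercase, 2), 2):
        if set(p).isdisjoint(q):
            yield ("".join(p), "".join(q))

words = ["bad", "bid", "bed", "sad"]
print(partition_score(words, ("ai",)))       # 1: bad/bid
print(partition_score(words, ("aie",)))      # 3: bad/bid/bed
print(partition_score(words, ("ai", "bs")))  # 3: bad/bid/sad
print(sum(1 for _ in candidates_24()))       # 47450
```

There are 2600 triples and 44850 disjoint pairs of pairs, so a brute-force search over a huge word list is feasible but slow if every candidate re-translates every word.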

Source code in Perl, and output listing all the word collisions for all letter pairs.

Scores for every pair:

jq=3 qu=3 qz=3 qy=4 aq=6 iq=6 qx=6 ju=7 qv=7 eq=8 hq=8 oq=9 fq=10 qw=10 bq=13 lq=13 nq=14 pq=15 qr=16 jx=19 uz=19 mq=20 ux=20 qt=21 aj=29 dq=30 qs=30 cq=31 ej=33 jo=34 fx=37 ix=39 uv=42 hx=43 gq=44 iz=45 oz=48 ox=50 az=54 ex=54 ij=54 kq=60 ov=61 hu=64 iv=64 xz=65 av=67 fu=68 ku=73 ax=76 ez=77 jz=81 bu=87 fi=90 ev=94 xy=96 bx=99 vx=101 yz=103 kx=109 wx=110 bi=111 mu=119 cu=120 gu=126 du=127 jv=131 jk=134 vy=134 fo=137 gx=142 hi=142 mx=143 cx=147 im=150 hz=151 wz=151 ik=152 ip=155 pu=155 dx=157 fz=161 lx=165 vz=165 jy=166 px=169 ko=178 ci=179 gi=180 kz=184 iw=187 ef=190 bo=194 fy=195 af=202 cz=211 go=211 jw=214 aw=216 sx=223 rx=231 tx=234 bz=239 hv=239 am=244 uy=244 ab=245 di=246 tu=247 jn=257 uw=258 ky=260 ak=262 ag=263 mo=265 ow=267 fj=270 gz=271 cj=274 do=276 cy=279 lu=283 ew=288 co=297 nu=297 mz=298 nx=298 ho=299 ah=301 pz=304 hj=309 rz=309 op=315 su=319 nz=321 no=325 be=328 js=328 jm=337 dj=338 fk=339 ap=343 gj=353 lz=353 jp=362 ad=364 em=364 fv=366 it=367 jl=374 hy=376 vw=379 jr=381 jt=383 ce=392 by=395 gy=398 kv=399 wy=404 bj=405 gv=406 my=413 ly=414 in=417 ot=427 py=427 dz=428 cv=441 eh=456 bk=468 ny=469 ru=470 bv=474 ep=479 kw=480 ir=510 ek=512 at=524 tz=529 os=554 lo=563 is=567 an=568 nv=569 gh=574 il=576 dv=587 dy=593 fh=593 oy=603 eg=604 hk=622 rv=626 mv=627 ty=630 pv=633 ry=637 ay=639 sv=642 al=657 fw=658 lv=668 or=678 hn=690 gw=703 tv=709 sz=711 fn=730 nw=742 en=743 gk=745 dw=765 de=775 ar=782 ac=784 kr=785 dh=788 mw=800 km=816 cw=844 fg=858 bh=874 cf=882 sw=887 fr=895 kp=896 kn=902 hw=903 df=904 rw=904 ks=912 bw=921 lw=927 fm=928 bn=953 ch=955 dk=965 kl=994 gm=1041 gr=1055 hm=1067 pw=1074 as=1078 gl=1083 er=1087 el=1091 et=1099 fl=1101 fs=1112 cl=1128 hr=1132 hp=1146 iy=1155 ck=1161 cg=1167 cm=1197 fp=1197 cd=1200 gn=1219 hl=1230 cr=1231 tw=1244 bc=1292 bs=1313 gp=1316 gs=1316 bl=1321 ft=1324 hs=1332 bf=1342 kt=1378 mn=1387 pr=1402 ms=1428 cn=1478 dp=1505 bd=1516 ht=1523 cp=1542 mr=1550 bg=1576 br=1578 bm=1657 dg=1660 np=1660 lp=1679 ps=1726 dm=1732 
lm=1795 bt=1809 cs=1838 bp=1860 eu=1870 ct=1874 gt=1926 dl=1944 es=2024 dn=2056 mp=2182 ns=2195 ls=2246 ey=2276 ou=2326 ln=2412 sy=2524 nt=2562 rt=2582 pt=2585 iu=2639 lt=2657 dt=2755 nr=2827 io=2875 au=3006 st=3566 lr=3713 mt=3765 eo=4036 ai=4498 ei=4616 rs=4661 ao=5230 ae=5982 dr=7439 ds=9379
