Thursday, June 18, 2020

[uschhyiz] Binary data as letters

We consider the task of encoding binary data using the letters A through Z.  (Perhaps then transmitting it using the NATO phonetic alphabet.)

We first consider the case in which the data is in 8-bit bytes.  The following block sizes achieve increasing efficiency:

1 byte = 2 letters
4 bytes = 7 letters (not a convergent)
7 bytes = 12 letters
17 bytes = 29 letters (not a convergent)
27 bytes = 46 letters (not a convergent)
37 bytes = 63 letters (not a convergent)
47 bytes = 80 letters
151 bytes = 257 letters

We do not go beyond the above sizes because, if we are using letters, that probably means there is a human in the loop, and humans are going to struggle with random-looking sequences of letters longer than this.

It was surprising that there were efficient sizes which are not continued fraction convergents.  (I had previously assumed continued fractions got all the best approximations.)  The sequence 7 17 27 37 47 is also surprising.

Next, we examine the case where data is in bits.  The following block sizes achieve increasing efficiency:

4 bits = 1 letter
9 bits = 2 letters (not a convergent)
14 bits = 3 letters
47 bits = 10 letters
1114 bits = 237 letters (not a convergent)

contfracpnqn(contfrac(log(26)/log(2)),15)

? eff=0 ; for(i=3 , +oo , letters=ceil(log(2)/log(26)*i) ; m=i/letters ; if(m>eff , eff=m ; print(i," ",letters," ",eff*1.0)))

Particularly intriguing is 47 bits, because it could be 40 bits (5 bytes) of payload data and 7 bits of sequence and parity (error-detection, ECC error-correcting code) information.  Or 32 bits of data and 15 bits of sequence and error detection.

However, instead of using blocks, better would be to use arithmetic coding.  In a future post, we will describe how to do arithmetic coding efficiently.   However, for small amounts of data, naively doing arithmetic coding with arbitrary precision arithmetic is not too bad.

No comments :