Friday, September 03, 2021

[pvldatlw] improving octal (with continued fractions)

if you have data in bytes but wish to transmit it through a channel that only carries decimal digits, one way to encode is to turn each byte into a 3-digit number between 000 and 255, then concatenate the digits.  this achieves an efficiency of 1/3 = 0.333 bytes per digit.  we can do better by considering a chunk of 2 bytes at a time, encoding 2 bytes into a 5-digit number 00000-65535.  this achieves an efficiency of 2/5 = 0.4 bytes per digit.  (unspecified here: what to do if the data has an odd number of bytes; typically some sort of padding.)  the next best chunk size is 7 bytes going to 17 digits (00000000000000000-72057594037927935), or 7/17 = 0.412 bytes per digit.  the theoretical maximum efficiency is log(10)/log(256) = 0.4152 bytes per digit.  the sequence of increasing-efficiency byte-chunk sizes is as follows:

1 2 7 12 17 22 71 120 169 218 485 752 1019 1286 1553 1820 2087 2354 2621 2888 3155 3422 3689 3956 4223 4490 4757 5024 5291 5558 5825 6092 6359 6626 6893 7160 7427 7694 7961 8228 8495 8762 9029 9296 9563 9830 10097 10364 10631 10898 11165 11432 11699 11966 12233 12500 12767 13034 13301 13568 13835 14102 113083 438230 763377 7958917 15154457 22349997 29545537 36741077 190900925 345060773 499220621 653380469 807540317 961700165
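
here is a minimal sketch in gp of encoding and decoding a single chunk (the function names are mine, not from any library): a chunk of n bytes is read as a base-256 integer and written out as ceil(n*log(256)/log(10)) decimal digits, zero-padded on the left.  (floating-point ceil is fine at modest chunk sizes; a robust implementation would compute the digit count exactly.)

encodechunk(bytes) = my(w=ceil(#bytes*log(256)/log(10)), v=fromdigits(bytes,256)) ; vector(w, k, (v \ 10^(w-k)) % 10)
decodechunk(digs, nbytes) = my(v=fromdigits(digs,10)) ; vector(nbytes, k, (v \ 256^(nbytes-k)) % 256)

for example, encodechunk([255,255]) returns [6, 5, 5, 3, 5], and decodechunk([6, 5, 5, 3, 5], 2) returns [255, 255].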

for high-efficiency chunk sizes N, the number 256^N expressed in decimal begins with a long string of initial 9s.
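
this is because for those N the product N*log(256)/log(10) falls only slightly short of an integer D, so 256^N is only slightly less than 10^D.  a quick check in gp, printing the leading decimal digits of 256^N for a few N from the list (output not reproduced here):

v=[22, 218, 2621, 14102, 113083] ; for(i=1, #v, d=digits(256^v[i]) ; print(v[i], " ", d[1..10]))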

the largest chunk size for which GNU MP (gmp) can do radix conversion is 17.2 metric gigabytes (16 gibibytes, 2^34 bytes), so we've stopped the list above at the largest practically usable chunk size, 961700165 bytes (961 Mbyte).  the next term, 21311563478 (19.8 GiB), is too large.

here is a brute-force PARI/GP script to generate the sequence above:

l1=List() ; r=log(256)/log(10) ; best=+oo ; for(denom=1 , 42*10^9 , if(denom%10^9==0 , print("check ",denom)) ; x=denom*r ; numer=ceil(x) ; delta=ceil(x)-x ; if(delta<best , best=delta ; listput(l1,denom) ; print(denom," ",delta," ",contfrac(numer/denom))))

here is a much more efficient script.  (how much more efficient?)  the script works by truncating continued fractions.  for odd-length truncations, increment the final term by one, evaluate the truncated continued fraction, and take its denominator.  for even-length truncations, evaluate a sequence of truncated continued fractions, with the last term replaced in turn by each value from 2 up to its actual value.  (if the last term is 1, that even-length truncation contributes nothing.)

l2=List() ; c=contfrac(log(256)/log(10)) ; for(i=1 , matsize(c)[2] , e=vecextract(c,vector(i,u,u)) ; if(i%2 , e[i]=e[i]+1 ; denom=contfracpnqn(e)[2,1] ; listput(l2,denom) ; print(denom," ",e), savelast=e[i] ; for(j=2 , savelast , e[i]=j ; denom=contfracpnqn(e)[2,1] ; print(denom," ",e) ; listput(l2,denom))))

one can verify that list l1 (the brute-force calculation above) is a prefix of l2.
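
for example, with both scripts having been run in the same gp session, the following comparison should return 1:

Vec(l1) == Vec(l2)[1..#l1]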

although i had previously known generally that finding good fractional approximations of an irrational number involves continued fractions, i had not learned until now the extra detail of systematically modifying the last term.  because we only want fractions on one side of the given value, the modified last term ranges from 2 to N, its actual value.  for best fractions on both sides, the modified last term ranges from roughly N/2 to N.  this is best understood by printing the (terminating) continued fraction representations of the best fractions found by brute force, as demonstrated in the first script above.
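
for example, the continued fraction of log(256)/log(10) begins [2, 2, 2, 4, 2, 4, 1, 52, ...].  the odd-length truncation [2, 2, 2] with its last term incremented, [2, 2, 3], evaluates to 17/7, giving chunk size 7 (7 bytes into 17 digits).  the even-length truncation [2, 2, 2, 4], with its last term replaced by 2, 3, and 4 in turn, evaluates to 29/12, 41/17, and 53/22, i.e., chunk sizes 12, 17, and 22.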

slicing data into bits rather than bytes allows chunk sizes that are not multiples of 8, achieving slightly higher efficiency at comparable chunk sizes.  below is a sequence of bit chunk sizes of increasing efficiency.  encoding 3 bits at a time is well known as octal, 3 bits to 1 decimal digit 0-7, for an efficiency of 3 bits per digit or 0.375 bytes per digit.  (this is an improvement over 0.333 bytes per digit, the 1-byte chunk encoding mentioned above.)  the theoretical maximum efficiency is log(10)/log(2) = 3.32 bits per digit or 0.4152 bytes per digit, the same as above with byte chunks.

1 2 3 13 23 33 43 53 63 73 83 93 289 485 2621 4757 6893 9029 11165 13301 42039 112816 183593 254370 579517 904664 1229811 1554958 1880105 2205252 2530399 2855546 3180693 3505840 3830987 4156134 4481281 4806428 5131575 5456722 5781869 6107016 12539179 18971342 25403505 31835668 38267831 44699994 95832151 146964308 345060773 1923400330 84284553747

as before, GNU MP's limit is 2^37 bits = 137,438,953,472.  the last entry, 84,284,553,747, is the largest bit chunk size that GNU MP can handle.  the next term is about 166 Gbit.

13 bits to 4 decimal digits 0000-8191, plus a Verhoeff check digit, seems elegant.  several ways to avoid a leading zero, if desired: add 1000, add 1024, remap 0000-0999 to 9000-9999, or remap 0000-0999 to 8192-9191.
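
here is a sketch of the 13-bit grouping in gp, using the "add 1024" variant and omitting the check digit (function names are mine):

encode13(bits) = my(v=fromdigits(bits,2)+1024) ; vector(4, k, (v \ 10^(4-k)) % 10)
decode13(digs) = my(v=fromdigits(digs,10)-1024) ; vector(13, k, (v \ 2^(13-k)) % 2)

the Verhoeff check digit would then be computed over the 4 digits and appended.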

below is a PARI/GP function that prints the sequence of increasingly efficient chunk sizes for any pair of input (source) and output (destination) alphabet sizes.  chunksizes(256,10) and chunksizes(2,10) generate the two sequences above as the source= values.

chunksizes(source,destination)=my(c,e,f,savelast) ; c=contfrac(log(source)/log(destination)) ; for(i=1 , matsize(c)[2] , e=vecextract(c,vector(i,u,u)) ; if(i%2 , e[i]=e[i]+1 ; f=contfracpnqn(e) ; print("source=",f[2,1]," destination=",f[1,1]," ",e) , savelast=e[i] ; for(j=2 , savelast , e[i]=j ; f=contfracpnqn(e) ; print("source=",f[2,1]," destination=",f[1,1]," ",e))))

for example, chunksizes(10,2) will include "source=3 destination=10", packing 3 decimal digits 000-999 (10^3 values) into 10 bits 0-1023 (2^10 values).  however, it's rare to have large decimal input data of uniformly distributed decimal digits.  the only example i know of is efficient storage of large decimal computations of irrational numbers.  if input decimal data is not uniformly randomly distributed, use conventional data compression (e.g., gzip, bzip2, lzip) to pack it into bytes more efficiently.

previously, bytes to letters (chunksizes(256,26)).

chunks might be useful for arithmetic coding.
