## Monday, December 21, 2015

### [igptrdec] Encoding arbitrary precision integers in binary

Unary can encode arbitrarily large non-negative integers.  Adding a trailing 0 allows encoding a sequence of numbers by concatenating their representations.  Let's call the following encoding Encoding 1.

xEncoding 1Bit length
001
1102
21103
311104
4111105
51111106
611111107
7111111108
81111111109
9111111111010

Unary takes up a lot of space; binary is better.  For arbitrary sized integers (and especially, to encode a collection of integers), we need a way of encoding the length of the binary portion.  This can be done by prefixing a header which denotes the length of the following binary portion, with the header encoded in Encoding 1.  The header is highlighted in bold below. We form all possible bit strings of this format and line them up with the natural numbers to yield a bijection.  Let's call this combined scheme Encoding 2.

xEncoding 2Bit length
001
11003
21013
3110005
4110015
5110105
6110115
711100007
811100017
911100107
1011100117
1111101007
1211101017
1311101107
1411101117
151111000009
301111011119
311111100000011
63111111000000013
12711111110000000015
2551111111100000000017

Because we reset the binary portion back to 000...0 for each new length, converting back and forth from a number x to its encoding is a little bit tricky.

Encoding 2 still takes up quite a bit of space, with the header consuming over half the length of an encoded number.  We can improve this by recursion: encode the header (highlighted in bold below) in Encoding 2 (instead of Encoding 1), then follow it with a binary portion as before.  Let's call this combined scheme Encoding 3.

xEncoding 3Bit length
001
110004
210014
3101005
4101015
5101105
6101115
7110000008
8110000018
9110000108
10110000118
11110001008
12110001018
13110001108
14110001118
151100100009
31110100000010
631101100000011
1271110000000000014
25511100010000000015

We can recurse until infinity, with Encoding N having its header encoded in Encoding N-1.  There is always an integer large enough such that Encoding N saves space over N-1.  (Practically, there is no benefit beyond Encoding 4.). Translating back and forth efficiently from a (large) number and its encoding continues to be a little tricky.

This scheme is similar to encoding the length of the length of a number in decimal.

Next, we can recurse "beyond infinity".  For each bit string in Encoding N, prefix it with N encoded in Encoding 1.  This prefix is highlighted in bold below, demonstrating prefixing on Encoding 1, 2, and 3.  (We had arbitrarily chosen for the Encoding indices to start from 1 not 0 to preserve the elegance of encoding 0 as just 0.)

[100 1010 10110 101110 1011110 10111110 ...]
[1100 110100 110101 11011000 11011001 11011010 ...]
[11100 11101000 11101001 111010100 111010101 111010110 ...]
...

Sort the elements of this infinite list of infinite lists of bit strings first by bit length then lexicographically.  Call this Encoding Omega (inspired by ordinal numbers, where omega is the first number beyond all the finite numbers).

xEncoding OmegaBit length
001
11003
210104
311004
4101105
5111005
61011106
71101006
81101016
91111006
1010111107
1111111007
12101111108
13110110008
14110110018
15110110108
16110110118
17111010008
18111010018
19111111008
201011111109
211110101009
221110101019
231110101109
241110101119
251111111009
26101111111010
27110111000010
28110111000110
29110111001010
30110111001110
31110111010010
32110111010110
33110111011010
34110111011110
35111101000010
36111101000110
37111111110010
381011111111011
391111010010011
401111010010111
411111010011011
421111010011111
431111111110011
4410111111111012
4511011110000012
4611011110000112
4711011110001012
4811011110001112
4911011110010012
5011011110010112
5111011110011012
5211011110011112
5311011110100012
5411011110100112
5511011110101012
5611011110101112
5711011110110012
5811011110110112
5911011110111012
6011011110111112
6111101100000012
6211101100000112
6311101100001012
6411101100001112
6511101100010012
6611101100010112
6711101100011012
6811101100011112
6911111010000012
7011111010000112
7111111111110012
72101111111111013
73111011001000013
74111011001000113
75111011001001013
76111011001001113
77111011001010013
78111011001010113
79111011001011013
80111011001011113
81111011001100013
82111011001100113
83111011001101013
84111011001101113
85111011001110013
86111011001110113
87111011001111013
88111011001111113
89111101010000013
90111101010000113
91111101010001013
92111101010001113
93111101010010013
94111101010010113
95111101010011013
96111101010011113
97111110100010013
98111110100010113
99111110100011013
100111110100011113
101111111111110013
1021011111111111014
1031101111100000014
1041101111100000114
1841111110100000114
1851111111111110014
18610111111111111015
18711101101100000015
29411111101000011115
29511111111111110015
296101111111111111016
4441011111111111111017
49010111111111111111018
830101111111111111111019
11321011111111111111111020
211210111111111111111111021
3566101111111111111111111022
69781011111111111111111111023
1278410111111111111111111111024
25412101111111111111111111111025
486261011111111111111111111111026
6432610111111111111111111111111027
91636101111111111111111111111111028
1503441011111111111111111111111111029
25957410111111111111111111111111111030
518986101111111111111111111111111111031
9886641011111111111111111111111111111032
197716410111111111111111111111111111111033
3888634101111111111111111111111111111111034
77771021011111111111111111111111111111111035
1545574010111111111111111111111111111111111036

Efficiently translating between a number x and its Encoding Omega seems so difficult that Encoding Omega is probably practically useless.

Going beyond Encoding Omega is possible, choosing a different encoding (instead of Encoding 1) to express the header.  Encoding Omega Plus One uses Encoding 2 as the header, and Encoding Omega Plus Omega, i.e., Encoding Two-Omega, uses Encoding Omega for the header.

Here is some Haskell code to (inefficiently) compute these encodings up to and including Encoding Omega.  Encoding Omega required merge sorting an infinite list of infinite lists, which would be normally be impossible (when scanning the heads, you never know if you have found the smallest item).  However, all bit strings originating from Encoding N (and above) have length at least N+2, so finding all bit strings of a given length can be done in finite time.