Ken's blog: [igptrdec] Encoding arbitrary precision integers in binary

Unary can encode arbitrarily large non-negative integers. Adding a trailing 0 allows encoding a sequence of numbers by concatenating their representations. Let's call the following encoding Encoding 1.

x	Encoding 1	Bit length
0	0	1
1	10	2
2	110	3
3	1110	4
4	11110	5
5	111110	6
6	1111110	7
7	11111110	8
8	111111110	9
9	1111111110	10

Unary takes up a lot of space; binary is better. For arbitrary sized integers (and especially, to encode a collection of integers), we need a way of encoding the length of the binary portion. This can be done by prefixing a header which denotes the length of the following binary portion, with the header encoded in Encoding 1. The header is highlighted in bold below. We form all possible bit strings of this format and line them up with the natural numbers to yield a bijection. Let's call this combined scheme Encoding 2.

x	Encoding 2	Bit length
0	0	1
1	100	3
2	101	3
3	11000	5
4	11001	5
5	11010	5
6	11011	5
7	1110000	7
8	1110001	7
9	1110010	7
10	1110011	7
11	1110100	7
12	1110101	7
13	1110110	7
14	1110111	7
15	111100000	9
30	111101111	9
31	11111000000	11
63	1111110000000	13
127	111111100000000	15
255	11111111000000000	17

Because we reset the binary portion back to 000...0 for each new length, converting back and forth from a number x to its encoding is a little bit tricky.

Encoding 2 still takes up quite a bit of space, with the header consuming over half the length of an encoded number. We can improve this by recursion: encode the header (highlighted in bold below) in Encoding 2 (instead of Encoding 1), then follow it with a binary portion as before. Let's call this combined scheme Encoding 3.

x	Encoding 3	Bit length
0	0	1
1	1000	4
2	1001	4
3	10100	5
4	10101	5
5	10110	5
6	10111	5
7	11000000	8
8	11000001	8
9	11000010	8
10	11000011	8
11	11000100	8
12	11000101	8
13	11000110	8
14	11000111	8
15	110010000	9
31	1101000000	10
63	11011000000	11
127	11100000000000	14
255	111000100000000	15

We can recurse until infinity, with Encoding N having its header encoded in Encoding N-1. There is always an integer large enough such that Encoding N saves space over N-1. (Practically, there is no benefit beyond Encoding 4.). Translating back and forth efficiently from a (large) number and its encoding continues to be a little tricky.

This scheme is similar to encoding the length of the length of a number in decimal.

Next, we can recurse "beyond infinity". For each bit string in Encoding N, prefix it with N encoded in Encoding 1. This prefix is highlighted in bold below, demonstrating prefixing on Encoding 1, 2, and 3. (We had arbitrarily chosen for the Encoding indices to start from 1 not 0 to preserve the elegance of encoding 0 as just 0.)

[100 1010 10110 101110 1011110 10111110 ...]
[1100 110100 110101 11011000 11011001 11011010 ...]
[11100 11101000 11101001 111010100 111010101 111010110 ...]
...

Sort the elements of this infinite list of infinite lists of bit strings first by bit length then lexicographically. Call this Encoding Omega (inspired by ordinal numbers, where omega is the first number beyond all the finite numbers).

x	Encoding Omega	Bit length
0	0	1
1	100	3
2	1010	4
3	1100	4
4	10110	5
5	11100	5
6	101110	6
7	110100	6
8	110101	6
9	111100	6
10	1011110	7
11	1111100	7
12	10111110	8
13	11011000	8
14	11011001	8
15	11011010	8
16	11011011	8
17	11101000	8
18	11101001	8
19	11111100	8
20	101111110	9
21	111010100	9
22	111010101	9
23	111010110	9
24	111010111	9
25	111111100	9
26	1011111110	10
27	1101110000	10
28	1101110001	10
29	1101110010	10
30	1101110011	10
31	1101110100	10
32	1101110101	10
33	1101110110	10
34	1101110111	10
35	1111010000	10
36	1111010001	10
37	1111111100	10
38	10111111110	11
39	11110100100	11
40	11110100101	11
41	11110100110	11
42	11110100111	11
43	11111111100	11
44	101111111110	12
45	110111100000	12
46	110111100001	12
47	110111100010	12
48	110111100011	12
49	110111100100	12
50	110111100101	12
51	110111100110	12
52	110111100111	12
53	110111101000	12
54	110111101001	12
55	110111101010	12
56	110111101011	12
57	110111101100	12
58	110111101101	12
59	110111101110	12
60	110111101111	12
61	111011000000	12
62	111011000001	12
63	111011000010	12
64	111011000011	12
65	111011000100	12
66	111011000101	12
67	111011000110	12
68	111011000111	12
69	111110100000	12
70	111110100001	12
71	111111111100	12
72	1011111111110	13
73	1110110010000	13
74	1110110010001	13
75	1110110010010	13
76	1110110010011	13
77	1110110010100	13
78	1110110010101	13
79	1110110010110	13
80	1110110010111	13
81	1110110011000	13
82	1110110011001	13
83	1110110011010	13
84	1110110011011	13
85	1110110011100	13
86	1110110011101	13
87	1110110011110	13
88	1110110011111	13
89	1111010100000	13
90	1111010100001	13
91	1111010100010	13
92	1111010100011	13
93	1111010100100	13
94	1111010100101	13
95	1111010100110	13
96	1111010100111	13
97	1111101000100	13
98	1111101000101	13
99	1111101000110	13
100	1111101000111	13
101	1111111111100	13
102	10111111111110	14
103	11011111000000	14
104	11011111000001	14
184	11111101000001	14
185	11111111111100	14
186	101111111111110	15
187	111011011000000	15
294	111111010000111	15
295	111111111111100	15
296	1011111111111110	16
444	10111111111111110	17
490	101111111111111110	18
830	1011111111111111110	19
1132	10111111111111111110	20
2112	101111111111111111110	21
3566	1011111111111111111110	22
6978	10111111111111111111110	23
12784	101111111111111111111110	24
25412	1011111111111111111111110	25
48626	10111111111111111111111110	26
64326	101111111111111111111111110	27
91636	1011111111111111111111111110	28
150344	10111111111111111111111111110	29
259574	101111111111111111111111111110	30
518986	1011111111111111111111111111110	31
988664	10111111111111111111111111111110	32
1977164	101111111111111111111111111111110	33
3888634	1011111111111111111111111111111110	34
7777102	10111111111111111111111111111111110	35
15455740	101111111111111111111111111111111110	36

Efficiently translating between a number x and its Encoding Omega seems so difficult that Encoding Omega is probably practically useless.

Going beyond Encoding Omega is possible, choosing a different encoding (instead of Encoding 1) to express the header. Encoding Omega Plus One uses Encoding 2 as the header, and Encoding Omega Plus Omega, i.e., Encoding Two-Omega, uses Encoding Omega for the header.

Here is some Haskell code to (inefficiently) compute these encodings up to and including Encoding Omega. Encoding Omega required merge sorting an infinite list of infinite lists, which would be normally be impossible (when scanning the heads, you never know if you have found the smallest item). However, all bit strings originating from Encoding N (and above) have length at least N+2, so finding all bit strings of a given length can be done in finite time.

Ken's blog

Monday, December 21, 2015

[igptrdec] Encoding arbitrary precision integers in binary

No comments :