consider packing a bitstring of known but arbitrarily long length into a packet of bytes, encoding into the packet where its end is. (we imagine there are more bytes afterwards encoding other objects.)
- let NUMBYTES be a state variable in the encoder. initialize NUMBYTES to 1.
- consider a chunk of NUMBYTES bytes.
- let U be the upper 2 bits of the chunk. the remainder of the chunk is filled with payload.
- possibilities for U:
- 00 = this is final chunk, exit loop.
- 01 = NUMBYTES remains unchanged.
- 10 = double NUMBYTES.
- 11 = quadruple NUMBYTES.
- goto 2.
note: the scheme cannot handle all possible input bit lengths, for example, 1 bit. the user must supply a padding mechanism.
values 0 through 63 (6 bits) are encoded in 1 byte. this is the smallest possible message, setting the upper two bits set to zero to signify a message of only 1 byte. if the initial value of NUMBYTES were larger (this would require modifying the protocol), payload of the smallest message could be larger.
there's a little bit of a challenge choosing how to grow NUMBYTES to encode a payload of known length into the minimum number of bytes. note that NUMBYTES never decreases, so the "leftover" smaller chunks should be at the beginning, vaguely like little-endian. the greedy algorithm is not optimal; this is a job for dynamic programming. below, we have calculated some optimal packing sequences.
are the possible growth factors of doubling and quadrupling optimal? not sure how to quantify this. if we made more aggressive growth possible, there would be less overhead for larger input data but more for smaller. instead of quadrupling, squaring NUMBYTES is an intriguing alternative, not explored here.
for doubling and quadrupling, for a given number of bytes, the optional packing sequence seems to be unique. for other growth factors, this is not true. for example, for growth factors 2 and 3, there are two optimal ways to pack 5 bytes: [1,1,3] and [1,2,2]. growth factors of N and N^2 for any N seem to always have unique optimal packings, and the converse: only N and N^2 seem to have unique optimal packings.
another solution to the same problem: consistent overhead byte stuffing (COBS). unlike COBS, the overhead of this method asymptotically approaches 0.
is this a solution in search of a problem? if we know the size beforehand, there are more efficient ways of packing bits. if we don't know the size, we can't compute the optimal growth steps.
what if we are given a prior probability distribution over possible lengths? the probabilities must be updated as data arrives online.
below are the maximum number of payload bits that can fit in a given number of bytes, with chunk sizes.
1 byte [1] 6 bits
2 bytes [1,1] 12 bits
3 bytes [1,2] 20 bits
4 bytes [1,1,2] 26 bits
5 bytes [1,4] 36 bits
6 bytes [1,1,4] 42 bits
7 bytes [1,2,4] 50 bits
8 bytes [1,1,2,4] 56 bits
9 bytes [1,4,4] 66 bits
10 bytes [1,1,4,4] 72 bits
11 bytes [1,2,8] 82 bits
12 bytes [1,1,2,8] 88 bits
13 bytes [1,4,8] 98 bits
14 bytes [1,1,4,8] 104 bits
15 bytes [1,2,4,8] 112 bits
16 bytes [1,1,2,4,8] 118 bits
17 bytes [1,4,4,8] 128 bits
18 bytes [1,1,4,4,8] 134 bits
19 bytes [1,2,8,8] 144 bits
20 bytes [1,1,2,8,8] 150 bits
21 bytes [1,4,16] 162 bits
22 bytes [1,1,4,16] 168 bits
23 bytes [1,2,4,16] 176 bits
24 bytes [1,1,2,4,16] 182 bits
25 bytes [1,4,4,16] 192 bits
26 bytes [1,1,4,4,16] 198 bits
27 bytes [1,2,8,16] 208 bits
28 bytes [1,1,2,8,16] 214 bits
29 bytes [1,4,8,16] 224 bits
30 bytes [1,1,4,8,16] 230 bits
31 bytes [1,2,4,8,16] 238 bits
32 bytes [1,1,2,4,8,16] 244 bits
33 bytes [1,4,4,8,16] 254 bits
34 bytes [1,1,4,4,8,16] 260 bits
35 bytes [1,2,8,8,16] 270 bits
36 bytes [1,1,2,8,8,16] 276 bits
37 bytes [1,4,16,16] 288 bits
38 bytes [1,1,4,16,16] 294 bits
39 bytes [1,2,4,16,16] 302 bits
40 bytes [1,1,2,4,16,16] 308 bits
41 bytes [1,4,4,16,16] 318 bits
42 bytes [1,1,4,4,16,16] 324 bits
43 bytes [1,2,8,32] 336 bits
44 bytes [1,1,2,8,32] 342 bits
45 bytes [1,4,8,32] 352 bits
46 bytes [1,1,4,8,32] 358 bits
47 bytes [1,2,4,8,32] 366 bits
48 bytes [1,1,2,4,8,32] 372 bits
49 bytes [1,4,4,8,32] 382 bits
50 bytes [1,1,4,4,8,32] 388 bits
51 bytes [1,2,8,8,32] 398 bits
52 bytes [1,1,2,8,8,32] 404 bits
53 bytes [1,4,16,32] 416 bits
54 bytes [1,1,4,16,32] 422 bits
55 bytes [1,2,4,16,32] 430 bits
56 bytes [1,1,2,4,16,32] 436 bits
57 bytes [1,4,4,16,32] 446 bits
58 bytes [1,1,4,4,16,32] 452 bits
59 bytes [1,2,8,16,32] 462 bits
60 bytes [1,1,2,8,16,32] 468 bits
61 bytes [1,4,8,16,32] 478 bits
62 bytes [1,1,4,8,16,32] 484 bits
63 bytes [1,2,4,8,16,32] 492 bits
64 bytes [1,1,2,4,8,16,32] 498 bits
65 bytes [1,4,4,8,16,32] 508 bits
66 bytes [1,1,4,4,8,16,32] 514 bits
67 bytes [1,2,8,8,16,32] 524 bits
68 bytes [1,1,2,8,8,16,32] 530 bits
69 bytes [1,4,16,16,32] 542 bits
70 bytes [1,1,4,16,16,32] 548 bits
71 bytes [1,2,4,16,16,32] 556 bits
72 bytes [1,1,2,4,16,16,32] 562 bits
73 bytes [1,4,4,16,16,32] 572 bits
74 bytes [1,1,4,4,16,16,32] 578 bits
75 bytes [1,2,8,32,32] 590 bits
76 bytes [1,1,2,8,32,32] 596 bits
77 bytes [1,4,8,32,32] 606 bits
78 bytes [1,1,4,8,32,32] 612 bits
79 bytes [1,2,4,8,32,32] 620 bits
80 bytes [1,1,2,4,8,32,32] 626 bits
81 bytes [1,4,4,8,32,32] 636 bits
82 bytes [1,1,4,4,8,32,32] 642 bits
83 bytes [1,2,8,8,32,32] 652 bits
84 bytes [1,1,2,8,8,32,32] 658 bits
85 bytes [1,4,16,64] 672 bits
86 bytes [1,1,4,16,64] 678 bits
87 bytes [1,2,4,16,64] 686 bits
88 bytes [1,1,2,4,16,64] 692 bits
89 bytes [1,4,4,16,64] 702 bits
90 bytes [1,1,4,4,16,64] 708 bits
91 bytes [1,2,8,16,64] 718 bits
92 bytes [1,1,2,8,16,64] 724 bits
93 bytes [1,4,8,16,64] 734 bits
94 bytes [1,1,4,8,16,64] 740 bits
95 bytes [1,2,4,8,16,64] 748 bits
96 bytes [1,1,2,4,8,16,64] 754 bits
97 bytes [1,4,4,8,16,64] 764 bits
98 bytes [1,1,4,4,8,16,64] 770 bits
99 bytes [1,2,8,8,16,64] 780 bits
100 bytes [1,1,2,8,8,16,64] 786 bits
101 bytes [1,4,16,16,64] 798 bits
102 bytes [1,1,4,16,16,64] 804 bits
103 bytes [1,2,4,16,16,64] 812 bits
104 bytes [1,1,2,4,16,16,64] 818 bits
105 bytes [1,4,4,16,16,64] 828 bits
106 bytes [1,1,4,4,16,16,64] 834 bits
107 bytes [1,2,8,32,64] 846 bits
108 bytes [1,1,2,8,32,64] 852 bits
109 bytes [1,4,8,32,64] 862 bits
110 bytes [1,1,4,8,32,64] 868 bits
111 bytes [1,2,4,8,32,64] 876 bits
112 bytes [1,1,2,4,8,32,64] 882 bits
113 bytes [1,4,4,8,32,64] 892 bits
114 bytes [1,1,4,4,8,32,64] 898 bits
115 bytes [1,2,8,8,32,64] 908 bits
116 bytes [1,1,2,8,8,32,64] 914 bits
117 bytes [1,4,16,32,64] 926 bits
118 bytes [1,1,4,16,32,64] 932 bits
119 bytes [1,2,4,16,32,64] 940 bits
120 bytes [1,1,2,4,16,32,64] 946 bits
121 bytes [1,4,4,16,32,64] 956 bits
122 bytes [1,1,4,4,16,32,64] 962 bits
123 bytes [1,2,8,16,32,64] 972 bits
124 bytes [1,1,2,8,16,32,64] 978 bits
125 bytes [1,4,8,16,32,64] 988 bits
126 bytes [1,1,4,8,16,32,64] 994 bits
127 bytes [1,2,4,8,16,32,64] 1002 bits
128 bytes [1,1,2,4,8,16,32,64] 1008 bits
129 bytes [1,4,4,8,16,32,64] 1018 bits
130 bytes [1,1,4,4,8,16,32,64] 1024 bits
future post (floohhig): source code
future work: draw these possibilities as a tree.
No comments :
Post a Comment