Sunday, February 20, 2022

[jeylehwy] byte stuffing with run length encoding

let ESC be the an escape character, a byte value specified in metadata.  metadata may specify any byte value to be the escape character.  choose a rarely occurring byte, or (because the protocol optionally offers run length compression) choose a byte value that has many repetitive occurrences.

ESC 0 signifies end of string.  ESC EOF, where EOF is not a character but the end of transmission, signifies a literal escape character.  or maybe it is an error, suggesting interrupted transmission.

ESC n where 1 <= n <= 124 indicates a run of n repetitions of the literal escape character.  ESC 1 therefore encodes a single instance of the escape character.

if 125 <= n <= 185, then the next n-123 bytes are interleaved with the literal escape character.  for example, ESC 125 99 98 decodes to ESC 99 ESC 98.

in those n-123 bytes after ESC n, the escape character has no special meaning.  if EOF occurs before n-123 bytes, the decoder should decode as much as possible and issue a warning or error.

if 186 <= n <= 225, then the next 2*(n-184) bytes have a literal escape character inserted every third byte.  for example, ESC 186 99 98 97 96 decodes to ESC 99 98 ESC 97 96.  three-byte chunks are common when encoding RGB colors; maybe you have a run of pixels one of whose color components is the escape character.

if 226 <= n <= 255, the next 3*(n-224) bytes have a literal escape character inserted every fourth byte.  for example, ESC 226 99 98 97 96 95 94 decodes to ESC 99 98 97 ESC 96 95 94.  four-byte chunks are common when encoding 32-bit values.

the design principle was to have a simple encoding system but avoid output from growing too much.  the maximum possible expansion for long strings happens when a literal escape character occurs every fifth byte of the input.  5 bytes become 6, or 20% expansion.  if you need to have less worst-case expansion, consider Consistent Overhead Byte Stuffing (COBS).  however, COBS does not do compression.

the thresholds were chosen so that the amount of lookahead needed to achieve maximum compression is constant -- 124 bytes -- across all interleaving possibilities.  (but one does not have try to compress efficiently: one can use a smaller lookahead buffer, or no lookahead at all, just encode all ESC in input at ESC 1.)  COBS requires a lookahead of 254 bytes.

No comments :