Ken's blog: [ljdyledo] Sort of UTF-6

We present an encoding of Unicode code points to printable ASCII characters that keeps English text vaguely readable. It accomplishes a similar purpose as Quoted-Printable. This design is modeled after UTF-8. We use 6-bit bytes instead of UTF-8's 8-bit bytes. We also do not encode the number of continuation bytes in the first byte; instead, the most significant bit (high bit) signifies whether a byte is followed by further continuation bytes. The last byte of a sequence does not have its high bit set, signifying it is the last byte of a sequence representing one Unicode character. The binary bits of the Unicode code point numbers of various magnitudes are encoded big-endian into the x's as follows:

0xxxxx
1xxxxx 0xxxxx
1xxxxx 1xxxxx 0xxxxx
1xxxxx 1xxxxx 1xxxxx 0xxxxx
...

This encoding can encode values of arbitrary size, in contrast to the 36-bit limit of UTF-8 extended in the obvious way. But this is a moot point (for now) as Unicode was originally limited to 31 bits, and later (including currently) limited further to 21 bits.

Here, in order, are the 64 printing characters used to encode a 6-bit byte. It is modeled after ASCII, but some common punctuation namely ;,-.? has been moved so that it remains unchanged when encoded. (The moved punctuation was moved only up or down a column of the 16-wide ASCII table, preserving low-order bits out of tradition). The second row, beginning with @, are characters that will only be used in multibyte sequences. The first row, beginning with space, represents characters which will be unchanged when encoded but can also occur as the last byte of a multibyte sequence.

abcdefghijklmnopqrstuvwxyz;,-.? @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

The following is how Unicode code points 0 through 127 are encoded. Here is a 16-wide ASCII table to see the corresponding original characters.

B-

;

A-

C-

The fact that some multibyte sequences are terminated by space (e.g., code points 48 and 64, corresponding to the digit zero and the at-sign respectively) is a little unfortunate.

Here is some more text containing higher Unicode code points, followed by its vaguely readable encoding:

lowercase Capitalized camelCase UPPERCASE numbers 0123456789 common punctuation ;,-.? at-sign @ bob@example.com http://example.com exclamation ! (parentheses) time 12:01 tilde ~ ‘single curly quotes’ “double curly quotes” yen ¥ euro € en–dash em—dash ellipsis … y-umlaut ÿ eng ŋ tent emoji ⛺ fire kanji 火 canoe emoji 🛶 newline

lowercase Acapitalized camelAcase AuApApAeArAcAaAsAe numbers C CaCbCcCdCeCfCgChCi common punctuation ;,-.? at-sign A bobA example.com httpCjC?C?example.com exclamation Cq CxparenthesesCy time CaCbCjC Ca tilde C. H@xsingle curly quotesH@y H@,double curly quotesH@- yen Ee euro HEl enH@sdash emH@tdash ellipsis HAf y-umlaut G? eng Jk tent emoji IWz fire kanji \Ck canoe emoji C]Wv newlineBj

We implemented an encoder and decoder in Haskell (after first trying in Perl but finding it to be the wrong tool). Some code excerpts:

We represented bytes as lists of Bool, sometimes little-endian sometimes big-endian depending on where we were in the conversion. This is highly space-inefficient, but does allow some elegant code. Given a list of big-endian bytes, splitting off the initial bytes that all have their high (first) bit set is elegantly span head. This is used in decoding.

type Bstring=[Bool]; myspan :: [Bstring] -> ([Bstring],[Bstring]); myspan = span head; -- takeWhile (\l -> head l == True)

Conversion to binary is an unfold. Output is little-endian.

to_binary :: Integer -> [Bool]; to_binary = unfoldr (\n -> if n==0 then Nothing else Just (case divMod n 2 of {(q,r) -> (r==1,q)}));

Conversion from big-endian binary to an integer is a fold.

from_binary :: [Bool] -> Integer; from_binary = foldl' (\accum b -> 2*accum+(if b then 1 else 0)) 0;

A few times we relied on the fact that mod x y with x negative and y positive yields a positive result. This is in contrast to rem x y which would return a negative result.

Ken's blog

Monday, April 16, 2018

[ljdyledo] Sort of UTF-6

2 comments :