Saturday, February 11, 2012

[ntzbocwt] Encoding data into words and spaces

You have some data that you want to encode into a stream of letters and spaces (to make it sort of look like natural language). The main constraint is you cannot have two spaces in a row.

The solution is a variation of arithmetic coding.  Express the data as a decimal fraction.  Connect the letters and space into a state transition diagram: any state can transition to any state except space cannot transition to space.  What ever state you are at, multiply the fraction by the number of outgoing edges, and take the integer part.  For more fanciness, you can weight the edges, perhaps even by letter width or kerning.

Also would like to avoiding ending with a space.  (Avoid starting with a space by simply eliminating the transition to space from the start state.)  Ending elegantly is unsolved. Perhaps repeat the first character.  (Arithmetic coding itself has complications because it's an infinite decimal expansion.)

(The alternative is just to insert spaces randomly, or periodically, but you are losing the opportunity to encode data into spaces.)

No comments :