Sunday, November 30, 2014

[goynoqan] Trie-to-trie compression

There are two alphabets, a Source Alphabet and a Destination Alphabet.  The alphabets do not need to be the same size.

There are two tries (prefix trees). The Source Trie uses the Source Alphabet; its root node must be full (have a child for every letter, so that every string can be encoded), but there are otherwise no constraints. The Destination Trie uses the Destination Alphabet.

Finally, there is a Mapping from nodes in the Source Trie (excepting the root) to leaves in the Destination Trie.  For simplicity, we assume this Mapping is one-to-one.

Given these Tries and Mapping, we can straightforwardly encode any string over the Source Alphabet by repeatedly taking the longest prefix of the remaining unencoded string that appears as a node in the Source Trie, and emitting the Destination string spelled by the path from the root of the Destination Trie to the corresponding leaf.  And we can losslessly and unambiguously decode any correctly encoded string back to the original: because codewords end at leaves, no codeword is a prefix of another.  (Note: because the Destination Trie need not be full, not every string over the Destination Alphabet is decodable.)
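For concreteness, here is a minimal Haskell sketch, illustrative only (not the implementation linked in the Update below).  It collapses the two Tries and the Mapping into a single table from Source prefixes (the strings spelled by non-root Source Trie nodes) to Destination codewords (the strings spelled by root-to-leaf paths of the Destination Trie); the toy table at the end is an invented example.

  import qualified Data.Map as Map
  import Data.List (isPrefixOf, maximumBy)
  import Data.Ord (comparing)

  -- The whole scheme as one table: Source prefix (non-root Source Trie
  -- node) -> Destination codeword (root-to-leaf path in the Destination
  -- Trie).  Codewords are prefix-free because they end at leaves.
  type Code = Map.Map String String

  -- Greedy encoder: repeatedly take the longest known prefix.  Root
  -- fullness guarantees at least a one-letter match at every step.
  encode :: Code -> String -> String
  encode code = go
    where
      go [] = []
      go s  = let p = maximumBy (comparing length)
                        [q | q <- Map.keys code, q `isPrefixOf` s]
              in code Map.! p ++ go (drop (length p) s)

  -- Decoder: prefix-freeness makes the next codeword unambiguous.
  decode :: Code -> String -> String
  decode code = go
    where
      inv = Map.fromList [(c, p) | (p, c) <- Map.toList code]
      go [] = []
      go s  = case [c | c <- Map.keys inv, c `isPrefixOf` s] of
                [c] -> inv Map.! c ++ go (drop (length c) s)
                _   -> error "not a valid encoding"

  -- Toy example: Source Alphabet "ab", Destination Alphabet "01";
  -- Source Trie nodes a, b, ab map to Destination leaves 10, 11, 0.
  -- encode example "abba" == "01110"; decode example "01110" == "abba"
  example :: Code
  example = Map.fromList [("a", "10"), ("b", "11"), ("ab", "0")]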

This encoding from Source to Destination strings can achieve data compression.  The grand challenge: given a Source corpus (a collection of strings, probably one word each) and a Destination Alphabet, construct the two Tries and the node-to-leaf Mapping that maximize compression of the corpus.

It would be nice if there were an efficient algorithm (like the one for constructing a Huffman tree) to construct these, but failing that, we might consider generic optimization algorithms like simulated annealing.  Efficiency is perhaps unnecessary if we need to run the algorithm only once for a given language, for example, one computation to determine the optimal encoding of all English words to digits.

Although we place no constraints on the size of the tries, I suspect optimization will yield small tries.

Here is a greedy algorithm; unknown if optimal:

Start with the simplest Source Trie: the root with each letter of the Source Alphabet as a child.

To evaluate a given Source Trie, assign each node a weight: the number of times it gets used in encoding the corpus.  Explode the trie into a disconnected collection of weighted nodes and compute their Huffman tree, constraining the branching factor to the size of the Destination Alphabet.  This constructs the Destination Trie (and the Mapping: each weighted node lands at a leaf).  The score of the Source Trie is the length of the encoding of the corpus using this Destination Trie.
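A sketch of the evaluation, under the assumption (as in the sketch above) that a Source Trie is represented by its set of non-root prefix strings.  The score is equivalently the count-weighted sum of codeword lengths.

  import qualified Data.Map as Map
  import Data.List (foldl', isPrefixOf, maximumBy)
  import Data.Ord (comparing)

  -- Times each non-root Source Trie node is used when the corpus is
  -- parsed greedily into longest known prefixes.
  useCounts :: [String] -> [String] -> Map.Map String Int
  useCounts prefixes corpus =
      foldl' (\m p -> Map.insertWith (+) p 1 m) Map.empty
             (concatMap parse corpus)
    where
      parse [] = []
      parse s  = let p = maximumBy (comparing length)
                           [q | q <- prefixes, q `isPrefixOf` s]
                 in p : parse (drop (length p) s)

  -- Total encoded corpus length: the weighted sum of codeword lengths.
  score :: Map.Map String Int -> Map.Map String Int -> Int
  score counts codeLens =
      sum [w * (codeLens Map.! p) | (p, w) <- Map.toList counts]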

(Incidentally, to generate an n-ary Huffman tree, first pad with zero-weight dummy nodes until the item count is congruent to 1 mod (n-1).  Then merge the n smallest each iteration, decreasing the count by n-1, so the final merge is always full.)
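A sketch of that construction, with a sorted list standing in for a priority queue and Nothing marking the zero-weight padding:

  import Data.List (insertBy, sortBy)
  import Data.Ord (comparing)

  data HTree a = Leaf Int a | Branch Int [HTree a]

  weight :: HTree a -> Int
  weight (Leaf w _)   = w
  weight (Branch w _) = w

  -- n-ary Huffman tree over weighted items.  Zero-weight dummy leaves
  -- pad the count to 1 mod (n-1) so every merge takes exactly n trees.
  huffman :: Int -> [(Int, a)] -> HTree (Maybe a)
  huffman n items = go (sortBy (comparing weight) (pads ++ leaves))
    where
      leaves = [Leaf w (Just x) | (w, x) <- items]
      pads   = replicate ((1 - length items) `mod` (n - 1)) (Leaf 0 Nothing)
      go [t] = t
      go ts  = let (smallest, rest) = splitAt n ts
                   merged = Branch (sum (map weight smallest)) smallest
               in go (insertBy (comparing weight) merged rest)

  -- Codeword (root-to-leaf path) for each real item; dummies dropped.
  codewords :: HTree (Maybe a) -> [(a, [Int])]
  codewords t = go t []
    where
      go (Leaf _ (Just x)) path = [(x, reverse path)]
      go (Leaf _ Nothing)  _    = []
      go (Branch _ ts)     path =
          concat [go t' (i : path) | (i, t') <- zip [0 ..] ts]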

Next, consider all the possible ways to add one more node to the Source Trie.  Evaluate the score of all possibilities, then choose the best one as the next Source Trie.  Repeat until the score does not improve.
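A sketch of the greedy loop; the scoring function scoreOf is an assumed parameter that would chain together the pieces above (use counts, n-ary Huffman, total encoded length):

  import Data.List (minimumBy)
  import Data.Ord (comparing)

  -- The Source Trie as its set of non-root prefix strings.  Adding a
  -- node means extending an existing node (or the root, "") by a letter.
  grow :: [Char] -> [String] -> [[String]]
  grow alphabet prefixes =
      [ new : prefixes
      | p <- "" : prefixes, c <- alphabet
      , let new = p ++ [c], new `notElem` prefixes ]

  -- Greedy hill climbing: take the best single-node addition until the
  -- score stops improving.
  greedy :: ([String] -> Int) -> [Char] -> [String] -> [String]
  greedy scoreOf alphabet = go
    where
      go trie
          | scoreOf best < scoreOf trie = go best
          | otherwise                   = trie
        where best = minimumBy (comparing scoreOf) (grow alphabet trie)

Starting from the simplest Source Trie: greedy scoreOf "ab" [[c] | c <- "ab"].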

A more elaborate algorithm could enumerate all the one- and two-node additions to the Source Trie in each step.

We may be able to get greater compression by not requiring the encoder to use the longest possible prefix at each step.  Instead, the encoder could parse the string into shorter prefixes whose codewords concatenate into a shorter Destination string.  The shortest encoding could be found by branch-and-bound search.
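For a single string, a right-to-left dynamic program over suffixes finds the same optimum that branch-and-bound would.  A sketch, reusing the prefix-to-codeword table representation from the first example:

  import qualified Data.Map as Map
  import Data.List (isPrefixOf, minimumBy)
  import Data.Ord (comparing)

  -- Shortest Destination string over all parses of s into known
  -- prefixes, or Nothing if no parse exists.  best !! i solves the
  -- suffix starting at i; laziness lets the table refer to itself at
  -- strictly larger indices.
  shortestEncoding :: Map.Map String String -> String -> Maybe String
  shortestEncoding code s = head best
    where
      n    = length s
      best = [bestAt i | i <- [0 .. n]]
      bestAt i
        | i == n    = Just ""
        | otherwise = shortest
            [ c ++ rest
            | (p, c) <- Map.toList code
            , p `isPrefixOf` drop i s
            , Just rest <- [best !! (i + length p)] ]
      shortest [] = Nothing
      shortest xs = Just (minimumBy (comparing length) xs)

For instance, with the deliberately lopsided (non-Huffman) table Map.fromList [("a","0"), ("b","10"), ("ab","1111")], the greedy encoder renders "ab" as "1111", but the optimal parse "a"+"b" gives "010".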

With the encoder working this way, finding the optimal Source and Destination Tries and Mapping seems like a formidable problem.  Here is a modification of the original algorithm.

Each Source Trie update step becomes iterative.  Having constructed the Huffman Destination Trie, evaluate the corpus using the more sophisticated encoder.  Note the new use counts of the nodes of the Source Trie, compute a new Huffman Destination Trie from them, and repeat until the counts stabilize.
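One possible shape for that iteration, as a sketch: the two steps are passed in as assumed parameters (they would be built from the pieces above), and a step cap guards against the possibility that the alternation never settles, since nothing obviously guarantees convergence.

  import Data.Map (Map)

  -- Alternate: codeword lengths -> use counts (re-parse the corpus with
  -- the optimal encoder) -> new Huffman lengths, until a fixed point or
  -- the step cap runs out.
  refine :: Int
         -> (Map String Int -> Map String Int)  -- countsOf: lengths -> use counts
         -> (Map String Int -> Map String Int)  -- lengthsOf: counts -> code lengths
         -> Map String Int -> Map String Int
  refine 0 _ _ lens = lens
  refine k countsOf lengthsOf lens
      | lens' == lens = lens
      | otherwise     = refine (k - 1) countsOf lengthsOf lens'
    where lens' = lengthsOf (countsOf lens)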

From the standpoint of data compression, many Destination Tries are isomorphic, yielding encodings that are simply transliterations of each other.  So, if the Source and Destination Alphabets are similar (perhaps one is a subset of the other), we can choose, among the maximally compressing Destination Tries, the one whose encodings best preserve the appearance of the Source text.

The assumption that isomorphic Destination Tries yield the same compression no longer holds if different Destination Alphabet letters cost different amounts to transmit.

Previously, ngram frequencies.

Update: implementation of the simplest algorithm in Haskell. The simplest algorithm is definitely not optimal.
