Tuesday, April 20, 2010

[iyaxetsm] letter classes

28 different Huffman trees are a bit much, so let's restrict the choice to (say) two trees, depending on the previous letter.

Partition the possible previous letters (i.e., the alphabet) into two sets.  Create an independent Huffman tree for each set.  The optimization search over partitions (power set of subsets) can be done exhaustively (2^28).

Sometimes, significant gains in compression can be made by looking back two letters instead of just one, still maintaining the same number of Huffman trees.  Which letters?

This experiment will reveal structure within words about how which letters carry information.

Arithmetic coding is just fine for this, probably simpler to do calculations of compression rate.

No comments :