Sunday, April 17, 2016

[xomvdqgt] Words of random lengths

Given a corpus, construct a trie annotated with counts or probabilities.  Implicitly or explicitly include the probability of the end of a word at each node.  What features does this trie have, in particular in the probability distribution over the children of a node?
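A minimal sketch of such an annotated trie, for concreteness.  The names here (build_trie, child_probabilities, the "$" end-of-word marker) and the toy corpus are all made up for illustration, and the trie is stored as a flat dictionary keyed by prefix rather than as linked nodes, which is just one convenient representation.

```python
from collections import Counter

END = "$"  # hypothetical end-of-word marker; assumed not to occur in the corpus


def build_trie(words):
    """Map each prefix (a trie node) to a Counter of what follows it.

    A child is either the next letter or END, so the end-of-word
    count is stored explicitly alongside the letter counts.
    """
    trie = {}
    for word in words:
        for i in range(len(word) + 1):
            prefix = word[:i]
            nxt = word[i] if i < len(word) else END
            trie.setdefault(prefix, Counter())[nxt] += 1
    return trie


def child_probabilities(trie, prefix):
    """Turn the raw counts at one node into a probability distribution."""
    counts = trie[prefix]
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat then the cat ran".split()
trie = build_trie(corpus)
print(child_probabilities(trie, "the"))
```

With counts in this form, one could look at how the end-of-word probability varies with depth, or how skewed the child distribution is at each node.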

Synthesize a similar trie for an artificial language, seeking to emulate the "shape" of words of a natural language, for example, the distribution of word lengths.

The original goal was to create a red herring: text that looks like natural language encoded with a substitution cipher, having structures resembling natural language, but which is actually random.  (Kind of the opposite of cryptography: cryptography is things that look random but aren't; this is things that don't look random but are.)

Common roots and suffixes are probably too difficult for a trie; we would need a directed graph, with words generated as a Markov chain.  Maybe cheat and assume that all roots have been compressed down to one character in the artificial language.

Easiest might be a trie whose probability of the end-of-word symbol increases with depth.  The non-end-of-word children of a node have some skewed probability distribution.  Generate this trie on the fly while generating random text, expanding (and remembering) nodes only as needed.
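A rough sketch of that lazily expanded trie.  Everything specific here is an assumption chosen for illustration, not something measured from a real language: the 26-letter alphabet, an end-of-word probability that grows linearly with depth (end_step per level), and Zipf-like child weights 1/rank^skew assigned in a random letter order per node.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet


class LazyTrie:
    """Random trie whose nodes are created, and remembered, on first visit."""

    def __init__(self, seed=0, end_step=0.15, skew=1.0):
        self.rng = random.Random(seed)
        self.end_step = end_step  # assumed: end probability grows linearly
        self.skew = skew          # assumed: Zipf-like exponent for children
        self.nodes = {}           # prefix -> (letters, weights)

    def _node(self, prefix):
        # Expand a node only when first needed, then reuse it so the
        # same prefix always has the same child distribution.
        if prefix not in self.nodes:
            letters = list(ALPHABET)
            self.rng.shuffle(letters)  # each node favors different letters
            weights = [1.0 / (rank + 1) ** self.skew
                       for rank in range(len(letters))]
            self.nodes[prefix] = (letters, weights)
        return self.nodes[prefix]

    def p_end(self, depth):
        # Zero at the root (no empty words), capped at 1.
        return min(1.0, depth * self.end_step)

    def word(self):
        prefix = ""
        while True:
            if self.rng.random() < self.p_end(len(prefix)):
                return prefix
            letters, weights = self._node(prefix)
            prefix += self.rng.choices(letters, weights=weights)[0]


trie = LazyTrie(seed=1)
print(" ".join(trie.word() for _ in range(20)))
```

Because nodes are remembered, frequent prefixes recur across the generated text, which is what gives the output a vocabulary rather than independent random strings.  The linear end-of-word rule also caps word length (at 7 letters with the default end_step); a gentler curve would allow a longer tail.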
