Given a corpus, construct a trie annotated with counts or probabilities. Implicitly or explicitly include the probability of the end of a word. What features does this trie have? In particular, what does the probability distribution over a node's children look like?
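For concreteness, here is a minimal sketch (not a definitive implementation) of such a trie, with a count at every node and an explicit end-of-word symbol; the tiny word list at the bottom is just a stand-in for a real corpus.

```python
# Sketch: build a counted trie from a word list, with an explicit
# end-of-word symbol "$", and inspect the child distribution at a node.
from collections import defaultdict

END = "$"  # explicit end-of-word symbol

def make_node():
    return {"count": 0, "children": defaultdict(make_node)}

def build_trie(words):
    root = make_node()
    for word in words:
        node = root
        node["count"] += 1
        for ch in list(word) + [END]:
            node = node["children"][ch]
            node["count"] += 1
    return root

def child_distribution(node):
    """Probability distribution over a node's children (including END)."""
    total = sum(child["count"] for child in node["children"].values())
    return {ch: child["count"] / total for ch, child in node["children"].items()}

corpus = ["the", "then", "there", "this", "that", "those"]
trie = build_trie(corpus)
print(child_distribution(trie))                                  # distribution at the root
print(child_distribution(trie["children"]["t"]["children"]["h"]))  # distribution after "th"
```

Even on this toy corpus the child distributions come out skewed: after "th", the letter "e" dominates while "i", "a", and "o" split the rest.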
Synthesize a similar trie for an artificial language, seeking to emulate the "shape" of words of a natural language, for example, the distribution of word lengths.
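A first step might be to measure the target shape. A small sketch, again with a toy word list standing in for a real corpus:

```python
# Sketch: the empirical word-length distribution of a corpus, which the
# synthetic trie could be tuned to match.
from collections import Counter

def length_distribution(words):
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

print(length_distribution(["the", "then", "there", "this", "that", "those"]))
# lengths 3, 4, 5 with probabilities 1/6, 1/2, 1/3
```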
The original goal was to create a red herring: text that looks like natural language encoded by a substitution cipher, with structures resembling natural language, but which is actually just randomness. (Kind of the opposite of cryptography: cryptography produces things that look random but aren't; this is things that don't look random but are.)
Common roots and suffixes are probably too difficult for a trie; we would need a directed graph, with words generated as a Markov chain. Maybe cheat and assume that all roots have been compressed down to one character in the artificial language.
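A hypothetical sketch of that cheat: a tiny Markov chain over morpheme states, where roots are compressed to a single character and the roots, suffixes, and transition probabilities are all invented for illustration.

```python
# Sketch: word generation as a Markov chain over morpheme states
# (root -> suffix -> suffix -> ... -> end), with made-up parameters.
import random

ROOTS = "bdgkmprstz"
SUFFIXES = ["an", "ek", "or", "ul"]

def generate_word():
    word = random.choice(ROOTS)   # every word starts with a one-character root
    state = "root"
    while True:
        if state == "root":
            # from a root: 70% chance of taking a suffix, 30% chance of ending
            state = "suffix" if random.random() < 0.7 else "end"
        elif state == "suffix":
            word += random.choice(SUFFIXES)
            # from a suffix: 30% chance of another suffix, 70% chance of ending
            state = "suffix" if random.random() < 0.3 else "end"
        else:
            return word

print([generate_word() for _ in range(8)])
```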
Easiest might be a trie in which the probability of the end-of-word symbol increases with depth. The non-end-of-word children of a node have some skewed probability distribution. Generate this trie on the fly while generating random text, expanding (and remembering) nodes only as needed.
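A sketch of this lazy generation, under assumed parameters: the depth schedule for the end-of-word probability, the Zipf-ish skew over children, and the size of each node's letter subset are all chosen arbitrarily.

```python
# Sketch: a trie expanded lazily while generating random words.
# P(end of word) rises with depth; each node's children follow a skewed
# (roughly Zipf-like) distribution over a node-specific subset of letters.
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

class LazyTrieNode:
    def __init__(self, depth, rng):
        self.depth = depth
        self.rng = rng
        # assumption: end-of-word probability grows linearly, reaching 1 by depth 10
        self.p_end = min(1.0, 0.05 + 0.1 * depth)
        # skewed weights (1, 1/2, 1/3, ...) over a random subset of letters
        self.letters = rng.sample(ALPHABET, rng.randint(3, 8))
        self.weights = [1.0 / (i + 1) for i in range(len(self.letters))]
        self.children = {}   # only the expanded (remembered) children

    def child(self, letter):
        # expand the child node on first visit, remember it afterwards
        if letter not in self.children:
            self.children[letter] = LazyTrieNode(self.depth + 1, self.rng)
        return self.children[letter]

def generate_word(root, rng):
    node, word = root, ""
    while True:
        if word and rng.random() < node.p_end:
            return word
        letter = rng.choices(node.letters, weights=node.weights)[0]
        word += letter
        node = node.child(letter)

rng = random.Random(1)
root = LazyTrieNode(depth=0, rng=rng)
print(" ".join(generate_word(root, rng) for _ in range(12)))
```

Because expanded nodes are remembered, common prefixes recur across the generated text, which is what gives it a language-like feel rather than pure letter salad.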