To (hopefully) accelerate the parsing of text, prefix each string (probably each line) with a decimal number giving its length in bytes (Unicode might make this tricky). However, that length can itself be anywhere from 1 to about 16 digits long, so prefix the length with the length of the length. Finally, the length of the length can be 1 or 2 digits long, so add a one-digit prefix giving the length of the length of the length. In the examples below, commas separate the three numbers and a colon separates them from the string. For redundancy, each string should be followed by a string terminator, i.e., a newline.
Write a program to do this; this should be easy.
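Here is a minimal sketch of such a program in Python. The function name `encode` and the choice of UTF-8 are assumptions of this sketch, not part of the scheme; the comma and colon separators are taken from the examples in this post. Measuring the length after encoding to bytes is one way around the Unicode worry.

```python
def encode(text: str) -> bytes:
    """Frame one string as <LLL>,<LL>,<L>:<payload>\\n (hypothetical helper)."""
    # Assumption: UTF-8. The length is counted in bytes, not characters.
    payload = text.encode("utf-8")
    length = str(len(payload))               # L: payload length in decimal
    length_of_length = str(len(length))      # LL: number of digits in L
    lll = str(len(length_of_length))         # LLL: number of digits in LL
    if len(lll) != 1:
        # Unreachable for any string that fits in memory; kept to mirror
        # the one-digit limit of the scheme.
        raise OverflowError("length of length of length needs more than one digit")
    header = f"{lll},{length_of_length},{length}:".encode("ascii")
    return header + payload + b"\n"          # newline terminator for redundancy


if __name__ == "__main__":
    for s in ["", "foo", "dictionary"]:
        print(encode(s))
```

A non-ASCII character such as "é" occupies two bytes in UTF-8, so under this sketch it is framed with a length of 2 rather than 1.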
Some examples:
Zero-character string
1,1,0:
3-character string
1,1,3:foo
10-character string
1,2,10:dictionary
Billion-character string
2,10,1000000000:<...>
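Reading a record back is where the prefixes are supposed to pay off: the parser never has to scan for a delimiter, because every field tells it how many bytes the next field occupies. The following is a rough sketch of that, assuming the comma, colon, and newline layout of the examples above; `decode_one` and `_digits` are hypothetical names introduced for illustration.

```python
def _digits(buf: bytes, pos: int, count: int):
    """Read exactly `count` ASCII digits at `pos`; return (value, new_pos)."""
    field = buf[pos:pos + count]
    if len(field) != count or not field.isdigit():
        raise ValueError(f"expected {count} digit(s) at offset {pos}")
    return int(field), pos + count


def _expect(buf: bytes, pos: int, byte: bytes) -> int:
    """Check for a single separator byte at `pos`; return the next offset."""
    if buf[pos:pos + 1] != byte:
        raise ValueError(f"expected {byte!r} at offset {pos}")
    return pos + 1


def decode_one(buf: bytes, pos: int = 0):
    """Parse one record starting at `pos`; return (payload, next_pos)."""
    lll, pos = _digits(buf, pos, 1)       # one-digit length of length of length
    pos = _expect(buf, pos, b",")
    ll, pos = _digits(buf, pos, lll)      # length of the length
    pos = _expect(buf, pos, b",")
    n, pos = _digits(buf, pos, ll)        # payload length in bytes
    pos = _expect(buf, pos, b":")
    payload = buf[pos:pos + n]
    if len(payload) != n:
        raise ValueError("buffer ends before the payload does")
    pos = _expect(buf, pos + n, b"\n")    # redundant terminator doubles as a check
    return payload, pos


if __name__ == "__main__":
    data = b"1,1,0:\n1,1,3:foo\n1,2,10:dictionary\n"
    offset = 0
    while offset < len(data):
        item, offset = decode_one(data, offset)
        print(item)
```

Because the payload is skipped by length, it may itself contain newlines; the trailing newline is only verified, never searched for.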
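The scheme will overflow for strings longer than about 10^1000000000 bytes, which surprisingly is not that unreasonable for some hypothetical dynamically generated, randomly accessed data. At that point the length has about a billion digits, so the length of the length has 10 digits, and the first prefix would have to be 10, which no longer fits in one digit:
10,1000000000,<billion-digit number denoting the length>:<very long string of length 10^billion>
(Approximately, the list of all billion-digit numbers.)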
If we were beings who enjoyed numbers in base 27, then two lengths would suffice up to about 27^27 = 1.3 * 2^128, which should be enough for everyone.
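The 27^27 figure is easy to check with arbitrary-precision integers. This is only a quick sanity check of the arithmetic; the variable names are made up for the example.

```python
base = 27
limit = base ** base          # 27^27, which is also 3^81
ratio = limit / 2 ** 128      # how far past 2^128 this reaches

print(limit)
print(f"{ratio:.2f} * 2^128")
```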