Saturday, July 28, 2007

Optimized Sort

I want a highly optimized "sort" command for files of fixed-length records to be compared with memcmp. I can usually massage my data into such a format in linear time, so any speedup of the constant factor of the O(n log n) sort would be great. Regular Unix sort is too fancy (detecting line endings, locale-specific alphabetization), and I suspect it has significant performance loss. This optimized sort program not the easiest program in the world to write because I'm particularly interested in external sort (it does not all fit into memory), so we need to take into account things like various cache sizes and main memory, what to do if there is more than one hard drive available (and how much space is available on each), how many channels to merge (if the algorithm to be used is mergesort) at once, and how to use multiple processors.

And compression, e.g., LZO LZOP, to conserve external disk space. Sorted records should compress well.

No comments :