Quote:
Originally Posted by
DGPickett
Well, you are approaching the ideal, where the first pass creates N files with one sorted string in each, and the next pass is the final merge. If you fall short, then there needs to be at least one intermediate pass to merge to fewer strings per file. Each pass has to take the I/O time to copy the entire file, so fewer passes are better.
Quite right. But I've found there's still a sweet spot, and I'm gonna use it.
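To make the pass count concrete: with T temporaries and a merge batch of B, the merge phase needs about ceil(log_B(T)) passes, each one re-reading and re-writing the whole data set. A quick sketch (illustrative numbers only):
Code:
# Passes needed to merge T temporaries with a merge batch size of B.
T=342; B=16
awk -v t="$T" -v b="$B" 'BEGIN { print int(log(t)/log(b) - 1e-9) + 1 }'   # -> 3
# With B >= T the same formula gives 1, i.e. a single final merge pass.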
Quote:
Head motion is an old phobia, perhaps persisting because you can still hear the seeks on some drives. Often the average seek is less than the average latency. Modern drives cache everything once they arrive on cylinder, so some latency may be paid back in fast access to the cached, later sectors. A large AU and smart buffering in hardware and software help ensure more data for each possible seek. If the disk is not defragged, and especially if it has a low AU, you may have a lot of seeks in a sequential read or write.
I didn't know seeks had gotten that fast. Interesting. But somebody please tell me what AU is.
I have finished my testing and found a broad sweet spot that's about twice as fast as sort's defaults and about half as fast as just copying the data twice -- so I guess it's about as good as it's going to get. I'm going to go back to my project, but first I'll share what I've learned about GNU sort.
The default settings are to merge in batches of 16, not use extra cores, and to sort (pass 1) with the largest possible buffer for the given physical memory. On my 32 GB 64-bit machine, this takes 76014 seconds. Using cp to copy the file to the temp directory and then to the result takes 23808 seconds. That's 21.1 hours and 6.6 hours, respectively. I was unhappy with the 21 hours, but not really expecting to get down to 6.
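For reference, the two baseline runs look roughly like this (file and directory names are hypothetical):
Code:
# Default GNU sort: buffer sized from physical memory, merges in batches of 16.
time sort -T /big/tmpdir huge_input.txt -o sorted_output.txt

# "Copy twice": the data has to travel into the temp directory and back out
# again, so two cp's approximate the minimum I/O cost of any external sort.
time sh -c 'cp huge_input.txt /big/tmpdir/pass1.tmp && cp /big/tmpdir/pass1.tmp output_copy.txt'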
The buffer-size parameter you give to sort establishes the memory used for the phase 1 sort, but it results in temporary files of about 48% of that size. I timed the sort with 5 different sets of parameters, each time choosing a batch-size that allowed merging all of the temporaries in one pass. For the largest batch, this required raising the soft limit on open files to the hard limit of 4096 using the bash command 'ulimit -nS hard'. The tests cover the range of possible combinations, with buffer size limited by memory, and batch size limited by a hard limit on open files. The results clearly show that even though all tests merged the temporary files in one pass, approaching either limit resulted in significant time penalties.
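In script form, picking a one-pass batch size from the ~48% rule and raising the open-file limit looks something like this (sizes and paths are hypothetical):
Code:
# Temporaries come out at roughly 48% of --buffer-size, so a one-pass merge
# needs a batch size of about ceil(input_size / (0.48 * buffer_size)).
INPUT=huge_input.txt
BUFFER_GB=8
BATCH=$(awk -v in_bytes="$(stat -c %s "$INPUT")" -v buf_gb="$BUFFER_GB" \
        'BEGIN { printf "%d\n", int(in_bytes / (0.48 * buf_gb * 2^30)) + 1 }')

ulimit -nS hard    # raise the soft open-file limit to the hard limit (4096 here)

sort -S ${BUFFER_GB}g --batch-size="$BATCH" -T /big/tmpdir "$INPUT" -o sorted_output.txt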
Going from largest to smallest buffers, here are the buffer-size, batch-size, and time. Starting with the third run, I recorded the actual number of temporaries and am reporting that as the batch size; the parameter I actually passed was somewhat higher because it was an estimate.
Quote:
buffer-size   batch-size   time
default       135          65556 s (18.2 hours)   (the default corresponds to a parameter of ~23g)
11g           320          42605 s (11.8 hours)
8g            342          40454 s (11.2 hours)
5g            546          43172 s (12.0 hours)
510m          3947         70525 s (19.6 hours)
So 8g seems to be the sweet spot. I tried it with --parallel=4 and actually got a slight slowdown, so I guess compute speed is not the bottleneck at the sweet spot and thread management is not worth the effort. Away from the sweet spot, I had seen a speedup with 4 threads.
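Put together, the sweet-spot invocation is roughly the following (paths are hypothetical, and the batch-size parameter is set comfortably above the ~342 temporaries that the 8g buffer actually produced):
Code:
ulimit -nS hard    # allow enough open files for a one-pass merge
sort -S 8g --batch-size=400 -T /big/tmpdir huge_input.txt -o sorted_output.txt

# The threaded variant was slightly slower at the sweet spot:
sort -S 8g --batch-size=400 --parallel=4 -T /big/tmpdir huge_input.txt -o sorted_output.txt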