Help optimizing sort of large files


 
# 36  
Old 11-21-2014
I'm not going to load a database, because the results of the sort will be used just once, and as a practical matter may be passed in a pipe without ever hitting the filesystem. There is an output file for these tests, but only for testing, and to make the results more generally relevant to anyone else who might read this.

My sort times are generally within a factor of 2 of the cost of copying the file to temp and then to the output, so thrashing and compute overhead are not horrible. I'm not going to write a separate sort, or any part of one, because it would take too long to get right, and I'm not going to use multiple invocations of sort(1) because the disk I/O would clearly eat any benefit.

I've re-run my timing scripts on the small test file, and some comments I made earlier have to be corrected. The differences between runs are not that alarming after all, and are easily explained by differences in other competing activities on the same machine. My speedups so far are more modest than I thought, but --parallel=4 really does give me 20%, and there's about another 20% available from jiggering parameters.

Running after a fresh boot, I noticed some things that surprised me, though perhaps they should not have. By the time testing is done, the kernel has filled 64GB of memory, mostly with "cached" blocks, and has swapped out a little over 3 MB of memory. I presume it's swapping idle daemons. So these results will not scale up for files large enough to do a complete cache wipe.

I've pretty much determined that the main thing to avoid is getting more than one merge pass on the temporaries. I think it's time to try just a few things with my TB-sized files, because I know those were doing at least 2 extra passes with the default parameters. It took forever. I think the defaults are 4GB buffers (1/8 of real memory) and merges of 16 files. The buffers seem to have a lot of overhead, so the temp files are smaller than you might expect. On a 1TB file, that works out to roughly 500 2-GB temp files, and 3 levels of merge. The question is: given a choice, is it better to use a bigger buffer, or a wider merge? I'm betting 4GB buffers are too big, but I need to do some testing.
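In GNU sort terms, those two knobs are --buffer-size and --batch-size (plus -T for the temp drive). A minimal sketch of the two directions, with placeholder paths, sizes, and file names:
Code:
# Placeholder paths; temporaries go to a spare drive in both cases.
# Bigger buffer: fewer, larger temporaries, merged 16 at a time by default.
sort --buffer-size=4g -T /mnt/spare/tmp -o sorted.out input.dat

# Wider merge: smaller temporaries, with a batch size large enough that they
# can all be merged in a single pass.
sort --buffer-size=1g --batch-size=128 -T /mnt/spare/tmp -o sorted.out input.dat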
# 37  
Old 11-21-2014
I suppose you could write a sort that used a tournament space as big as available RAM to write temp files each containing one sorted sequence, and then all the files can be merged by sort -m, or a tree of sort -m's: an always-two-pass sort.
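A rough shell sketch of that two-pass shape follows; the chunk size and file names are illustrative, and fixed-size chunks stand in for the tournament writer, which would produce longer runs:
Code:
# Pass 1: cut the input into RAM-sized chunks and sort each into a run file.
split --line-bytes=8G input.dat chunk.
for f in chunk.*; do
    sort --buffer-size=10g -o "$f.sorted" "$f" && rm "$f"
done

# Pass 2: one wide merge of all the sorted runs (needs an open-file limit
# at least as large as the batch size).
sort -m --batch-size=256 -o sorted.out chunk.*.sorted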

The idea of the progressive merge sort was to live with the cache and RAM limits dynamically without knowing where they are. And in real life, sometimes the amount of available RAM or CACHE varies with other activities, so it is not good to have a fragile fit.

BTW, the tournament write can write strings far longer than the tournament table size. If the input was already sorted, it would be written as one string. You write the smallest item >= the last item written, so with random data I am guessing a 64K-leg tournament would write about 96K items per sorted string, but my math is a bit weak. Someone show me up.

If you wanted to code it in a single process, you could write all sorted strings to one temp drive, write the starting offsets of each sorted string to a second temp file, and fix the tournament size at something fitting comfortably within the upper cache. The input file could be mmap()'d so the actual records to be written are in RAM, with just offsets in the tournament table. After the first pass, you know you have N strings to merge, so mmap() the output files and merge them. Since there are just 3 files open at a time, no fd limit woes. The mmap() lets you read from N different points in the first temp file. The one temp-file write of all input data creates a huge change in working set size. Writing just one main file at a time minimizes write-buffer cost. The original reading, although through mmap() page faults, is generally sequential. The second read uses just N to 2N pages to buffer the sequential string reads. The latency is even pretty low, essentially 0 from last read to first write.
# 38  
Old 11-23-2014
Quote:
Originally Posted by DGPickett
I suppose you could write a sort that used a tournament space as big as available RAM to write temp files each containing one sorted sequence, and then all the files can be merged by sort -m, or a tree of sort -m's: an always-two-pass sort.

The idea of the progressive merge sort was to live with the cache and RAM limits dynamically without knowing where they are. And in real life, sometimes the amount of available RAM or CACHE varies with other activities, so it is not good to have a fragile fit.

BTW, the tournament write can write strings far longer than the tournament table size. If the input was already sorted, it would be written as one string. You write the smallest item >= the last item written, so with random data I am guessing a 64K-leg tournament would write about 96K items per sorted string, but my math is a bit weak. Someone show me up.

If you wanted to code it in a single process, you could write all sorted strings to one temp drive, write the starting offsets of each sorted string to a second temp file, and fix the tournament size at something fitting comfortably within the upper cache. The input file could be mmap()'d so the actual records to be written are in RAM, with just offsets in the tournament table. After the first pass, you know you have N strings to merge, so mmap() the output files and merge them. Since there are just 3 files open at a time, no fd limit woes. The mmap() lets you read from N different points in the first temp file. The one temp-file write of all input data creates a huge change in working set size. Writing just one main file at a time minimizes write-buffer cost. The original reading, although through mmap() page faults, is generally sequential. The second read uses just N to 2N pages to buffer the sequential string reads. The latency is even pretty low, essentially 0 from last read to first write.
I may be dense, but I don't see the point in such suggestions. My files are huge, much bigger than any SSD I can afford. No matter how you cut it, the file won't fit in RAM or my SSD drive, and there's going to be a lot of comparison, and a lot of disk head motion in the process of the sort. I'm using GNU sort because I have it and it works out of the box in such cases.

Having played around with my test file long enough to become familiar with the basics, and get a more accurate picture of how sort works, I sorted my 1.4TB file starting noon yesterday using sort's default settings other than directing temporary files to an empty spare drive.

It started at around 12:30 PM yesterday and finished at 9:37 AM today. I guess I had misread the code that computes the buffer size; I was expecting 2GB temporaries, and in fact they were more like 11.24 GB, and there were 117 such files, the last one shorter. It took around 4 minutes to create the early ones, and they were all done in about 8.5 hours. Then 112 of them were merged in 7 batches of 16, creating files of 179.8 GB each, taking about 7.5 hours. Finally the 7 large and the 5 remaining small temporaries were merged into the output in about 6 hours.

For comparison, just copying the file to the temp drive, and then to the result drive, took 6.6 hours, compared to about 21 hours for the sort.

I'm hoping to cut hours off that time by setting a bigger --batch-size so that none of the data is merged twice. I may also fool with the --buffer-size to see if there's a way to reduce compute overhead. I'll start by cutting the temporary size in half and doubling the batch size.
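The arithmetic behind "no second merge pass" is just a ceiling division. A rough planning sketch with the figures from the run above (awk is only used for the floating-point math):
Code:
# If the temporaries shrink to half of 11.24 GB, the batch size must grow so
# that all of them still merge in one pass.
awk 'BEGIN {
    input_gb = 1400              # ~1.4 TB input
    temp_gb  = 11.24 / 2         # temporaries at half the previous size
    n = int(input_gb / temp_gb) + 1
    print "expected temporaries ~" n ", so --batch-size must be at least " n
}'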

# 39  
Old 11-23-2014
Quote:
Originally Posted by kogorman3
I may be dense, but I don't see the point in such suggestions. My files are huge, much bigger than any SSD I can afford.
You don't have to fit the whole file in the SSD.
Quote:
No matter how you cut it, the file won't fit in RAM or my SSD drive, and there's going to be a lot of comparison, and a lot of disk head motion in the process of the sort.
I think his suggestion amounted to what I said earlier -- 'sort the largest chunks that will fit in RAM'. That could be done on an SSD.

If we're back to using ordinary sort without bothering to tune it for multiprocessing or external sorting, or even using that high-performance SSD at all, we are firmly back in "something for nothing" territory. Have fun.
# 40  
Old 11-25-2014
Quote:
Originally Posted by Corona688
You don't have to fit the whole file in the SSD. I think his suggestion amounted to what I said earlier -- 'sort the largest chunks that will fit in RAM'. That could be done on an SSD.

If we're back to using ordinary sort without bothering to tune it for multiprocessing or external sorting, or even using that high-performance SSD at all, we are firmly back in "something for nothing" territory. Have fun.
I am. What would not be fun is re-writing the merge section of GNU sort to make special use of an SSD, and verifying that it is correct and robust. I don't have a spare SSD anyway, so it wouldn't help. My current SSD is about full, though it does have a 32GB swap area.

I do have more test results. I finished another sort of my large file, using a batch-size parameter large enough to merge all of the temporaries at once. This meant increasing the default of 16 to at least 117. I chose 135. The default sort took 21.1 hours. The tweaked sort took 18.2, about a 14% improvement. I'll also check the effect of making smaller temporaries (reduce the in-RAM merge footprint) and widening the file merge some more. This will require figuring out the relationship of parameters to temporary size -- I already know it's not identical; it appears the buffer-size I request is the size of some internal structure, not the size of the temporary.

Of the 18.2 hours that sort took, 8.1 were user time (doing the compares, presumably) and 2.2 were system time (making I/O calls, managing buffers, handling TLB misses, page faults and such). The remainder, another 8 hours or so, is about double the unaccounted time in a pure copy. I suspect the difference is the additional time spent in head seeks that a copy of unfragmented files doesn't need. I count that as I/O time.

Without an SSD big enough to hold all the temporaries, I don't see how to reduce that I/O time. The suggestions I've seen have not painted a coherent picture for me of how that would go with less I/O than I have now. The first stage of the merge sort is already in RAM and reads and writes sequentially on separate spindles. It's just the file merges that do the seeking, and that involves all of the data. How are you going to do that without seeking on a drive large enough to hold it all? Nobody needs to answer that, because I don't even have the smaller SSD.

My data is not at all uniform; someone surmised that it was while contemplating a radix sort. Instead, it clusters like mad in ways that vary between datasets. So the sort buckets would be of unpredictable sizes and require space allocation, and I'm afraid it would be hard to implement in two passes over the data, which I think is required if it's going to beat GNU sort; it could end up with much the same seeking behavior anyway.

When you think about this, consider that my largest drives are 2 TB, and my input data occupies more than half that space. And this is not the largest dataset I'm going to have (I'm just starting the main project). Any approach that writes a bunch of intermediate collections of data is going to be spread all over the disk, whether it's in one file or many. It's gonna seek.

---------- Post updated 11-25-14 at 09:49 AM ---------- Previous update was 11-24-14 at 08:21 PM ----------

And in a victory of data over speculation, I've pretty much convinced myself that attempting to create longer temporaries by sorting more stuff in RAM, let alone SSD, is counterproductive. The new results come from changing the parameters from simply
Code:
 export param='--batch-size=135'

to
Code:
export param='--batch-size=320 --buffer-size=11g'

This results in temporaries about half the size, and an improvement in overall time from 18.2 hours to 11.8, just from using shorter temporaries (and increasing the batch size so they're all merged at once).

The user time did not change that much, though it did decrease from 8.7 to 7.6 hours. System time stayed about the same. I don't know why the unaccounted real time decreased by about 6.5 hours. Since it contradicts my intuition, I'll run a few more tests with other parameters to verify the trend.

If the trend holds up, I'll be decreasing the temporary size as much as possible, and making corresponding increases in batch size, until I reach a limit or a sweet spot.
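For the record, the invocations look roughly like this; the temp and output paths and the input name are stand-ins for the real ones, and /usr/bin/time -v is just one way to get the user/system/elapsed split reported above:
Code:
# $param is the variable exported above, left unquoted so its options split;
# -v also reports page faults and context switches, which helps account for
# the time that is neither user nor system.
/usr/bin/time -v sort $param \
    --temporary-directory=/mnt/spare/tmp \
    --output=/mnt/result/sorted.out \
    input.dat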
# 41  
Old 11-26-2014
Well, you are approaching the ideal, where the first pass creates N files with one sorted string each, and the next pass is the final merge. If you fall short, then there needs to be at least one intermediate pass to merge to fewer strings per file. Each pass has to pay the I/O time of copying the entire file, so fewer passes are better.

Head motion is an old phobia, perhaps persisting because you can still hear the seeks on some drives. Often the average seek is less than the average latency. Modern drives cache everything once they arrive on cylinder, so some latency may be paid back in fast access to the cached, later sectors. A large AU and smart buffering in hardware and software help ensure more data for each possible seek. If the disk is not defragged, and especially if it has a small AU, you may have a lot of seeks in a sequential read or write.
# 42  
Old 11-28-2014
Quote:
Originally Posted by DGPickett
Well, you are approaching the ideal, where the first pass creates N files with one sorted string each, and the next pass is the final merge. If you fall short, then there needs to be at least one intermediate pass to merge to fewer strings per file. Each pass has to pay the I/O time of copying the entire file, so fewer passes are better.
Quite right. But I've found there's still a sweet spot, and I'm gonna use it.
Quote:
Head motion is an old phobia, perhaps persisting because you can still hear the seeks on some drives. Often the average seek is less than the average latency. Modern drives cache everything once they arrive on cylinder, so some latency may be paid back in fast access to the cached, later sectors. A large AU and smart buffering in hardware and software help ensure more data for each possible seek. If the disk is not defragged, and especially if it has a small AU, you may have a lot of seeks in a sequential read or write.
I didn't know seeks had gotten that fast. Interesting. But somebody please tell me what AU is.

I have finished my testing, found a broad sweet spot that's about twice as fast as sort's defaults, and about half as fast as just copying twice -- so I guess it's as good as anything is going to be. I'm going to go back to my project, but first I'll give what I've learned about GNU sort.

The default settings are to merge batches of 16 and not use extra cores, and to sort (pass 1) with the largest possible buffer for the given physical memory. On my 32 GB 64-bit machine, this takes 76014 seconds. Using cp to copy to the temp directory, and then to the result, takes 23808 seconds. That's 21.1 hours versus 6.6 hours. I was unhappy with the 21 hours, but not really expecting to get down to 6.

The buffer-size parameter you give to sort establishes the memory used for the phase 1 sort, but it results in temporary files of about 48% of that size. I timed the sort with 5 different sets of parameters, each time choosing a batch-size that allowed merging all of the temporaries in one pass. For the largest batch, this required raising the soft limit on open files to the hard limit of 4096 using the bash command 'ulimit -nS hard'. The tests cover the range of possible combinations, with buffer size limited by memory, and batch size limited by a hard limit on open files. The results clearly show that even though all tests merged the temporary files in one pass, approaching either limit resulted in significant time penalties.
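A rough sketch of that test loop, with placeholder paths, and with batch sizes taken from the counts reported below rather than the exact (higher) parameters actually passed:
Code:
# Raise the soft open-file limit to the hard limit so a wide one-pass merge
# is allowed, then time each candidate buffer size.
ulimit -nS hard

for bufbatch in "11g 320" "8g 342" "5g 546" "510m 3947"; do
    set -- $bufbatch
    echo "buffer=$1 batch=$2"
    time sort --buffer-size=$1 --batch-size=$2 \
        -T /mnt/spare/tmp -o /mnt/result/sorted.out input.dat
done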

Going from largest to smallest buffers, here are the buffer-size, batch-size and time. Starting with the third, I recorded the actual number of temporaries, and I'm reporting that as the batch size; the actual parameter was somewhat higher because it was an estimate.
Quote:
buffer    batch   time
default   135     65556s (18.2 hours)   (the default corresponds to a parameter of ~23g)
11g       320     42605s (11.8 hours)
8g        342     40454s (11.2 hours)
5g        546     43172s (12.0 hours)
510m      3947    70525s (19.6 hours)
So 8g seems to be the sweet spot. I tried it with --parallel=4 and actually got a slight slowdown, so I guess compute speed is not the bottleneck in the sweet spot and thread management is not worth the effort. Away from the sweet spot, I had seen a speedup with 4 threads.
