Help optimizing sort of large files


 
# 15  
Old 11-12-2014
Well, you can swap out the world by using mmap() aggressively: good for raw speed, bad for retaining control! File pages pulled into RAM stay in the page cache, even after they are unmapped, until memory demand pushes out the least recently touched ones -- the residency check sketched below makes that easy to see.

SSD-based swap is like another, faster storage tier, but it holds the control structures, not the data (the data stays in the mmap()'d space, not in swap). Sorting this way is like building an index in the heap, whereas sort proper merges temporary files over and over, producing ever-longer sorted runs that grow into a sorted whole. It reminds me of the old tape sorts with drives reading backward: the data was distributed onto all but one drive in ascending runs, then they all read backward while writing to the free drive until one hit BOT; that drive became a written volume and the previously written volume started reading backward. No rewind time.
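
If you want to watch that page-cache retention happen, the (non-POSIX, but Linux/BSD) mincore() call reports which pages of a mapping are already resident before you touch anything. Run something like this rough sketch twice on a big file and the second run shows most pages already in RAM:

Code:
/* Rough sketch only.  mincore() is not POSIX; Linux and the BSDs have it. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (!vec) { perror("malloc"); return 1; }

    /* Ask which pages of this brand-new mapping are already in RAM.
     * Pages cached by an earlier pass over the file show up as resident
     * even though this process has not touched them yet. */
    if (mincore(p, (size_t)st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages already resident\n", resident, npages);

    free(vec);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}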

Last edited by DGPickett; 11-12-2014 at 04:28 PM..
# 16  
Old 11-13-2014
Quote:
Originally Posted by DGPickett
I wrote a "locality of reference" sort once: mmap64() the file and start sorting by making 2 lists of one line, then a linked list of 2 lines sorted, twice, and then merge them, then again, then merge the two sets of 4, etc.
This seems to be what is classically called a merge sort. The Wikipedia article on merge sort also gives its Landau (big-O) runtime estimates: O(n log n) comparisons in the worst case.
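
The heart of it is the merge step over already-sorted runs. A minimal sketch over a singly linked list of lines -- the node layout and names are only illustrative, not the code from the quoted post, and it uses the top-down recursive form rather than the bottom-up pairing described above (the merge itself is identical):

Code:
#include <stdio.h>
#include <string.h>

/* One node per line; in the quoted scheme 'line' would point into the
 * mmap()'d file.  Layout is illustrative only. */
struct node {
    const char  *line;
    struct node *next;
};

/* Merge two already-sorted lists into one sorted list. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head, *tail = &head;

    while (a && b) {
        if (strcmp(a->line, b->line) <= 0) {
            tail->next = a; a = a->next;
        } else {
            tail->next = b; b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Classic merge sort: split in half, sort the halves, merge.  O(n log n). */
static struct node *merge_sort(struct node *list)
{
    if (!list || !list->next)
        return list;

    /* Find the middle with slow/fast pointers, then cut the list. */
    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    struct node *second = slow->next;
    slow->next = NULL;

    return merge(merge_sort(list), merge_sort(second));
}

int main(void)
{
    const char *lines[] = { "pear", "apple", "orange", "banana" };
    struct node nodes[4], *head = NULL;

    for (int i = 3; i >= 0; i--) {          /* build a small test list */
        nodes[i].line = lines[i];
        nodes[i].next = head;
        head = &nodes[i];
    }
    for (struct node *p = merge_sort(head); p; p = p->next)
        printf("%s\n", p->line);
    return 0;
}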

I hope this helps.

bakunin
# 17  
Old 11-13-2014
If the OP knew how to write a faster sort program using mmap(), this thread wouldn't exist.

Besides, all mmap() does at the application level is turn reading/writing data from/to a file into a memory access - the underlying read()/write() from/to disk still has to happen. If you know how to code IO operations, it's not hard to beat mmap() performance with standard read()/write() system calls, because you know your data access pattern and can tune your IO operations to the specifics of the underlying file system and hardware. mmap() IO is generic and untuned.

And mmap() has some problems with writing data, especially if you try to extend your file and run out of space.

The quickest and easiest way to make sorting data faster is to use more memory. An SSD swap device does that indirectly by providing a lot of extremely fast swap, while being a lot bigger than any reasonable amount of actual RAM.
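
To make the point about tuned read() I/O concrete, here is a rough sketch of a sequential pass that declares its access pattern up front and reads in large chunks. The 8 MiB chunk size is only a plausible starting point, not a recommendation for any particular hardware:

Code:
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MiB per read(); tune for your storage */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel the access pattern so it can read ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    unsigned long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0) {
        /* ... parse/sort this chunk here ... */
        total += (unsigned long long)n;
    }
    if (n < 0) perror("read");

    printf("read %llu bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}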

Last edited by achenle; 11-13-2014 at 04:49 PM..
# 18  
Old 11-13-2014
Quote:
Originally Posted by achenle
If you know how to code IO operations, it's not hard to beat mmap() performance with standard read()/write() system calls, because you know your data access pattern and can tune your IO operations to the specifics of the underlying file system and hardware. mmap() IO is generic and untuned.
I think you're overstating the benefits of tuning; the operating system will match I/O to memory page sizes and do readahead and such for you anyway without being asked. All I/O ends up in the generic page cache unless you do direct I/O and manage your caching closely -- and there are POSIX-standard means to do that for both files and memory (posix_fadvise() and posix_madvise(); see the sketch below).
Quote:
The quickest and easiest way to make sorting data faster is use more memory. An SSD swap device will do that indirectly by providing a lot of extremely fast swap, along with being a lot bigger than any reasonable amount of actual RAM.
Agreed. RAM is what it comes down to in the end: reducing the number of times you need to re-read the same data.
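
For completeness, those POSIX-standard knobs look roughly like this -- posix_fadvise() for file-cache hints, posix_madvise() for a mapping. The calls below are illustrative only; they are hints the kernel is free to ignore:

Code:
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise()/posix_madvise() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    /* File-side hint: we intend to read the file sequentially. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Memory-side hints on a read-only mapping of the same file. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    posix_madvise(p, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);

    /* ... scan the mapping here ... */

    /* Done with the data: hint that the pages and cache can go (advisory). */
    posix_madvise(p, (size_t)st.st_size, POSIX_MADV_DONTNEED);
    munmap(p, (size_t)st.st_size);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}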
# 19  
Old 11-13-2014
Generally it's possible to beat default page-cached IO by 10-20% - but only if you can accurately predict your access pattern. And if you screw up and the data isn't accessed in the pattern you thought it would be, you can really destroy performance, especially on RAID5/6 disk systems.

Basically, the default IO is pretty darn good almost all the time. Don't try to beat it unless you really have to - for example, a $600 million system where there's more data to process than you could ever get through, and the amount you process is limited by your IO speed.
# 20  
Old 11-14-2014
Some results

Thanks to all for your comments. I was asking for ways to tune UNIX sort, because while I know how, I'm unwilling to rewrite it for this project -- I'm likely to be mired in bugs for too long.

I did some timing tests, and despite bugs that are going to make me redo them, there are some rough results. These are on a 14 GB test file with records of 64 bytes plus a newline.

First, I quickly abandoned the idea of having sort compress its temporary files (the --compress-program option). Using gzip, even with --fast, costs 10% to 20% in speed; --best is far worse, around a 1500% loss.

Second, adding 10 to the --batch-size parameter costs about 10% in speed, until the increase lets you eliminate a merge pass. At that point it turns into a gain of about 30%. That's the sweet spot, because the runtime starts climbing again if you enlarge the batch further.

Third, raising the --parallel parameter is a win if you have multiple cores. Not huge: about 10% for each doubling, from 1 to 2 or from 2 to 4.

Finally, changing --buffer-size from 1g to 11g was a big loss -- roughly doubling the execution time. I don't know whether there's a sweet spot in between; I'll have to do finer-grained testing, or testing on a larger input. I suspect a large buffer only pays when it's the only way to reduce the number of merge passes.

So, to a first approximation, it's best to raise --parallel and keep --buffer-size and --batch-size small, so long as (buffer-size * 0.4) * batch-size is at least as big as the input file. That gives the minimum of two passes through the data. The 0.4 reflects the observation that the temporaries come out a bit smaller than half the requested buffer size, at least on my data; for this 14 GB file with --buffer-size=1g, that means a --batch-size of at least 35 (see the sizing sketch below).

The additional testing may tell me something about how to balance --buffer-size and --batch-size subject to the above formula.
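
Here is a little back-of-the-envelope helper for that sizing rule; the 0.4 temporary-file ratio is only what I observed on my data and may well differ elsewhere. Compile with -lm.

Code:
#include <math.h>
#include <stdio.h>

/* Minimum --batch-size so one merge pass covers all the temporaries:
 * need (buffer_gb * temp_ratio) * batch >= input_gb. */
static long min_batch(double input_gb, double buffer_gb, double temp_ratio)
{
    return (long)ceil(input_gb / (buffer_gb * temp_ratio));
}

int main(void)
{
    const double temp_ratio = 0.4;     /* observed: temps ~40% of buffer size */
    const double input_gb   = 14.0;    /* the 14 GB test file */

    for (double buf = 1.0; buf <= 4.0; buf *= 2)
        printf("--buffer-size=%.0fg  ->  --batch-size >= %ld\n",
               buf, min_batch(input_gb, buf, temp_ratio));
    return 0;
}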
# 21  
Old 11-14-2014
sort is a merge sort, which has no need or use for gigantic memory buffers -- unless you want to use more memory than is available and eat into swap, that is. If you have very high-performance swap, that can be useful. Otherwise, leave --buffer-size out and let sort manage it itself.

--parallel should be a big performance gain -- if you have enough memory that it doesn't need to thrash your disk, and fast enough disks to keep up. If not, it will just make things worse.

I don't see any something-for-nothing solutions here. You won't squeeze out anything but a few percent here and there unless you deal with the bottlenecks. Every time you tell it "use more resources" and it slows down, that's a bottleneck. Every time you tell it "use fewer files" and it speeds up, that's a bottleneck.

1) More RAM -- the more the OS can cache, the less it has to wait on the disk. Brute force, but there's a reason RAM is popular: it works really well.
2) A different temp space. If you put /tmp/ on a different disk spindle than the file you are sorting, you get the bandwidth of two disks instead of splitting the bandwidth of one disk several ways (and you eliminate a lot of disk-thrashing time). It doesn't have to be /tmp/, of course; sort -T puts the temporary files wherever you ask.
3) Faster swap. Eat up more RAM than you have available and depend on an SSD to make up the difference. This could be good, though it sounds rather complicated to me.

Last edited by Corona688; 11-14-2014 at 12:29 PM..