Help optimizing sort of large files


 
# 15  
Old 11-12-2014
Well, you can swap out the world by using mmap() aggressively: good for raw speed, bad for retaining control! File pages pulled into RAM stay in the page cache, even after they are unmapped, until memory demand pushes out the least recently touched ones -- the residency check sketched below makes that easy to see.

SSD-based swap is like another, faster storage tier, but it holds the control structures, not the data (the data stays in the mmap()'d space, not in swap). Sorting this way is like building an index in the heap, whereas sort proper merges temporary files over and over, producing ever-longer sorted runs that grow into a sorted whole. It reminds me of the old tape sorts with drives reading backward: the data was distributed onto all but one drive in ascending runs, then they all read backward while writing to the free drive until one hit BOT; that drive became a written volume and the previously written volume started reading backward. No rewind time.
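
If you want to watch that page-cache retention happen, the (non-POSIX, but Linux/BSD) mincore() call reports which pages of a mapping are already resident before you touch anything. Run something like this rough sketch twice on a big file and the second run shows most pages already in RAM:

Code:
/* Rough sketch only.  mincore() is not POSIX; Linux and the BSDs have it. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (!vec) { perror("malloc"); return 1; }

    /* Ask which pages of this brand-new mapping are already in RAM.
     * Pages cached by an earlier pass over the file show up as resident
     * even though this process has not touched them yet. */
    if (mincore(p, (size_t)st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages already resident\n", resident, npages);

    free(vec);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}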

Last edited by DGPickett; 11-12-2014 at 04:28 PM..
# 16  
Old 11-13-2014
Quote:
Originally Posted by DGPickett
I wrote a "locality of reference" sort once: mmap64() the file and start sorting by making 2 lists of one line, then a linked list of 2 lines sorted, twice, and then merge them, then again, then merge the two sets of 4, etc.
This seems to be what is classically called a merge sort. The Wikipedia article on merge sort also gives its Landau (big-O) runtime estimates: O(n log n) comparisons in the worst case.
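
The heart of it is the merge step over already-sorted runs. A minimal sketch over a singly linked list of lines -- the node layout and names are only illustrative, not the code from the quoted post, and it uses the top-down recursive form rather than the bottom-up pairing described above (the merge itself is identical):

Code:
#include <stdio.h>
#include <string.h>

/* One node per line; in the quoted scheme 'line' would point into the
 * mmap()'d file.  Layout is illustrative only. */
struct node {
    const char  *line;
    struct node *next;
};

/* Merge two already-sorted lists into one sorted list. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head, *tail = &head;

    while (a && b) {
        if (strcmp(a->line, b->line) <= 0) {
            tail->next = a; a = a->next;
        } else {
            tail->next = b; b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Classic merge sort: split in half, sort the halves, merge.  O(n log n). */
static struct node *merge_sort(struct node *list)
{
    if (!list || !list->next)
        return list;

    /* Find the middle with slow/fast pointers, then cut the list. */
    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    struct node *second = slow->next;
    slow->next = NULL;

    return merge(merge_sort(list), merge_sort(second));
}

int main(void)
{
    const char *lines[] = { "pear", "apple", "orange", "banana" };
    struct node nodes[4], *head = NULL;

    for (int i = 3; i >= 0; i--) {          /* build a small test list */
        nodes[i].line = lines[i];
        nodes[i].next = head;
        head = &nodes[i];
    }
    for (struct node *p = merge_sort(head); p; p = p->next)
        printf("%s\n", p->line);
    return 0;
}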

I hope this helps.

bakunin
# 17  
Old 11-13-2014
If the OP knew how to write a faster sort program using mmap(), this thread wouldn't exist.

Besides, all mmap() does at the application level is turn reading/writing data from/to a file into a memory access - the underlying read()/write() from/to disk still has to happen. If you know how to code IO operations, it's not hard to beat mmap() performance with standard read()/write() system calls, because you know your data access pattern and can tune your IO operations to the specifics of the underlying file system and hardware. mmap() IO is generic and untuned.

And mmap() has some problems with writing data, especially if you try to extend your file and run out of space.

The quickest and easiest way to make sorting data faster is to use more memory. An SSD swap device does that indirectly by providing a lot of extremely fast swap, while being a lot bigger than any reasonable amount of actual RAM.
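
To make the point about tuned read() I/O concrete, here is a rough sketch of a sequential pass that declares its access pattern up front and reads in large chunks. The 8 MiB chunk size is only a plausible starting point, not a recommendation for any particular hardware:

Code:
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MiB per read(); tune for your storage */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel the access pattern so it can read ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    unsigned long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0) {
        /* ... parse/sort this chunk here ... */
        total += (unsigned long long)n;
    }
    if (n < 0) perror("read");

    printf("read %llu bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}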

Last edited by achenle; 11-13-2014 at 04:49 PM..
# 18  
Old 11-13-2014
Quote:
Originally Posted by achenle
If you know how to code IO operations, it's not hard to beat mmap() performance with standard read()/write() system calls, because you know your data access pattern and can tune your IO operations to the specifics of the underlying file system and hardware. mmap() IO is generic and untuned.
I think you're overstating the benefits of tuning; the operating system will match I/O to memory page sizes and do readahead and such for you anyway without being asked. All I/O ends up in the generic page cache unless you do direct I/O and manage your caching closely -- and there are POSIX-standard means to do that for both files and memory (posix_fadvise() and posix_madvise(); see the sketch below).
Quote:
The quickest and easiest way to make sorting data faster is use more memory. An SSD swap device will do that indirectly by providing a lot of extremely fast swap, along with being a lot bigger than any reasonable amount of actual RAM.
Agreed. RAM is what it comes down to in the end: reducing the number of times you need to re-read the same data.
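
For completeness, those POSIX-standard knobs look roughly like this -- posix_fadvise() for file-cache hints, posix_madvise() for a mapping. The calls below are illustrative only; they are hints the kernel is free to ignore:

Code:
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise()/posix_madvise() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    /* File-side hint: we intend to read the file sequentially. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Memory-side hints on a read-only mapping of the same file. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    posix_madvise(p, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);

    /* ... scan the mapping here ... */

    /* Done with the data: hint that the pages and cache can go (advisory). */
    posix_madvise(p, (size_t)st.st_size, POSIX_MADV_DONTNEED);
    munmap(p, (size_t)st.st_size);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}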
# 19  
Old 11-13-2014
Generally it's possible to beat default page-cached IO by 10-20% - but only if you can accurately predict your access pattern. And if you screw up and the data isn't accessed in the pattern you thought it would be, you can really destroy performance, especially on RAID5/6 disk systems.

Basically, the default IO is pretty darn good almost all the time. Don't try to beat it unless you really have to - for example, a $600 million system where there's more data to process than you could ever get through, and the amount you process is limited by your IO speed.
# 20  
Old 11-14-2014
Some results

Thanks to all for your comments. I was asking for ways to tune UNIX sort, because while I know how, I'm unwilling to rewrite it for this project -- I'm likely to be mired in bugs for too long.

I did some timing tests, and despite bugs that are going to make me redo them, there are some rough results. These are on a 14 GB test file with records of 64 bytes plus a newline.

First, I quickly abandoned the idea of having sort compress its temporary files (the --compress-program option). Using gzip, even with --fast, costs 10% to 20% in speed; --best is far worse, around a 1500% loss.

Second, adding 10 to the --batch-size parameter costs about 10% in speed, until the increase lets you eliminate a merge pass. At that point it turns into a gain of about 30%. That's the sweet spot, because the runtime starts climbing again if you enlarge the batch further.

Third, raising the --parallel parameter is a win if you have multiple cores. Not huge: about 10% for each doubling, from 1 to 2 or from 2 to 4.

Finally, changing --buffer-size from 1g to 11g was a big loss -- roughly doubling the execution time. I don't know whether there's a sweet spot in between; I'll have to do finer-grained testing, or testing on a larger input. I suspect a large buffer only pays when it's the only way to reduce the number of merge passes.

So, to a first approximation, it's best to raise --parallel and keep --buffer-size and --batch-size small, so long as (buffer-size * 0.4) * batch-size is at least as big as the input file. That gives the minimum of two passes through the data. The 0.4 reflects the observation that the temporaries come out a bit smaller than half the requested buffer size, at least on my data; for this 14 GB file with --buffer-size=1g, that means a --batch-size of at least 35 (see the sizing sketch below).

The additional testing may tell me something about how to balance --buffer-size and --batch-size subject to the above formula.
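
Here is a little back-of-the-envelope helper for that sizing rule; the 0.4 temporary-file ratio is only what I observed on my data and may well differ elsewhere. Compile with -lm.

Code:
#include <math.h>
#include <stdio.h>

/* Minimum --batch-size so one merge pass covers all the temporaries:
 * need (buffer_gb * temp_ratio) * batch >= input_gb. */
static long min_batch(double input_gb, double buffer_gb, double temp_ratio)
{
    return (long)ceil(input_gb / (buffer_gb * temp_ratio));
}

int main(void)
{
    const double temp_ratio = 0.4;     /* observed: temps ~40% of buffer size */
    const double input_gb   = 14.0;    /* the 14 GB test file */

    for (double buf = 1.0; buf <= 4.0; buf *= 2)
        printf("--buffer-size=%.0fg  ->  --batch-size >= %ld\n",
               buf, min_batch(input_gb, buf, temp_ratio));
    return 0;
}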
# 21  
Old 11-14-2014
sort is a merge sort, which has no need or use for gigantic memory buffers -- unless you want to use more memory than is available and eat into swap, that is. If you have very high-performance swap, that can be useful. Otherwise, leave --buffer-size out and let sort manage it itself.

--parallel should be a big performance gain -- if you have enough memory that it doesn't need to thrash your disk, and fast enough disks to keep up. If not, it will just make things worse.

I don't see any something-for-nothing solutions here. You won't squeeze out anything but a few percent here and there unless you deal with the bottlenecks. Every time you tell it "use more resources" and it slows down, that's a bottleneck. Every time you tell it "use fewer files" and it speeds up, that's a bottleneck.

1) More RAM -- the more the OS can cache, the less it has to wait on the disk. Brute force, but there's a reason RAM is popular: it works really well.
2) A different temp space. If you put /tmp/ on a different disk spindle than the file you are sorting, you get the bandwidth of two disks instead of splitting the bandwidth of one disk several ways (and you eliminate a lot of disk-thrashing time). It doesn't have to be /tmp/, of course; sort -T puts the temporary files wherever you ask.
3) Faster swap. Eat up more RAM than you have available and depend on an SSD to make up the difference. This could be good, though it sounds rather complicated to me.

Last edited by Corona688; 11-14-2014 at 12:29 PM..