Quote:
Originally Posted by
DGPickett
Well, you are approaching the ideal, where the first pass creates N files with one sorted string in each, and the next pass is the final merge. If you fall short, then there needs to be at least one intermediate pass to merge to fewer strings per file. Each pass has to take the I/O time to copy the entire file, so fewer passes are better.
Quite right. But I've found there's still a sweet spot, and I'm gonna use it.
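To make the pass count concrete: with T temporaries and a merge batch of B, the merge phase needs about ceil(log_B(T)) passes, each one re-reading and re-writing the whole data set. A quick sketch (illustrative numbers only):
Code:
# Passes needed to merge T temporaries with a merge batch size of B.
T=342; B=16
awk -v t="$T" -v b="$B" 'BEGIN { print int(log(t)/log(b) - 1e-9) + 1 }'   # -> 3
# With B >= T the same formula gives 1, i.e. a single final merge pass.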
Quote:
Head motion is an old phobia, perhaps persisting because you can still hear the seeks on some drives. Often the average seek is less than the average latency. Modern drives cache everything once they arrive on cylinder, so some latency may be paid back in fast access to the cached, later sectors. A large AU and smart buffering in hardware and software help ensure more data for each possible seek. If the disk is not defragged, and especially if it has a low AU, you may have a lot of seeks in a sequential read or write.
I didn't know seeks had gotten that fast. Interesting. But somebody please tell me what AU is.
I have finished my testing and found a broad sweet spot that's about twice as fast as sort's defaults and about half as fast as just copying the data twice -- so I guess it's about as good as it's going to get. I'm going to go back to my project, but first I'll share what I've learned about GNU sort.
The default settings are to merge in batches of 16, not use extra cores, and to sort (pass 1) with the largest possible buffer for the given physical memory. On my 32 GB 64-bit machine, this takes 76014 seconds. Using cp to copy the file to the temp directory and then to the result takes 23808 seconds. That's 21.1 hours and 6.6 hours, respectively. I was unhappy with the 21 hours, but not really expecting to get down to 6.
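For reference, the two baseline runs look roughly like this (file and directory names are hypothetical):
Code:
# Default GNU sort: buffer sized from physical memory, merges in batches of 16.
time sort -T /big/tmpdir huge_input.txt -o sorted_output.txt

# "Copy twice": the data has to travel into the temp directory and back out
# again, so two cp's approximate the minimum I/O cost of any external sort.
time sh -c 'cp huge_input.txt /big/tmpdir/pass1.tmp && cp /big/tmpdir/pass1.tmp output_copy.txt'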
The buffer-size parameter you give to sort establishes the memory used for the phase 1 sort, but it results in temporary files of about 48% of that size. I timed the sort with 5 different sets of parameters, each time choosing a batch-size that allowed merging all of the temporaries in one pass. For the largest batch, this required raising the soft limit on open files to the hard limit of 4096 using the bash command 'ulimit -nS hard'. The tests cover the range of possible combinations, with buffer size limited by memory, and batch size limited by a hard limit on open files. The results clearly show that even though all tests merged the temporary files in one pass, approaching either limit resulted in significant time penalties.
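In script form, picking a one-pass batch size from the ~48% rule and raising the open-file limit looks something like this (sizes and paths are hypothetical):
Code:
# Temporaries come out at roughly 48% of --buffer-size, so a one-pass merge
# needs a batch size of about ceil(input_size / (0.48 * buffer_size)).
INPUT=huge_input.txt
BUFFER_GB=8
BATCH=$(awk -v in_bytes="$(stat -c %s "$INPUT")" -v buf_gb="$BUFFER_GB" \
        'BEGIN { printf "%d\n", int(in_bytes / (0.48 * buf_gb * 2^30)) + 1 }')

ulimit -nS hard    # raise the soft open-file limit to the hard limit (4096 here)

sort -S ${BUFFER_GB}g --batch-size="$BATCH" -T /big/tmpdir "$INPUT" -o sorted_output.txt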
Going from largest to smallest buffers, here are the buffer-size, batch-size, and time. Starting with the third run, I recorded the actual number of temporaries and am reporting that as the batch size; the parameter I actually passed was somewhat higher because it was an estimate.
Quote:
buffer-size   batch-size   time
default       135          65556 s (18.2 hours)   (the default corresponds to a parameter of ~23g)
11g           320          42605 s (11.8 hours)
8g            342          40454 s (11.2 hours)
5g            546          43172 s (12.0 hours)
510m          3947         70525 s (19.6 hours)
So 8g seems to be the sweet spot. I tried it with --parallel=4 and actually got a slight slowdown, so I guess compute speed is not the bottleneck at the sweet spot and thread management is not worth the effort. Away from the sweet spot, I had seen a speedup with 4 threads.
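Put together, the sweet-spot invocation is roughly the following (paths are hypothetical, and the batch-size parameter is set comfortably above the ~342 temporaries that the 8g buffer actually produced):
Code:
ulimit -nS hard    # allow enough open files for a one-pass merge
sort -S 8g --batch-size=400 -T /big/tmpdir huge_input.txt -o sorted_output.txt

# The threaded variant was slightly slower at the sweet spot:
sort -S 8g --batch-size=400 --parallel=4 -T /big/tmpdir huge_input.txt -o sorted_output.txt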