Help optimizing sort of large files
Post 302926511 by kogorman3 on Tuesday 25th of November 2014, 12:49 PM
Quote:
Originally Posted by Corona688
You don't have to fit the whole file in the SSD. I think his suggestion amounted to what I said earlier -- 'sort the largest chunks that will fit in RAM'. That could be done on an SSD.

If we're back to using ordinary sort without bothering to tune it for multiprocessing or external sorting, or even using that high-performance SSD at all, we are firmly back in "something for nothing" territory. Have fun.
I am. What would not be fun is rewriting the merge phase of GNU sort to make special use of an SSD, and verifying that the result is correct and robust. I don't have a spare SSD anyway, so it wouldn't help; my current SSD is nearly full, though it does have a 32 GB swap area.

I do have more test results. I finished another sort of my large file, using a --batch-size large enough to merge all of the temporaries at once. That meant raising the default of 16 to at least 117; I chose 135. The default sort took 21.1 hours; the tweaked sort took 18.2, about a 14% improvement. I'll also check the effect of making the temporaries smaller (to reduce the in-RAM merge footprint) while widening the file merge some more. That will require working out the relationship between the parameters and the temporary size -- I already know they are not identical; the buffer size I request appears to be the size of some internal structure, not the size of the temporary file.
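For a sense of where the "at least 117" figure comes from: the final merge can finish in a single pass only if --batch-size is at least the number of temporary files produced. A minimal sketch of that arithmetic, with placeholder sizes rather than my actual measurements:
Code:
# Back-of-the-envelope check: one merge pass needs
#   --batch-size >= ceil(input_size / temporary_size)
# The sizes below are hypothetical placeholders.
input_bytes=$((1200 * 1024 * 1024 * 1024))   # ~1.2 TB of input (assumed)
temp_bytes=$((11 * 1024 * 1024 * 1024))      # ~11 GB per temporary (assumed)
ntemps=$(( (input_bytes + temp_bytes - 1) / temp_bytes ))
echo "temporaries: $ntemps  =>  need --batch-size >= $ntemps"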

Of the 18.2 hours that sort took, 8.1 were user time (doing the compares, presumably) and 2.2 were system time (making I/O calls, managing buffers, handling TLB misses, page faults and such). The remainder, another 8 hours or so, is about double the unaccounted time in a pure copy. I suspect the difference is the additional time spent on disk head seeks, which a copy of unfragmented files does not need. I count that as I/O time.
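That breakdown is the kind of thing the shell's time keyword reports; GNU time can also show page-fault counts. A minimal sketch, assuming GNU time is installed at /usr/bin/time and using placeholder paths rather than my exact command line:
Code:
# real - (user + sys) is the "unaccounted" time discussed above,
# which I attribute mostly to waiting on disk I/O and seeks.
time sort --batch-size=135 -T /spindle2/tmp -o /spindle3/big.sorted /spindle1/big.input

# GNU time's verbose mode additionally reports page faults and context switches:
/usr/bin/time -v sort --batch-size=135 -T /spindle2/tmp -o /spindle3/big.sorted /spindle1/big.input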

Without an SSD big enough to hold all the temporaries, I don't see how to reduce that I/O time. The suggestions I've seen have not painted a coherent picture for me of how that would go with less I/O than I have now. The first stage of the merge sort already happens in RAM and reads and writes sequentially on separate spindles. It's just the file merges that do the seeking, and that involves all of the data. How are you going to do that without seeking on a drive large enough to hold it all? Nobody needs to answer that, because I don't even have the smaller SSD.

My data is not at all uniform; someone had surmised that it was while contemplating a radix sort. Instead, it clusters like mad, in ways that vary between datasets. So the sort buckets would be of unpredictable sizes and would require space allocation, and in any case I'm afraid it would be hard to implement in two passes over the data -- which I think is required if it's going to beat GNU sort -- and so it could end up with much the same seeking behavior.

When you think about this, consider that my largest drives are 2 TB and my input data occupies more than half that space. And this is not the largest dataset I'm going to have (I'm just starting the main project). Any approach that writes a bunch of intermediate collections of data is going to be spread all over the disk, whether it's in one file or many. It's gonna seek.

---------- Post updated 11-25-14 at 09:49 AM ---------- Previous update was 11-24-14 at 08:21 PM ----------

And in a victory of data over speculation, I've pretty much convinced myself that attempting to create longer temporaries by sorting more stuff in RAM, let alone SSD, is counterproductive. The new results come from changing the parameters from simply
Code:
export param='--batch-size=135'

to
Code:
export param='--batch-size=320 --buffer-size=11g'

This results in temporaries about half the size, and an improvement in overall time from 18.2 hours to 11.8, just from using shorter temporaries (and increasing the batch size so they're all merged at once).
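For reference, this is roughly how I'd expect a parameter string like that to be applied; the input, output, and temporary-directory paths below are placeholders, since the actual wrapper script isn't shown here:
Code:
export param='--batch-size=320 --buffer-size=11g'
# $param is deliberately left unquoted so the shell splits it into two options.
# -T points the temporaries at a separate spindle from the input and output.
sort $param -T /spindle2/tmp -o /spindle3/big.sorted /spindle1/big.input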

The user time did not change that much, though it did decrease from 8.7 to 7.6 hours. System time stayed about the same. I don't know why the unaccounted real time decreased by about 6.5 hours. Since it contradicts my intuition, I'll run a few more tests with other parameters to verify the trend.

If the trend holds up, I'll be decreasing the temporary size as much as possible, and making corresponding increases in batch size, until I reach a limit or a sweet spot.
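A sketch of the kind of sweep that would find that sweet spot, assuming a placeholder test input and temp directory; the specific buffer and batch sizes are illustrative, not a claim about where the optimum lies:
Code:
#!/bin/bash
# Shrink the per-sort buffer step by step; the batch size is raised in
# tandem so every run still merges all of its temporaries in one pass.
# All sizes and paths are placeholders.
for bufsize in 11g 8g 6g 4g; do
    for batch in 320 480 640; do
        echo "== --buffer-size=$bufsize --batch-size=$batch =="
        /usr/bin/time -f 'real %e s  user %U s  sys %S s' \
            sort --buffer-size="$bufsize" --batch-size="$batch" \
                 -T /spindle2/tmp -o /spindle3/test.sorted /spindle1/test.input
    done
done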
 
