You don't have to fit the whole file in the SSD. I think his suggestion amounted to what I said earlier -- 'sort the largest chunks that will fit in RAM'. That could be done on an SSD.
If we're back to using ordinary sort without bothering to tune it for multiprocessing or external sorting, or even using that high-performance SSD at all, we are firmly back in "something for nothing" territory. Have fun.
I am. What would not be fun is rewriting the merge section of GNU sort to make special use of an SSD, and verifying that it is correct and robust. I don't have a spare SSD anyway, so it wouldn't help. My current SSD is about full, though it does have a 32 GB swap area.
I do have more test results. I finished another sort of my large file, using a batch-size parameter large enough to merge all of the temporaries at once. That meant increasing the default of 16 to at least 117; I chose 135. The default sort took 21.1 hours; the tweaked sort took 18.2, about a 14% improvement. I'll also check the effect of making smaller temporaries (to reduce the in-RAM merge footprint) and of widening the file merge further. That will require working out the relationship between the parameters and the temporary size -- I already know they aren't identical; the buffer size I request appears to be the size of some internal structure, not the size of the temporary.
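For anyone who wants to try the same tweak, this is roughly what the invocation looks like. The input here is a small stand-in I generate on the spot, and the sizes are placeholders, not tuning advice; `--batch-size` and `-S` (`--buffer-size`) are the standard GNU sort options I'm talking about.

```shell
# Small stand-in for the real multi-hundred-GB input file.
seq 300000 -1 1 > bigfile.txt

# Widen the merge: --batch-size raises the merge fan-in above the default
# of 16 so every temporary can be merged in a single pass; -S caps the
# size of each in-RAM run (and so, roughly, of each temporary); -T picks
# the scratch directory. All values here are placeholders.
sort -S 1M --batch-size=135 -T /tmp bigfile.txt > bigfile.sorted
```

With `-S 1M` on this demo input, sort writes a handful of temporaries and then merges them in one pass, which is the shape of the 135-way merge described above.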
Of the 18.2 hours that sort took, 8.1 were user time (doing the compares, presumably) and 2.2 were system time (making I/O calls, managing buffers, handling TLB misses, page faults, and such). The remainder, another 8 hours or so, is about double the unaccounted time in a pure copy. I suspect the difference is extra time spent on disk head seeks, which a copy of unfragmented files doesn't need. I count that as I/O time.
Without an SSD big enough to hold all the temporaries, I don't see how to reduce that I/O time. The suggestions I've seen haven't painted a coherent picture of how that would go with less I/O than I have now. The first stage of the merge sort already runs in RAM and reads and writes sequentially on separate spindles. It's just the file merges that do the seeking, and they involve all of the data. How are you going to do that without seeking on a drive large enough to hold it all? Nobody needs to answer that, because I don't even have the smaller SSD.
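The separate-spindles part is just GNU sort's own `-T` option, which may be given more than once; sort then rotates its temporaries through the listed directories. A minimal sketch, with local directories standing in for mount points on different drives:

```shell
# Stand-in for the big input file.
seq 300000 -1 1 > bigfile.txt

# In practice these would be mount points on different spindles.
mkdir -p tmp_a tmp_b

# -T may be repeated; sort alternates temporaries among the directories,
# so the first-pass runs can land on drives other than the one holding
# the input and output. Paths and sizes are placeholders.
sort -S 1M -T tmp_a -T tmp_b bigfile.txt > bigfile.sorted
```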
My data is not at all uniform; someone had surmised that it was while contemplating a radix sort. Instead, it clusters like mad, in ways that vary between datasets. So the sort buckets would be of unpredictable sizes and would require space allocation. I'm also afraid it would be hard to implement in two passes over the data -- which I think is required if it's going to beat GNU sort -- and so it could end up with much the same seeking behavior.
When you think about this, consider that my largest drives are 2 TB and my input data occupies more than half that space. And this is not the largest dataset I'm going to have (I'm just starting the main project). Any approach that writes a bunch of intermediate collections of data is going to be spread all over the disk, whether it's in one file or many. It's gonna seek.
---------- Post updated 11-25-14 at 09:49 AM ---------- Previous update was 11-24-14 at 08:21 PM ----------
And in a victory of data over speculation, I've pretty much convinced myself that attempting to create longer temporaries by sorting more stuff in RAM, let alone SSD, is counterproductive. The new results come from changing the parameters from simply
to
This results in temporaries about half the size, and an improvement in overall time from 18.2 hours to 11.8, just from using shorter temporaries (and increasing the batch size so they're all merged at once).
The user time did not change that much, though it did decrease from 8.7 to 7.6 hours. System time stayed about the same. I don't know why the unaccounted real time decreased by about 6.5 hours. Since it contradicts my intuition, I'll run a few more tests with other parameters to verify the trend.
If the trend holds up, I'll be decreasing the temporary size as much as possible, and making corresponding increases in batch size, until I reach a limit or a sweet spot.
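In case anyone wants to reproduce the sweep, this is the shape of it. The input, the `-S` sizes, and the batch value are all placeholders for my real runs; the timing is done with `date +%s` so it works in any POSIX shell.

```shell
# Stand-in for the real input file.
seq 300000 -1 1 > bigfile.txt

# Shrink the run size (-S) step by step, keeping --batch-size high enough
# that all the temporaries are still merged in a single pass, and report
# the elapsed time for each setting.
for bufsize in 4M 2M 1M; do
    start=$(date +%s)
    sort -S "$bufsize" --batch-size=200 -o "sorted.$bufsize" bigfile.txt
    echo "buffer=$bufsize  elapsed=$(( $(date +%s) - start ))s"
done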