UNIX for Advanced & Expert Users: Help optimizing sort of large files
Post 302925093 by kogorman3, Friday 14 November 2014
Some results

Thanks to all for your comments. I was asking for ways to tune UNIX sort because, although I know how I would write my own external sort, I'm unwilling to do that for this project -- I'd likely be mired in bugs for too long.

I did some timing tests, and although bugs are going to force me to repeat them, I have some rough results. These are on a 14GB test file with records of 64 bytes plus a newline.

First, I quickly abandoned the idea of having sort compress its temporary files. Using gzip, even with --fast, costs 10% to 20% in speed. Using --best is far worse, slowing the run by roughly a factor of 15 (around 1500%).
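For anyone who wants to repeat the compression test: GNU sort's --compress-program option expects a program that compresses standard input when called with no arguments and decompresses when called with -d, so feeding --fast to gzip takes a tiny wrapper. A rough, untested sketch (the wrapper path and file names are just placeholders):

    #!/bin/sh
    # /tmp/gzfast: wrapper so sort can use "gzip --fast" on its temporaries.
    # sort invokes it with no arguments to compress and with -d to decompress.
    if [ "$1" = "-d" ]; then
        exec gzip -d
    else
        exec gzip --fast
    fi

    # then, for the compressed-temporaries run:
    sort --compress-program=/tmp/gzfast -T /tmp bigfile > bigfile.sorted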

Second, increasing the batch-size parameter by 10 costs about 10% in speed until the increase is enough to eliminate a merge pass. At that point it turns into a benefit of about 30%. That's the sweet spot, because the run time starts climbing again if you enlarge the batch even further.
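For concreteness, the knob in question is GNU sort's --batch-size, which I believe defaults to 16 merge inputs (the file and directory names below are placeholders):

    # merge up to 26 temporaries at once instead of the default 16
    sort --batch-size=26 -T /scratch bigfile > bigfile.sorted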

Third, increasing the --parallel parameter is a win if you have multiple cores. Not a huge one: about 10% for each doubling, whether from 1 to 2 or from 2 to 4.
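The comparison I ran was roughly of this shape (the paths and the other options here are only illustrative):

    # same sort, only the thread count varies
    sort --parallel=1 -S 1g -T /scratch bigfile > bigfile.sorted
    sort --parallel=2 -S 1g -T /scratch bigfile > bigfile.sorted
    sort --parallel=4 -S 1g -T /scratch bigfile > bigfile.sorted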

Finally, raising the buffer-size parameter from 1g to 11g was a big loss -- it roughly doubled the execution time. I don't know whether there's a sweet spot in between; I'll have to do finer-grained testing, or test on a larger input. I suspect a large buffer only pays off when it's the only way to reduce the number of merge passes.
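For reference, buffer-size here means sort's -S / --buffer-size option; the two settings compared were of this form (other options illustrative):

    # a 1g buffer was much faster than 11g on this input
    sort -S 1g  --parallel=4 -T /scratch bigfile > bigfile.sorted
    sort -S 11g --parallel=4 -T /scratch bigfile > bigfile.sorted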

So, to a first approximation, it's best to raise --parallel and keep buffer-size and batch-size small, so long as (buffer-size * 0.4) * batch-size is at least as large as the input file. That guarantees the minimum of two passes through the data. The 0.4 reflects the observation that the temporaries come out a bit smaller than half the requested buffer-size, at least on my data.
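A rough shell sketch of that rule applied to this 14GB input (the 0.4 factor is only what I observed here, and the file names are placeholders):

    # choose --batch-size just large enough that the initial runs merge
    # in a single pass:  batch-size >= input-size / (0.4 * buffer-size)
    INPUT=bigfile
    BUF=$((1024 * 1024 * 1024))                         # matches -S 1g
    SIZE=$(stat -c %s "$INPUT")                         # about 14GB here
    BATCH=$(( (SIZE * 10 + 4 * BUF - 1) / (4 * BUF) ))  # ceiling of SIZE / (0.4 * BUF)
    echo "need --batch-size of at least $BATCH"         # low-to-mid 30s for this file
    sort -S 1g --batch-size="$BATCH" --parallel=4 -T /scratch "$INPUT" > "$INPUT.sorted"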

The additional testing may give me some information about how to balance buffer-size and batch-size subject to the above formula.
 
