Top Forums > UNIX for Advanced & Expert Users > Help optimizing sort of large files
Post 302925646 by gandolf989, Tuesday 18 November 2014, 11:38 AM
If you look at the first 10 characters of each row in your files and create a file for every possible combination of characters in those first 10 positions, you would have 59,049 files, if my math is correct. You can then write each row from each input file into its corresponding bucket file. That way, when you sort each bucket, its contents already agree on the first 10 characters, so the remaining sort is smaller and easy to run in parallel. Since you would not want multiple processes appending lines to the same file at the same time, the initial bucket pass should probably be done by a single process. After that, each of the 59,049 files can be sorted by as many processes as you want, and each file will be much smaller than what you have now. A rough sketch of the idea is below.
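Something like this (untested) shows the bucket pass and the parallel sort. It assumes GNU awk, sort and xargs, input records one per line in data/*.dat, and prefix characters that are safe to use as file names; those names are only placeholders for your real layout.

mkdir -p buckets sorted

# Pass 1: a single writer distributes every line into a bucket named after
# its first 10 characters, so no two processes append to the same file.
# With ~59,000 buckets you may need to raise "ulimit -n" or close() files
# periodically inside awk.
awk '{ print >> ("buckets/" substr($0, 1, 10)) }' data/*.dat

# Pass 2: sort the much smaller buckets in parallel, 8 at a time.
# A fixed locale keeps the in-bucket order consistent with bucket-name order.
ls buckets | xargs -P 8 -I {} sh -c 'LC_ALL=C sort "buckets/{}" > "sorted/{}"'

# Because each bucket name is the 10-character prefix of its contents,
# concatenating the sorted buckets in name order yields the final result.
for f in sorted/*; do cat "$f"; done > final_sorted.txt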

You can also break the buckets up by using the first three characters as directory names; that way each directory three levels down would hold 1/9th of the total files, and it also becomes easier to bucket on more than 10 characters. I have attached a file with each possible combination of characters for the first 10 positions; a sketch of the directory layout follows. Keep in mind that this approach assumes a reasonably even distribution of data across the first 10 characters.
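If the attached list of prefixes is, say, prefixes.txt (one 10-character prefix per line; the name is only a placeholder for whatever the attachment is called), the directory tree can be created up front and the bucket pass pointed into it, roughly like this:

# Build the three-level tree from the first three characters of every prefix.
cut -c1-3 prefixes.txt | sed 's|.|&/|g; s|/$||' | sort -u |
while read d; do mkdir -p "buckets/$d"; done

# Bucket each line into <char1>/<char2>/<char3>/<characters 4-10>.
awk '{ print >> ("buckets/" substr($0,1,1) "/" substr($0,2,1) "/" substr($0,3,1) "/" substr($0,4,7)) }' data/*.dat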

I would still load the data into a database, though...
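For completeness, a rough sketch of the database route using SQLite; the table and file names are placeholders, and it assumes the records contain no '|' characters (the sqlite3 shell's default .import separator).

sqlite3 bigsort.db <<'EOF'
CREATE TABLE records (line TEXT);
.import data.txt records
.output sorted.txt
SELECT line FROM records ORDER BY line;
EOF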
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Large files

I am trying to understand the webserver log file for an error which has occurred on my live web site. The webserver access file is very big, so it's not possible to open it in the vi editor. I know the approximate time the error occurred, so I am interested in looking for the log file... (4 Replies)
Discussion started by: sehgalniraj

2. Shell Programming and Scripting

Large Text Files

Hi All, I have approximately 10 files that are at least 100+ MB in size. I am importing them into a DB to output them to the web. What I need to do first is clean the files up so I don't have unnecessary rows in the DB. Below is what the file looks like: Ignore the <TAB> annotations as that... (4 Replies)
Discussion started by: caddyjoe77

3. UNIX for Dummies Questions & Answers

large files?

How do we check whether 'large files' support is enabled on a Unix box -- HP-UX B11.11 (2 Replies)
Discussion started by: ranj@chn

4. UNIX for Dummies Questions & Answers

Sort large file

I was wondering how sort works. Do file size and time to sort increase geometrically? I have a 5.3 billion line file I'd like to use with sort -u, and I'm wondering if that'll take forever because of a geometric expansion. If it takes 100 hours that's fine, but not 100 days. Thanks so much. (2 Replies)
Discussion started by: dcfargo

5. Shell Programming and Scripting

a problem with large files

Hello all, I kindly need your help. I made a script to print specific lines from a huge file of about 3 million lines. The output of the script will be about 700,000 lines... the problem is that the script is too slow... it kept working for 5 days and the output was only 200,000 lines!!! The script is... (16 Replies)
Discussion started by: m_wassal

6. Shell Programming and Scripting

Divide large data files into smaller files

Hello everyone! I have 2 types of files in the following format: 1) *.fa >1234 ...some text... >2345 ...some text... >3456 ...some text... . . . . 2) *.info >1234 (7 Replies)
Discussion started by: ad23

7. UNIX for Dummies Questions & Answers

Speeding/Optimizing GREP search on CSV files

Hi all, I have a problem with searching hundreds of CSV files; the search is taking too long (over 5 min). The CSV files are "," delimited and have 30 fields per line, but I always grep the same 4 fields - so is there a way to grep just those 4 fields to speed up the search? Example:... (11 Replies)
Discussion started by: Whit3H0rse

8. Solaris

How to safely copy full filesystems with large files (10Gb files)

Hello everyone. I need some help copying a filesystem. The situation is this: I have an Oracle DB mounted on /u01 and need to copy it to /u02. /u01 is 500 GB and /u02 is 300 GB. The space used on /u01 is 187 GB. This is running on Solaris 9 and both filesystems are UFS. I have tried to do it using:... (14 Replies)
Discussion started by: dragonov7

9. UNIX for Advanced & Expert Users

Script to sort the files and append the extension .sort to the sorted version of the file

Hello all - I am new to this forum and fairly new to UNIX, and I am having some difficulty preparing a small shell script. I am trying to write a script to sort all the files given by the user as input (either the exact full name of the file or, say, the files matching criteria like all files... (3 Replies)
Discussion started by: pankaj80

10. Shell Programming and Scripting

Script to sort large file with frequency

Hello, I have a very large file of around 2 million records which has the following structure: I have used the standard awk program to sort: # wordfreq.awk --- print list of word frequencies { # remove punctuation #gsub(/_]/, "", $0) for (i = 1; i <= NF; i++) freq[$i]++ } END { for (word... (3 Replies)
Discussion started by: gimley
LOOK(1)                   BSD General Commands Manual                  LOOK(1)

NAME
     look -- display lines beginning with a given string

SYNOPSIS
     look [-bdf] [-t termchar] string [file ...]

DESCRIPTION
     The look utility displays any lines in file which contain string as a
     prefix. If file is not specified, the file /usr/share/dict/words is
     used, only alphanumeric characters are compared and the case of
     alphabetic characters is ignored.

     The following options are available:

     -b      Use a binary search on the given word list. If you are ignoring
             case with -f or ignoring non-alphanumeric characters with -d,
             the file must be sorted in the same way. Please note that these
             options are the default if no filename is given. See sort(1)
             for more information on sorting files.

     -d      Dictionary character set and order, i.e., only alphanumeric
             characters are compared.

     -f      Ignore the case of alphabetic characters.

     -t      Specify a string termination character, i.e., only the
             characters in string up to and including the first occurrence
             of termchar are compared.

ENVIRONMENT
     The LANG, LC_ALL and LC_CTYPE environment variables affect the execution
     of the look utility. Their effect is described in environ(7).

FILES
     /usr/share/dict/words   the dictionary

EXIT STATUS
     The look utility exits 0 if one or more lines were found and displayed,
     1 if no lines were found, and >1 if an error occurred.

COMPATIBILITY
     The original manual page stated that tabs and blank characters
     participated in comparisons when the -d option was specified. This was
     incorrect and the current man page matches the historic implementation.
     look uses a linear search by default instead of a binary search, which
     is what most other implementations use by default.

SEE ALSO
     grep(1), sort(1)

HISTORY
     A look utility appeared in Version 7 AT&T UNIX.

BUGS
     Lines are not compared according to the current locale's collating
     order. Input files must be sorted with LC_COLLATE set to 'C'.

BSD                             July 17, 2004                              BSD
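In the context of this thread, look is handy once a large file has been sorted: with -b it binary-searches the sorted file instead of scanning it linearly. A quick illustration (file names and the "ABC" prefix are placeholders), keeping the C collation the BUGS section asks for:

# sort once with C collation, then prefix-search by binary search
LC_COLLATE=C sort -o records.sorted records.txt
look -b ABC records.sorted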