Sort a big data file

10-17-2010

Registered User

6, 0

Join Date: Oct 2010

Last Activity: 21 April 2012, 8:26 AM EDT

Posts: 6

Thanks Given: 0

Thanked 0 Times in 0 Posts

Sort a big data file

Hello,
I have a big data file (160 MB) full of records with pipe(|) delimited those fields. I`m sorting the file on the first field.
I'm trying to sort with "sort" command and it brings me 6 minutes.
I have tried with some transformation methods in perl but it results "Out of memory". I was wondering to find any way (perl or unix shell script) to perform the fastest sort method of a big data file?.
Thanks,
bye.

rubber08

View Public Profile for rubber08

Find all posts by rubber08

10-17-2010

Registered User

4,673, 588

Join Date: Oct 2010

Last Activity: 1 February 2016, 3:35 PM EST

Location: Southern NJ, USA (Nord)

Posts: 4,673

Thanks Given: 8

Thanked 588 Times in 561 Posts

The sort is usually pretty good, but depends on file speed, especially in its temp -T dir. If the file was many files, they could be sorted separately in parallel and merged with sort -m, possibly directly with named pipes. The named piped can be managed by the ksh on UNIX's with /dev/fd/0-# pseudo file file descriptor devices:

Code:

sort -m <( sort file1 ) <( sort -m file2 )

or you can /usr/sbin/mknod -p named_pipe_path. Any additional options of your sort go on all sorts. This way, there is no delay writing intermediate files. It might work to assign different line number ranges to each, and since the sort sub-scripts are reading in parallel, the cost of selecting line # ranges is reduced.

Code:

sed '
   1,20000d
    40000q
 ' | sort

You can estimate the line count to be divided, shooting high, by a factor divided into the file byte size.

There are some exotic options to sort, but they are not usually recommended.

---------- Post updated at 09:50 AM ---------- Previous update was at 09:28 AM ----------

Another way to divide the data evenly to N sorts is my tool xdemux, which calloc's an array of $1 FILE*, does a popen() of $2 to write to lead all $1 cells of that array, and then reads stdin byte by byte (no line length concerns or extra copying) sending the lines down the pipes in rotation, and at EOF does fclose on the pipes so it does not wait for child status. In your case this would be

Code:

mknod -p /tmp/p.$$
xdemux 5 "sort your_args -o /tmp/p.$$" <your_file &
sort your_args -m /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$
rm -f /tmp/p.$$

A named pipe connects the next open() to write to a waiting, blocked on open() to read, not vice versa, so one named pipe can do for all. Here is xdemux.c, not sure if it is the latest as described above, but definitely close:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

static void usage(){

	fputs(
"\n"
"Usage: xdemux <ct> <cmd> [ -l <line_ct> ]\n"
"\n"
"Runs <ct> copies of <cmd> and sends <line_ct> (default 1) lines to each\n"
"in rotation.\n"
"\n",
		stderr );
	exit( 1 );
 }

int main( int argc, char ** argv ){

	FILE **fp = NULL ;
	int i, x, c, l, lct = 1 ;

	if ( argc < 3
	  || 2 > ( x = atoi( argv[1] ))){
		usage();
	 }

	if ( argc > 3
	  && ( argc != 5
            || strcmp( argv[3], "-l" )
	    || 1 > ( lct = atoi( argv[4] )))){
		usage();
	 }
		
	if ( !( fp = (FILE **)calloc( x, sizeof (FILE *)))){
		perror( "calloc()" );
		exit( 2 );
		}

	for ( i = 0 ; i < x ; i++ ){
		if ( !( fp[i] = popen( argv[2], "w" ))){
			perror( "popen( $2 )" );
			exit( 3 );
			}
		}

	i = l = 0 ;

	while ( EOF != ( c = getchar())){
		if ( EOF == putc( c, fp[i] )){
			perror( "putc( popen( $2 ))" );
			exit( 4 );
			}
		if ( c == '\n'
		  && ++l == lct ){
			l = 0 ;
			if ( ++i == x ){
				i = 0 ;
			 }
		 }
	 }

	if ( ferror( stdin )){
		perror( "stdin" );
		exit( 5 );
		}

	for ( i = 0 ; i < x ; i++ ){
                if ( 0 > fclose( fp[i] )){
			perror( "fclose( popen( $2 ))" );
			}
		}

	exit( 0 );
}

You can enable a variety of power user excesses with xdemux!

Last edited by DGPickett; 10-20-2010 at 08:46 PM.. Reason: A later version of the xdemux code.

This User Gave Thanks to DGPickett For This Post:

DGPickett

View Public Profile for DGPickett

Find all posts by DGPickett

10-18-2010

Registered User

6,402, 678

Join Date: Mar 2008

Last Activity: 8 June 2016, 9:58 PM EDT

Posts: 6,402

Thanks Given: 288

Thanked 678 Times in 647 Posts

It always helps to know what Operating System you have and to see the command you typed.
In this case we'd also need to know the amount of memory you can devote to this "sort".

The biggest single improvement to the unix "sort" command is usually to give it more memory at the outset with the "-y kmem" parameter and to put temporary files (-T parameter) on a fast disc with at least twice as much free space as the size of the original file.

methyl

View Public Profile for methyl

Find all posts by methyl

Shell Programming and Scripting

Sort a big data file

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Sort file data according to a custom list of string

Discussion started by: mohtashims

2. Shell Programming and Scripting

Sort data in text file in particular format

Discussion started by: Adfire

3. Shell Programming and Scripting

Sort data file by case

Discussion started by: palex

4. UNIX for Advanced & Expert Users

Sort mixed data file

Discussion started by: jnrohit2k

5. Shell Programming and Scripting

Advanced: Sort, count data in column, append file name

Discussion started by: JamesT

6. Shell Programming and Scripting

parsing data from a big file using keys from another smaller file

Discussion started by: Lucky Ali

7. UNIX for Dummies Questions & Answers

How big is too big a config.log file?

Discussion started by: NeedLotsofHelp

8. Shell Programming and Scripting

How to cut some data from big file

Discussion started by: almanto

9. Shell Programming and Scripting

Big data file - sed/grep/awk?

Discussion started by: dlam

10. UNIX for Dummies Questions & Answers

How to view a big file(143M big)

Discussion started by: chenhao_no1