Sort a big data file


 
# 1  
Old 10-17-2010

Hello,
I have a big data file (160 MB) full of records whose fields are pipe (|) delimited. I am sorting the file on the first field.
I am using the "sort" command, and it takes 6 minutes.
I have tried some transformation methods in Perl, but they fail with "Out of memory". Is there any way (Perl or Unix shell script) to sort a big data file faster?
Thanks,
bye.
# 2  
Old 10-17-2010
sort is usually pretty good, but its speed depends on file I/O speed, especially in its temp (-T) directory. If the data were in many files, they could be sorted separately in parallel and merged with sort -m, possibly directly through named pipes. ksh (and bash) can manage the pipes for you with process substitution, which on most UNIX systems is backed by the /dev/fd/# pseudo-file descriptor devices:

Code:
sort -m <( sort file1 ) <( sort file2 )

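or you can create the named pipes yourself with mkfifo named_pipe_path (on older systems, /usr/sbin/mknod named_pipe_path p). Any additional options of your sort go on all the sorts, including the merge. This way, there is no delay writing intermediate files. It might also work to assign a different line number range of the one big file to each sort; since the sort sub-scripts read in parallel, the cost of selecting line number ranges is reduced. For example, one of the sorts could take lines 20001 through 40000: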
Code:
sed '
    1,20000d
    40000q
' your_file | sort

You can estimate the line count to be divided up by dividing the file's byte size by an assumed average line length, shooting high so that the ranges are sure to cover the whole file.
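
The estimate and the line-range split can be sketched roughly as below. This is only an illustration under some assumptions: the function names and the avg_len guess are made up for the example, mktemp -d is assumed to be available, and it writes intermediate files for portability instead of using pipes, so it gives up the "no intermediate files" advantage described above.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many lines each of PARTS sorts should take.
# avg_len is a guessed average record length in bytes; guessing low
# overestimates the line count, which is the safe direction.
estimate_chunk_lines() {
    file=$1 parts=$2 avg_len=$3
    est_lines=$(( $(wc -c < "$file") / avg_len + 1 ))
    # Round up so that parts * chunk covers every estimated line.
    echo $(( (est_lines + parts - 1) / parts ))
}

# Hypothetical helper: sort FILE as PARTS line ranges in parallel, then merge
# the results to stdout. Any further arguments go on every sort and the merge.
sort_by_ranges() {
    file=$1 parts=$2 avg_len=$3
    shift 3
    chunk=$(estimate_chunk_lines "$file" "$parts" "$avg_len")
    dir=$(mktemp -d) || return 1
    i=0
    while [ "$i" -lt "$parts" ]; do
        first=$(( i * chunk + 1 ))
        last=$(( first + chunk - 1 ))
        if [ $(( i + 1 )) -eq "$parts" ]; then
            # The last range runs to EOF in case the estimate was too low.
            sed -n -e "${first},\$p" "$file" | sort "$@" > "$dir/part.$i" &
        else
            # Print the range, then quit so sed does not read the rest.
            sed -n -e "${first},${last}p" -e "${last}q" "$file" |
                sort "$@" > "$dir/part.$i" &
        fi
        i=$(( i + 1 ))
    done
    wait    # let every background sort finish before merging
    sort -m "$@" "$dir"/part.*
    rm -rf "$dir"
}
```

Something like sort_by_ranges your_file 4 80 -t '|' -k1,1 > sorted_file would then run four sorts on the first field, assuming records average about 80 bytes.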

There are some exotic options to sort, but they are not usually recommended.

---------- Post updated at 09:50 AM ---------- Previous update was at 09:28 AM ----------

Another way to divide the data evenly among N sorts is my tool xdemux. It callocs an array of $1 FILE*, does a popen() of $2 for writing to load each of the $1 cells of that array, and then reads stdin byte by byte (no line length concerns or extra copying), sending the lines down the pipes in rotation. At EOF it does fclose() on the pipes so it does not wait for child status. In your case this would be
Code:
mkfifo /tmp/p.$$
xdemux 5 "sort your_args -o /tmp/p.$$" <your_file &
sort your_args -m /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$
rm -f /tmp/p.$$

A named pipe connects the next open() for writing to a process that is waiting, blocked in open() for reading, not vice versa, so one named pipe can do for all. Here is xdemux.c; I am not sure it is the latest version as described above, but it is definitely close:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void usage(){

	fputs(
"\n"
"Usage: xdemux <ct> <cmd> [ -l <line_ct> ]\n"
"\n"
"Runs <ct> copies of <cmd> and sends <line_ct> (default 1) lines to each\n"
"in rotation.\n"
"\n",
		stderr );
	exit( 1 );
 }

int main( int argc, char ** argv ){

	FILE **fp = NULL ;
	int i, x, c, l, lct = 1 ;

	if ( argc < 3
	  || 2 > ( x = atoi( argv[1] ))){
		usage();
	 }

	if ( argc > 3
	  && ( argc != 5
	    || strcmp( argv[3], "-l" )
	    || 1 > ( lct = atoi( argv[4] )))){
		usage();
	 }
		
	if ( !( fp = (FILE **)calloc( x, sizeof (FILE *)))){
		perror( "calloc()" );
		exit( 2 );
		}

	for ( i = 0 ; i < x ; i++ ){
		if ( !( fp[i] = popen( argv[2], "w" ))){
			perror( "popen( $2 )" );
			exit( 3 );
			}
		}

	i = l = 0 ;

	while ( EOF != ( c = getchar())){
		if ( EOF == putc( c, fp[i] )){
			perror( "putc( popen( $2 ))" );
			exit( 4 );
			}
		if ( c == '\n'
		  && ++l == lct ){
			l = 0 ;
			if ( ++i == x ){
				i = 0 ;
			 }
		 }
	 }

	if ( ferror( stdin )){
		perror( "stdin" );
		exit( 5 );
		}

	for ( i = 0 ; i < x ; i++ ){
		if ( 0 > fclose( fp[i] )){
			perror( "fclose( popen( $2 ))" );
			}
		}

	exit( 0 );
}

You can enable a variety of power user excesses with xdemux!
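
If you cannot compile xdemux, the same round-robin idea can be approximated with awk, which can hold several output pipes open at once. This is a rough sketch, not a replacement for the tool: demux_sort is a made-up name, the sort key (-t '|' -k1,1) is hard-wired for the original poster's file, the temp directory path is assumed to contain no spaces, and it merges from intermediate files rather than a named pipe.

```shell
#!/bin/sh
# Hypothetical helper: deal the lines of FILE out to PARTS sorts in rotation,
# then merge the sorted parts to stdout.
demux_sort() {
    file=$1 parts=$2
    [ -s "$file" ] || return 0          # nothing to do for an empty file
    dir=$(mktemp -d) || return 1
    awk -v n="$parts" -v dir="$dir" '
        {
            # Each distinct command string is its own pipe, so awk ends up
            # with at most n sorts running, one per slot. awk closes the
            # pipes and waits for the sorts when it exits.
            cmd = "sort -t \"|\" -k1,1 -o " dir "/part." (NR % n)
            print | cmd
        }
    ' "$file"
    sort -m -t '|' -k1,1 "$dir"/part.*
    rm -rf "$dir"
}
```

For example, demux_sort your_file 5 > sorted_file feeds five sorts, one line to each in turn.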

Last edited by DGPickett; 10-20-2010 at 08:46 PM.. Reason: A later version of the xdemux code.
# 3  
Old 10-18-2010
It always helps to know what Operating System you have and to see the command you typed.
In this case we'd also need to know the amount of memory you can devote to this "sort".

The biggest single improvement to the unix "sort" command is usually to give it more memory at the outset with the "-y kmem" parameter (GNU sort spells this "-S size") and to put temporary files (-T parameter) on a fast disc with at least twice as much free space as the size of the original file.
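
As a sketch of what such a command might look like for the file described in the first post, assuming GNU sort (so -S rather than -y): the function name, the buffer size and the temp directory below are placeholders to adjust for your system, not recommendations.

```shell
#!/bin/sh
# Hypothetical wrapper: sort a pipe-delimited file on its first field,
# with an explicit memory buffer and temp directory.
#   $1 input file, $2 output file, $3 temp directory, $4 buffer size
sort_first_field() {
    in=$1 out=$2 tmpdir=$3 mem=$4
    # -t '|'   fields are pipe delimited
    # -k1,1    the key is the first field only
    # -S       GNU sort: size of the main memory buffer
    # -T       directory for temporary files (ideally on a fast disc)
    # -o       write the result to a file instead of stdout
    sort -t '|' -k1,1 -S "$mem" -T "$tmpdir" -o "$out" "$in"
}
```

A call might look like sort_first_field your_file sorted_file /fast/tmp 512M, where /fast/tmp stands for whatever fast file system you have with enough free space.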