The builtin split function in AWK is too slow Post: 302423125

Post 302423125 by kevintse on Thursday 20th of May 2010 09:39:54 PM

OK, I will explain what I want to achieve.
I have two data files, and I want to generate a result by grouping and sorting the data in them.

Code:

$ cat data1.txt
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5

$ cat data2.txt
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G

desired output:
A:B,C,D,F,H
B:A,C,D,F,H
C:A,B,F,H
D:A,B
F:A,B,C,H
H:A,B,C,F

To explain the output a little: for the row that starts with "A", the items "B,C,D,F,H" are extracted from all the lists that contain "A", and they are sorted by how often they appear, in descending order.
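To make the counting concrete, here is a small sketch that computes the frequencies for a single key. It is only an illustration under my own assumptions: the helper name `co_counts`, the temporary files, and breaking ties by name are not part of my real script.

```shell
#!/bin/sh
# Sample data copied from the post (the repeated rows of data1.txt are left out).
tmp=$(mktemp -d)
printf '%s\n' 'A:list1,list2,list3,list4' 'B:list1,list2,list3,list4' \
    'C:list1,list2,list6' 'D:list3' 'F:list2,list4,list5' 'G:list7' \
    'H:list2,list5' > "$tmp/data1.txt"
printf '%s\n' 'list1:A,B,C' 'list2:A,B,C,F,H' 'list3:A,B,D' 'list4:A,B,F' \
    'list5:H,F' 'list6:C' 'list7:G' > "$tmp/data2.txt"

# Hypothetical helper: co_counts KEY data1 data2
# Prints "count item" for every item that shares a list with KEY.
co_counts() {
    awk -F: -v key="$1" '
        NR == FNR { lists[$1] = $2; next }      # first file: key -> its lists
        { members[$1] = $2 }                    # second file: list -> its members
        END {
            n = split(lists[key], l, ",")
            for (j = 1; j <= n; j++) {
                m = split(members[l[j]], b, ",")
                for (k = 1; k <= m; k++)
                    if (b[k] != key) count[b[k]]++
            }
            for (x in count) print count[x], x
        }' "$2" "$3" | sort -k1,1nr -k2,2       # highest count first, ties by name
}

result=$(co_counts A "$tmp/data1.txt" "$tmp/data2.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

For "A" this prints the counts 4 B, 2 C, 2 F, 1 D, 1 H, one per line, since B shares four lists with A, C and F share two, and D and H share one.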

I have already written code that generates the desired output, but I have noticed that the split function I use in the program is too slow when it has to split a string that contains over 30,000 comma-separated fields. So I am looking for a way to make it faster.
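One idea that might help, assuming the cost comes from splitting the same long string again and again rather than from a single call: split each list once, when its line is read, and cache the fields in an array. This is only a sketch; the array names `n`, `member` and `parts` and the two input lines are made up for illustration, and I have not measured it on the real data.

```shell
#!/bin/sh
# Hypothetical input in the same format as data2.txt.
cached=$(printf '%s\n' 'list1:A,B,C' 'list5:H,F' | awk -F: '
    {
        # Split each line exactly once and keep the pieces.
        n[$1] = split($2, parts, ",")
        for (k = 1; k <= n[$1]; k++)
            member[$1, k] = parts[k]
    }
    END {
        # Later lookups walk the cached fields; split() is not called again.
        for (k = 1; k <= n["list1"]; k++)
            printf "%s%s", member["list1", k], (k < n["list1"] ? " " : "\n")
    }')
printf '%s\n' "$cached"
```

The trade-off is memory: every field is stored as its own array element, so this only pays off if the same lists are looked up many times.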

This is the code I used to generate the output:

Code:

#!/bin/awk -f

BEGIN {
        FS = ":"
        logfile = "make_data_log.txt"
        datecmd = "date +%s"
        datecmd | getline ts1           # run date and read the start time (epoch seconds)
        close(datecmd)
}
NR == FNR { bookLists[$1] = $2; next }  # the first file, e.g. A:list1,list2
{ books[$1] = $2 }                      # the second file, e.g. list1:A,B
END {
        print "There are in total", length(bookLists), "book lists and", length(books), "books." > logfile
        count = 0
        # Per-key pipeline: sort the "key item count" lines by count, then join the items with commas.
        # Note: "for (i in arr)" visits the items in an unspecified order.
        cmd = "sort -k 3 -nr | awk ' { arr[$2]; if(!b) b=$1 } END { for (i in arr) str=str ? str\",\"i : i; print b, str} '"
        for (i in bookLists) {
                split(bookLists[i], tmpBls, /,/)
                for (j in tmpBls) {
                        split(books[tmpBls[j]], tmpBs, /,/)
                        for (k in tmpBs)
                                ++result[tmpBs[k]]      # count how often each item appears
                        delete tmpBs
                }
                for (l in result) {
                        if (i != l) print i, l, result[l] | cmd
                        if (++num == 20) break          # stop after 20 entries per key
                }
                close(cmd)              # close the pipe, or this "sort" will be delayed until the awk program ends
                delete result
                delete tmpBls
                num = 0
                if (++count % 100000 == 0)
                        print count > logfile
        }
        datecmd | getline ts2           # read the end time the same way
        close(datecmd)
        print ts2 - ts1, "seconds consumed." > logfile
}
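Another thing I am considering, though I have not measured it: the script starts a new `sort | awk` pipeline for every key, and that may cost more than split itself. A possible alternative is to print all "key item count" lines to standard output and sort them once at the end. The sketch below uses made-up triples and joins them per key; the variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical "key item count" lines, in the form the END block prints them.
triples='A B 4
B A 4
A C 2
A D 1
B C 2'

joined=$(printf '%s\n' "$triples" |
    sort -k1,1 -k3,3nr -k2,2 |          # group by key, highest count first, ties by name
    awk '
        $1 != key { if (NR > 1) print key ":" str; key = $1; str = "" }
        { str = (str == "" ? $2 : str "," $2) }     # append in sorted order
        END { if (NR > 0) print key ":" str }')
printf '%s\n' "$joined"
```

Because the joining awk appends items as the sorted lines arrive, the order produced by sort is preserved in the output line.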

---------- Post updated at 08:39 PM ---------- Previous update was at 07:23 AM ----------

bump...
Please, experts, how can I make this script faster?
