$ time ./kevin.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C
real 0m1.142s
user 0m0.590s
sys 0m0.580s
$
The problem with that script is that we run a sort command for every book.
The following solution uses only one sort command:
Code:
awk -v Q="'" -F'[:,]' '
NR==FNR {
    list = $1;
    all = "";
    for (i=2; i<=NF; i++) {
        all = all SUBSEP $i;
        books[list, i-1] = $i;
    }
    books[list, "all"]   = all SUBSEP;
    books[list, "count"] = NF-1;
    next;
}
{
    book = $1;
    delete bookCount;
    for (i=2; i<=NF; i++) {
        list = $i;
        if (books[list, "all"] ~ SUBSEP book SUBSEP) {
            for (ib=1; ib<=books[list, "count"]; ib++) {
                bookCount[books[list, ib]]++;
            }
        }
    }
    for (b in bookCount) {
        if (b != book) {
            print book, b, bookCount[b];
        }
    }
}
' kevin2.dat kevin1.dat |
sort -k1,1 -k3,3nr -k2,2 |
awk '
{
    book = $1;
    if (book == prev) {
        out = out "," $2;
    } else {
        if (out) print prev ":" out;
        out = $2;
        prev = book;
    }
}
END { if (out) print prev ":" out; }
'
With the same input files, the output is the same but the times are better:
Code:
$ time ./kevin2.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C
real 0m0.419s
user 0m0.152s
sys 0m0.169s
$
Jean-Pierre.
Hi, aigles
Thank you so much! Your script is much faster than mine.
It helps me a lot, and I have learned many things about AWK from your script (it surprises me that AWK can be written this way).
Thank you!
Hi all!
I am relatively new to UNIX stuff, and I have come across a problem:
I have a big directory, which contains 100 smaller ones. Each of the 100 contains a file ending in .txt, so there are 100 files ending in .txt.
I want to split each of the 100 files in smaller ones, which will contain... (4 Replies)
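The question is cut off, but one common way to do this is a loop over a glob. This is a minimal sketch only: the directory name `bigdir` and the chunk size of 1000 lines are made-up placeholders standing in for the poster's actual layout.

```shell
#!/bin/sh
# Build a toy stand-in for the layout: a big directory containing
# subdirectories, each holding one .txt file.
mkdir -p bigdir/sub1 bigdir/sub2
seq 1 2500 > bigdir/sub1/a.txt
seq 1 500  > bigdir/sub2/b.txt

# Split every .txt file one level down into 1000-line pieces,
# written next to the original (a.part_aa, a.part_ab, ...).
for f in bigdir/*/*.txt; do
    split -l 1000 "$f" "${f%.txt}.part_"
done

ls bigdir/sub1
```

Here `bigdir/sub1/a.txt` (2500 lines) ends up as `a.part_aa` and `a.part_ab` with 1000 lines each, plus `a.part_ac` with the remaining 500.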
$mystring = "name:blk:house::";
print "$mystring\n";
@s_format = split(/:/, $mystring);
for ($i=0; $i <= $#s_format; $i++) {
    print "index is $i, field is $s_format[$i]";
    print "\n";
}
$size = $#s_format + 1;
print "total size of array is $size\n";
I am expecting my size to be 5; why is it... (5 Replies)
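The answer to this one is documented behavior: Perl's split discards trailing empty fields by default, so `"name:blk:house::"` yields only 3 fields. Passing a negative limit as the third argument keeps them. A quick demonstration (assuming perl is on the PATH):

```shell
# By default split strips trailing empty fields: 3 elements.
# With a limit of -1 the two trailing empties survive: 5 elements.
perl -le '
    my @a = split /:/, "name:blk:house::";
    my @b = split /:/, "name:blk:house::", -1;
    print scalar(@a), " ", scalar(@b);      # prints "3 5"
'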
I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this.
For example:
split -l 3000000 filename.txt
This is very slow and it splits the file with 3 million records in each... (10 Replies)
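One option worth trying, if GNU coreutils is available (this is a GNU-specific flag, not POSIX): `split -n l/N` cuts the file into a fixed number of pieces at line boundaries by size, instead of counting off lines one by one, which can be noticeably cheaper on multi-gigabyte files. A small stand-in example:

```shell
# big.txt here is a toy stand-in for the 3GB file.
seq 1 100000 > big.txt

# Cut into 4 pieces at line boundaries: chunk.aa .. chunk.ad.
split -n l/4 big.txt chunk.

# The pieces concatenate back to the original, byte for byte.
cat chunk.a* | cmp -s - big.txt && echo "no data lost"
```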
Hi,
I have some output in the form of:
#output:
abc123
def567
hij890
ghi324
the above is in one column, stored in the variable x ( and if you wana know about x... x=sprintf(tolower(substr(someArray,1,1)substr(userArray,3,1)substr(userArray,2,1)))
when i simply print x (print x) I get... (7 Replies)
Hi... I have a question regarding the split function in PERL.
I have a huge CSV file (more than 80 million records). I need to extract a particular field (e.g. the 50th) from each line of the file. I tried using the split function, but I realized split takes a very long time.
Also... (1 Reply)
my @d =split('\|', $_);
west|ACH|3|Y|LuV|N||N||
Qt|UWST|57|Y|LSV|Y|Bng|N|KT|
It returns 8 fields for the first line and 9 for the second. I want to process both lines the same way; how do I handle it? (3 Replies)
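This is the same trailing-empty-field behavior again: both lines really have 10 fields, and split silently drops the empty ones at the end of each line. A limit of -1 makes the counts come out equal (assuming perl is on the PATH):

```shell
# With the -1 limit, both sample lines report 10 fields.
perl -le '
    for ("west|ACH|3|Y|LuV|N||N||", "Qt|UWST|57|Y|LSV|Y|Bng|N|KT|") {
        my @d = split /\|/, $_, -1;
        print scalar @d;                # prints 10 for each line
    }
'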
Hello;
I have a file consisting of 4 columns separated by tabs. The problem is the third field. Some of them are very long but can be split on the vertical bar "|". Also, some of them do not contain the string "UniProt", but I can ignore that for the moment and sort the file afterwards. Here is... (5 Replies)