The builtin split function in AWK is too slow
Post 302423675 by aigles on Friday 21st of May 2010 04:46:33 PM
Try this awk program, which doesn't use the split function:
Code:
awk -v Q="'" -F'[:,]' '
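BEGIN {
   # Pipeline run once per book: sort by count (descending), then by name,
   # and join the names into a single output line.
   cmd = "sort -k3,3nr -k2,2 | awk " Q "{ out=out (NR==1 ? $1 \":\" : \",\") $2 } END { print out }" Q;
}
# First file (kevin2.dat): store the books of each list.
NR==FNR {
   list = $1;
   all  = "";
   for (i=2; i<=NF; i++) {
      all = all SUBSEP $i ;
      books[list, i-1] = $i;
   }
   # SUBSEP-delimited string of all books, used for the membership test below.
   books[list, "all"  ] = all SUBSEP;
   books[list, "count"] = NF-1;
   next;
}
# Second file (kevin1.dat): for each book, count the lists it shares
# with every other book.
{
   book  = $1;
   delete bookCount;
   for (i=2; i<=NF; i++) {
      list = $i;
      if (books[list, "all"] ~ SUBSEP book SUBSEP) {
         for (ib=1; ib<=books[list, "count"]; ib++) {
            bookCount[books[list, ib]]++;
         }
      }
   }
   for (b in bookCount) {
      if (b != book) {
         print book, b, bookCount[b] | cmd;
      }
   }
   # Flush this book through sort before starting the next one.
   close(cmd);
}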

' kevin2.dat kevin1.dat

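The script avoids split by storing each list's books twice: as indexed entries and as one SUBSEP-delimited string that serves for membership tests. A minimal sketch of that membership test in isolation, using index() rather than the regex match from the script (the function name has_member and the sample lines are illustrative only, not part of the original program):

```shell
# has_member LIST_LINE BOOK
# LIST_LINE looks like "list1:A,B,C"; prints "yes" or "no".
has_member() {
   printf '%s\n' "$1" | awk -v book="$2" -F'[:,]' '
   {
      all = SUBSEP;                  # leading delimiter
      for (i = 2; i <= NF; i++)
         all = all $i SUBSEP;        # SUBSEP A SUBSEP B SUBSEP ...
      # A delimiter on both sides keeps "A" from matching inside "AB".
      print (index(all, SUBSEP book SUBSEP) ? "yes" : "no");
   }'
}

has_member "list1:A,B,C" B     # prints yes
has_member "list1:AB,C" A      # prints no
```

index() does a plain substring search, so it also sidesteps any regex metacharacters that a book name might contain.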
Input file 1 (kevin1.dat):
Code:
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5

Input file 2 (kevin2.dat):
Code:
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G

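The two sample files appear to describe the same relation from both sides (books per list, and lists per book). If only the book file were available, the list file could probably be rebuilt from it. A rough sketch under that assumption (invert_index is a hypothetical helper name, and the order of books within a line may differ from the file shown above, which does not affect the counts):

```shell
# invert_index [FILE...]
# Reads "book:list1,list2" lines and writes "list:book1,book2" lines.
invert_index() {
   awk -F'[:,]' '
   {
      for (i = 2; i <= NF; i++) {
         if ($i in members)
            members[$i] = members[$i] "," $1;
         else
            members[$i] = $1;
      }
   }
   END {
      # for-in order is unspecified, so the result is sorted afterwards.
      for (list in members)
         print list ":" members[list];
   }' "$@" | sort
}

# Small demonstration on two book lines.
printf '%s\n' 'A:list1,list2' 'B:list1' | invert_index
```

With the files from this post, the call would be something like invert_index kevin1.dat > kevin2.dat.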
Output:
Code:
$ time ./kevin.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C

real    0m1.142s
user    0m0.590s
sys     0m0.580s
$

The problem with that script is that it runs a separate sort command for every book.
The following solution uses only one sort command:
Code:
awk -F'[:,]' '
NR==FNR {
   list = $1;
   all  = "";
   for (i=2; i<=NF; i++) {
      all = all SUBSEP $i ;
      books[list, i-1] = $i;
   }
   books[list, "all"  ] = all SUBSEP;
   books[list, "count"] = NF-1;
   next;
}
{
   book  = $1;
   delete bookCount;
   for (i=2; i<=NF; i++) {
      list = $i;
      if (books[list, "all"] ~ SUBSEP book SUBSEP) {
         for (ib=1; ib<=books[list, "count"]; ib++) {
            bookCount[books[list, ib]]++;
         }
      }
   }
   for (b in bookCount) {
      if (b != book) {
         print book, b, bookCount[b]
      }
   }
}

' kevin2.dat kevin1.dat  |
sort -k1,1 -k3,3nr -k2,2 |
awk '
{
   book = $1;
   if (book == prev) {
      out = out "," $2;              # same book: append to the current line
   } else {
      if (out) print prev ":" out;   # new book: flush the previous line
      out = $2;
      prev = book;
   }
}
END { if (out) print prev ":" out; } # flush the last book
'

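To see what the last two stages of the pipeline do, it may help to feed them a few intermediate lines by hand. The sketch below assumes the first awk emits one "book other count" line per pair, as in the script above; group_sorted is a hypothetical wrapper name and the sample lines are the ones books A and D would produce with the input files from this post:

```shell
# group_sorted: reads "book other count" lines on stdin,
# writes one "book:other1,other2,..." line per book.
group_sorted() {
   sort -k1,1 -k3,3nr -k2,2 |
   awk '
   {
      if ($1 == prev) {
         out = out "," $2;              # same book: append
      } else {
         if (out) print prev ":" out;   # new book: flush the previous one
         out = $2;
         prev = $1;
      }
   }
   END { if (out) print prev ":" out; }
   '
}

printf '%s\n' 'D B 1' 'A H 1' 'A C 2' 'A B 4' 'D A 1' 'A F 2' 'A D 1' |
group_sorted
# prints:
# A:B,C,F,D,H
# D:A,B
```

Because the sort key starts with the book name, all lines of one book arrive together, so the final awk only has to compare each line with the previous one.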
With the same input files, the output is the same but the times are better:
Code:
$ time ./kevin2.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C

real    0m0.419s
user    0m0.152s
sys     0m0.169s
$

Jean-Pierre.

Last edited by aigles; 05-21-2010 at 05:59 PM..