The builtin split function in AWK is too slow


 
# 8  
Old 05-20-2010
OK, I will explain what I want to achieve.
I have two data files, and I want to generate the result by grouping and sorting the data in them.
Code:
$ cat data1.txt
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5

$ cat data2.txt
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G

desired output:
A:B,C,D,F,H
B:A,C,D,F,H
C:A,B,F,H
D:A,B
F:A,B,C,H
H:A,B,C,F

I will explain the output a little. For the row that starts with "A", the entries "B,C,D,F,H" are extracted from all the lists that contain "A", and they are sorted by their frequency of appearance in descending order. For instance, "D" appears only in list3 (A,B,D), so its row is "D:A,B".

I have already written code that generates the desired output, but I have noticed that the split function I use is TOO slow when splitting a string that contains over 30,000 comma-separated fields. So I am here seeking a way to make it fast.
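For scale, here is a hypothetical micro-benchmark (not part of my script) that builds a 30,000-field line and compares split() with awk's own field splitting:

```shell
# Hypothetical micro-benchmark: build one 30,000-field line, then
# count its fields with split() and again with awk's builtin splitting.
big=$(seq 30000 | paste -s -d, -)                 # "1,2,...,30000"
echo "$big" | awk '{ print split($0, a, ",") }'   # counts fields via split(); prints 30000
echo "$big" | awk -F, '{ print NF }'              # same count via FS; prints 30000
```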

This is the code I used to generate the output:
Code:
#!/bin/awk -f

BEGIN {
        FS = ":"
        dateCmd = "date +%s"
        dateCmd | getline ts1           # start timestamp
        close(dateCmd)
}
NR == FNR { bookLists[$1] = $2; next }  # the first file (book -> lists)
{ books[$1] = $2 }                      # the second file (list -> books)
END {
        print "There are in total", length(bookLists), "book lists and", length(books), "books." > "make_data_log.txt"
        count = 0
        # bookLists: A:list1,list2
        # books:     list1:A,B
        cmd = "sort -k 3 -nr | awk '{ arr[$2]; if (!b) b = $1 } END { for (i in arr) str = str ? str \",\" i : i; print b, str }'"
        for (i in bookLists) {
                split(bookLists[i], tmpBls, /,/)
                for (j in tmpBls) {
                        split(books[tmpBls[j]], tmpBs, /,/)
                        for (k in tmpBs)
                                ++result[tmpBs[k]]
                        delete tmpBs
                }
                for (l in result) {
                        if (i != l) print i, l, result[l] | cmd
                        if (++num == 20) break
                }
                close(cmd)      # close the pipe, or the sort output is delayed until the program ends
                delete result
                delete tmpBls
                num = 0
                if (++count % 100000 == 0)
                        print count >> "make_data_log.txt"
        }
        dateCmd | getline ts2           # end timestamp
        close(dateCmd)
        print ts2 - ts1, "seconds consumed." >> "make_data_log.txt"
}



---------- Post updated at 08:39 PM ---------- Previous update was at 07:23 AM ----------

bump...
please, experts... how can I make this script fast?
# 9  
Old 05-21-2010
Quote:
Originally Posted by kevintse
now I have this code:

Code:
awk -F: -vcmd="awk ' BEGIN { RS=\",\"} { print $1 }'" '{ print $2 | cmd; close(cmd)} ' data.txt

That code is incorrect. Since it occurs within double quotes, $1 is replaced by the shell with the value of its first positional parameter before AWK ever sees it. If, instead, you want it to refer to the first field in AWK, you need to escape it: \$1.

The only reason I can think of for why you're getting the expected result is that the shell's first positional parameter, $1, is empty, and so AWK is only seeing "{ print }". Since lines split on the comma are yielding one-field records, within AWK, in this specific case, "print" (equivalent to "print $0") is equivalent to "print $1", and hence everything seems okay.

If I'm correct, setting $1 in the shell to a non-null value not equal to a literal '$0' or a literal '$1' will break the code.
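To illustrate (using a made-up positional parameter):

```shell
# Demonstration: inside double quotes the SHELL expands $1, not awk.
set -- NOTAFIELD                    # give the shell a non-empty $1
echo "a,b,c" | awk "{ print $1 }"   # awk runs { print NOTAFIELD }: prints an empty line
echo "a,b,c" | awk "{ print \$1 }"  # awk sees { print $1 }: prints a,b,c
```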

Regards,
Alister

---------- Post updated 05-21-10 at 12:19 AM ---------- Previous update was 05-20-10 at 10:01 PM ----------

Quote:
Originally Posted by kevintse
Code:
$ cat data2.txt
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G

desired output:
A:B,C,D,F,H
B:A,C,D,F,H
C:A,B,F,H
D:A,B
F:A,B,C,H
H:A,B,C,F

I will explain a little bit for the output. For the row that starts with "A", "B,C,D,F,H" are extracted from all those lists that contain "A", and they are sorted by their appearing frequency in descending order.

I have already written the code that can generate the desired output.
That output is incorrect. If you take a close look at the desired output you provided, you'll see the lines are incorrectly sorted. The easiest to spot is "H:A,B,C,F". H occurs with F twice, and with the others once, so that line should be "H:F,A,B,C". As a matter of fact, with the exception of the C and D lines, they are all wrong. A quick look at your code suggests that the problem lies in:

Quote:
Originally Posted by kevintse
Code:
cmd = "sort -k 3 -nr | awk ' { arr[$2]; if(!b) b=$1 } END { for (i in arr)

Specifically, 'for (i in arr)' is not guaranteed to return the array elements in any particular order. That pipeline sorts with the sort command, but that order is then discarded when the "in" operator is used.
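To illustrate with two made-up lines: building the output string as the sorted lines arrive preserves the sort order, whereas collecting the items in an array and iterating with "in" does not:

```shell
# Keep the sort order by appending each field as its line is read,
# instead of storing the items in an array and using "for (i in arr)".
printf '%s\n' 'A B 1' 'A F 2' | sort -k3,3nr |
awk '{ if (!b) b = $1; str = str ? str "," $2 : $2 }
     END { print b ":" str }'
# prints A:F,B  (F first, because it has the higher count)
```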

Regards,
Alister
# 10  
Old 05-21-2010
Try this awk program, which doesn't use the split function:
Code:
awk -v Q="'" -F'[:,]' '
BEGIN {
   cmd = "sort -k3,3nr -k2,2 | awk " Q "{ out=out (NR==1 ? $1 \":\" : \",\") $2 } END { print out }" Q;
}
NR==FNR {
   list = $1;
   all  = "";
   for (i=2; i<=NF; i++) {
      all = all SUBSEP $i ;
      books[list, i-1] = $i;
   }
   books[list, "all"  ] = all SUBSEP;
   books[list, "count"] = NF-1;
   next;
}
{
   book  = $1;
   delete bookCount;
   for (i=2; i<=NF; i++) {
      list = $i;
      if (books[list, "all"] ~ SUBSEP book SUBSEP) {
         for (ib=1; ib<=books[list, "count"]; ib++) {
            bookCount[books[list, ib]]++;
         }
      }
   }
   for (b in bookCount) {
      if (b != book) {
         print book, b, bookCount[b] | cmd;
      }
   }
   close(cmd);
}

' kevin2.dat kevin1.dat
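As an aside, the membership test in the script relies on wrapping every item in SUBSEP; here is a minimal, self-contained sketch of that trick with made-up items:

```shell
# Sketch of the SUBSEP membership trick: items are stored as
# SUBSEP item SUBSEP item ... SUBSEP, so matching "SUBSEP x SUBSEP"
# finds whole items only (no false prefix hits).
awk 'BEGIN {
    all = SUBSEP "A" SUBSEP "AB" SUBSEP      # holds the items "A" and "AB"
    if (all ~ SUBSEP "A" SUBSEP)  print "A is a member"
    if (all ~ SUBSEP "AA" SUBSEP) print "AA matched (it should not)"
    else                          print "AA is not a member"
}'
# prints:
# A is a member
# AA is not a member
```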

Input file 1 (kevin1.dat) :
Code:
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5

Input file 2 (kevin2.dat) :
Code:
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G

Output:
Code:
$ time ./kevin.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C

real    0m1.142s
user    0m0.590s
sys     0m0.580s
$

The problem with that script is that we run a sort command for every book.
The following solution uses only one sort command:
Code:
awk -v Q="'" -F'[:,]' '
NR==FNR {
   list = $1;
   all  = "";
   for (i=2; i<=NF; i++) {
      all = all SUBSEP $i ;
      books[list, i-1] = $i;
   }
   books[list, "all"  ] = all SUBSEP;
   books[list, "count"] = NF-1;
   next;
}
{
   book  = $1;
   delete bookCount;
   for (i=2; i<=NF; i++) {
      list = $i;
      if (books[list, "all"] ~ SUBSEP book SUBSEP) {
         for (ib=1; ib<=books[list, "count"]; ib++) {
            bookCount[books[list, ib]]++;
         }
      }
   }
   for (b in bookCount) {
      if (b != book) {
         print book, b, bookCount[b]
      }
   }
}

' kevin2.dat kevin1.dat  |
sort -k1,1 -k3,3nr -k2,2 |
awk '
{
   book = $1;
   if (book == prev) {
      out = out "," $2;
   } else {
      if (out) print prev ":" out;
      out = $2;
      prev = book;
   }
}
END { if (out) print prev ":" out; }
'
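The final awk stage can be exercised on its own; here is a minimal run with made-up, pre-sorted input:

```shell
# Standalone run of the grouping stage: pre-sorted "book other count"
# lines are collapsed into one "book:other1,other2,..." line per book.
printf '%s\n' 'A B 2' 'A C 1' 'D B 1' |
awk '{ if ($1 == prev) out = out "," $2
       else { if (out) print prev ":" out; out = $2; prev = $1 } }
     END { if (out) print prev ":" out }'
# prints:
# A:B,C
# D:B
```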

With the same input files, the output is the same but the times are better:
Code:
$ time ./kevin2.sh
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C

real    0m0.419s
user    0m0.152s
sys     0m0.169s
$

Jean-Pierre.

Last edited by aigles; 05-21-2010 at 05:59 PM..
# 11  
Old 05-21-2010
I have been going through your posts, and I don't quite understand a few things in your data files.

Quote:
Originally Posted by kevintse
...I have two data files, I want to generate the result through grouping and sorting the data in these two files.
Code:
$ cat data1.txt
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5
A:list1,list2,list3,list4
B:list1,list2,list3,list4
C:list1,list2,list6
D:list3
F:list2,list4,list5
G:list7
H:list2,list5
 
$ cat data2.txt
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G
 
desired output:
A:B,C,D,F,H
B:A,C,D,F,H
C:A,B,F,H
D:A,B
F:A,B,C,H
H:A,B,C,F

I will explain a little bit for the output. For the row that starts with "A", "B,C,D,F,H" are extracted from all those lists that contain "A", and they are sorted by their appearing frequency in descending order.
...
The second set of seven lines (A through H, repeated) is a repetition of the portion immediately above it in your file data1.txt.

(1) Can your actual file have data like that?
(2) If yes, then can the "second" set be different? For example, is it possible to have two lines like so -

Code:
A:list1,list2,list3,list4
...
A:list1,list3,list5

(3) If yes, then can there be more than two sets in data1.txt? Like so -

Code:
A:list1,list2,list3,list4
...
A:list1,list3,list5
...
A:list1,list5,list7

(4) In such a case, do you collect a unique set of "lists" for each character on the left?
For example, for A => list1, list2, list3, list4, list5, list7?

(5) The character "C" is associated with 3 lists in data1.txt:

Code:
C:list1,list2,list6

And these lists have the following set of characters:

Code:
list1:A,B,C
list2:A,B,C,F,H
...
list6:C
...

So, the distinct set of characters associated with list1, list2 and list6 should be => (A, B, C, F, H)

Your desired output for "C" is like so -

Code:
desired output:
...
C:A,B,F,H

Have you removed "C" from the right-hand-side list because it is common to both sides of the ":" character?

(6) Is that also the reason you've omitted "G:G" from your desired output?

Quote:
...
This is the code I used to generate the output:
Code:
(script omitted; it is the same code shown in post #8)

...
What's the current response time for your actual data "data1.txt" (the one that has more than 30,000 comma-delimited strings)?

And what is the acceptable response time?

tyler_durden
# 12  
Old 05-22-2010
Quote:
Originally Posted by aigles
(scripts, data and timings omitted; see post #10)

Hi, aigles
Thank you so much. Your script is much faster than mine.
It has also helped me a lot: I have learned many things about AWK from it (it surprises me that AWK can be written this way).
Thank you!

Last edited by kevintse; 05-22-2010 at 12:02 PM..
# 13  
Old 05-22-2010
It seems to me that both files contain the same information, though in different formats. A simpler solution would be to use a different algorithm, which builds an internal list of book-pairs in one pass using one data file:
Code:
#!/bin/sh

awk -F'[:,]' '
    { for(i=2;i<=NF;i++) for(j=2;j<=NF;j++) if (i!=j) a[$i" "$j]++}
    END { for (k in a) print k" "a[k] }' "$1" \
| sort -k1,1 -k3,3nr -k2,2 \
| awk '{b=$1; if (b!=ob) {if (NR>1) print s; s=$1":"$2; ob=b; next}; s=s","$2} END {print s}'
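The pair-counting pass can be illustrated on a single made-up record:

```shell
# One record: every ordered pair of books on the list gets counted.
# list5 holds H and F, so both "H F" and "F H" reach a count of 1.
echo 'list5:H,F' | awk -F'[:,]' '
    { for (i = 2; i <= NF; i++)
          for (j = 2; j <= NF; j++)
              if (i != j) a[$i " " $j]++ }
    END { for (k in a) print k " " a[k] }' | sort
# prints:
# F H 1
# H F 1
```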

Test run:
Code:
$ cat data
list1:A,B,C
list2:A,B,C,F,H
list3:A,B,D
list4:A,B,F
list5:H,F
list6:C
list7:G
$ ./books.sh data
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C



A perl solution which is probably faster:
Code:
for ($i=1; $i<=$#F; $i++) {
    for ($j=1; $j<=$#F; $j++) {
        if ($i!=$j) {
            $books{$F[$i]}{$F[$j]}++
        }
    }
}

END {
    for $k ( sort keys %books ) {
        @v = sort { $books{$k}{$b} != $books{$k}{$a}
                    ? $books{$k}{$b} <=> $books{$k}{$a}
                    : $a cmp $b
                  } keys %{ $books{$k} };
        print "$k:" . join (",", @v);
    }
}

Test run, using the same data file as with the sh/awk/sort solution:
Code:
$ perl -lan -F'[:,]' books.pl data
A:B,C,F,D,H
B:A,C,F,D,H
C:A,B,F,H
D:A,B
F:A,B,H,C
H:F,A,B,C

Note: It's been about 10 years since I've written anything more than a one-liner in perl, so perhaps a perl guru can slash that to a couple of lines.

Regards,
Alister
# 14  
Old 05-23-2010
Hi, Alister
I am really grateful for your succinct solutions.
The perl script is faster. But I still need some help from you: I have never learned perl, and I don't have spare time to learn it at the moment. I want to modify your script to print just the top 20 books associated with each book. How can I do that?

Thank you!!!