Performance assessment of using single or combined pattern matching

07-08-2017

Hi,

I want to know which pattern matching approach will perform better and return results faster.

I will have the patterns in a file, and I want to read those patterns and search a whole file of, say, 70 MB. One option is to build a single match string while reading the pattern file, combining the patterns with "or" conditions, and then use that string in awk to search the 70 MB file in one pass:

Code:

nawk -F"," '{
  if ((substr($17,1,10)==1234567890 || substr($17,1,10)==2345678901 ||
       substr($17,1,10)==3456789012 || substr($17,1,10)==4567890123) &&
      (substr($3,1,6)=="ABCDEF" || substr($3,1,6)=="GHIJKL" || substr($3,1,6)=="MNOPQR"))
    print substr($3,1,6)","$4","$5","$6","$8","$100","$101","$102",4"$103","$104","$109
}' /text16.txt

The other option is to read the patterns one at a time and search the whole file once per pattern, like this:

Code:

while read line
do
  ... (same nawk with a single pattern in the "or" part; the patterns after && stay the same and fixed)
done < file

Which approach will be faster? Please also show how to build the combined string when the pattern file contains more than one entry.

I want to build only the "or" part of the code above by string concatenation; the "and" part is fixed. The file can contain one pattern or several.

Last edited by ananan; 07-08-2017 at 03:16 PM..

ananan

07-08-2017

The first approach should be fastest, because the 70 MB file is read only once instead of once per pattern.
I think it could be replaced by something like this (not tested), which should save some operations:

Code:

nawk  '
  BEGIN {
    FS=OFS=","
    p1="^(1234567890|2345678901|3456789012|4567890123)$"
    p2="^(ABCDEF|GHIJKL|MNOPQR)$"
  }

  {
    f1=substr($17,1,10)
    f2=substr($3,1,6)
  }

  f1~p1 && f2~p2 {
    print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
  }
' /text16.txt

Scrutinizer

07-08-2017

What Scrutinizer proposed is definitely faster than what you have in your post, but it has the patterns as string constants. You'll need to build those from the file, but how will you tell where to use the "or" operator and where the "and"? Please post a sample of your pattern file.
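Building the "or" string from the file can also be done in the shell before awk starts. The following is a minimal sketch and not part of the original replies. The file name patterns.txt and the variable names alt, p1 and matched are assumptions for illustration, and it assumes the patterns contain no regex metacharacters.

```shell
#!/bin/sh
# Sketch: join the lines of a pattern file into one anchored alternation.
# patterns.txt is a hypothetical file name; blank lines are skipped.
tmp=$(mktemp -d)
printf '%s\n' 1234567890 2345678901 '' 3456789012 > "$tmp/patterns.txt"

# Drop empty lines, then join the remaining lines with "|".
alt=$(grep -v '^$' "$tmp/patterns.txt" | paste -sd'|' -)
p1="^($alt)\$"
printf '%s\n' "$p1"

# Hand the finished pattern to awk once; the input is then read in a single pass.
matched=$(printf '%s\n' 2345678901 5555555555 | awk -v p1="$p1" '$0 ~ p1')
printf '%s\n' "$matched"

rm -rf "$tmp"
```

In a real run the same `-v p1="$p1"` assignment would replace the hard-coded p1 string in Scrutinizer's script.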

RudiC

07-08-2017

You have understood my requirement correctly.
The "and" condition variable p2 will be constant. I only need to build the "or" condition variable p1.
Say the pattern file contains:

Code:

The file can contain either an odd or an even number of entries (I mean the wc -l of the file).

Last edited by ananan; 07-09-2017 at 12:04 AM..

ananan

07-09-2017

Do you really have empty lines in your pattern file? Anyhow, try this adaptation of Scrutinizer's proposal:

Code:

awk  '
BEGIN           {FS=OFS=","
                 p2="^(ABCDEF|GHIJKL|MNOPQR)$"
                }

NR == FNR       {if (NF)        {TMP = TMP DL $0
                                 DL = "|"
                                }
                 next
                }
FNR == 1        {p1 = "^(" TMP ")$"
                }
                {f1=substr($17,1,10)
                 f2=substr($3,1,6)
                }

f1~p1 && f2~p2  {print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
                }
' patternfile file
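
Because both tests are exact matches on fixed-length substrings, an array lookup with awk's `in` operator is a possible alternative to the built-up regex. This is a different technique from the one posted above, shown only as a sketch: the file names patterns.txt and data.csv and the generated sample records are assumptions for illustration. It also sidesteps any regex metacharacters that might appear in the pattern file.

```shell
#!/bin/sh
# Sketch: exact-key lookup in awk arrays instead of a built-up regex.
# patterns.txt and data.csv are hypothetical names; the sample data is generated.
tmp=$(mktemp -d)

# Pattern file: one 10-character key per line; the blank line is ignored later.
printf '%s\n' 1234567890 '' 2345678901 > "$tmp/patterns.txt"

# Three 109-field records; only fields 3 and 17 differ between them.
awk 'BEGIN {
    OFS = ","
    n = split("ABCDEFxx GHIJKLyy ZZZZZZzz", f3, " ")
    split("1234567890AA 9999999999BB 2345678901CC", f17, " ")
    for (r = 1; r <= n; r++) {
        for (i = 1; i <= 109; i++) $i = "c" i
        $3 = f3[r]; $17 = f17[r]
        print
    }
}' > "$tmp/data.csv"

out=$(awk '
BEGIN {
    FS = OFS = ","
    # The fixed keys from the thread become array indices.
    n = split("ABCDEF GHIJKL MNOPQR", a, " ")
    for (i = 1; i <= n; i++) p2[a[i]] = 1
}
NR == FNR { if (NF) p1[$0] = 1; next }   # first file: store each non-empty key
{
    f1 = substr($17, 1, 10)
    f2 = substr($3, 1, 6)
}
(f1 in p1) && (f2 in p2) {
    print f2, $4, $5, $6, $8, $100, $101, $102, "4" $103, $104, $109
}
' "$tmp/patterns.txt" "$tmp/data.csv")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Only the first generated record passes both tests. As with the regex version, the `NR == FNR` test assumes the pattern file has at least one line; with an empty pattern file the data file would be read as patterns instead.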

RudiC

07-09-2017

Thank you, it's working fine. If you don't mind, how can I read a second pattern file, form a variable p3 from it, similar to p2, and match it against f3 in the same script?

ananan

07-10-2017

Why don't you give it a go yourself and post it here so we can discuss your approach?

RudiC
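
For readers attempting the two-pattern-file variant, one possible starting point is to count input files as they begin and collect one alternation per pattern file. This is a sketch under stated assumptions, not an answer given in the thread: the file names, the array names alt and dl, and the choice of field 5 (first 3 characters) for f3 are all hypothetical, since the thread never says which field f3 comes from. It also assumes both pattern files contain at least one line, because an empty file never triggers `FNR == 1` and would throw the file count off.

```shell
#!/bin/sh
# Sketch: two pattern files, each turned into its own alternation.
# File names and the use of field 5 for f3 are assumptions, not from the thread.
tmp=$(mktemp -d)
printf '%s\n' 1234567890 2345678901 > "$tmp/patterns1.txt"
printf '%s\n' XYZ QRS > "$tmp/patterns2.txt"

# Three 109-field records; fields 3, 5 and 17 carry the test values.
awk 'BEGIN {
    OFS = ","
    n = split("ABCDEFxx GHIJKLyy MNOPQRzz", f3, " ")
    split("XYZ99 NOPE1 QRS77", f5, " ")
    split("1234567890AA 2345678901CC 9999999999BB", f17, " ")
    for (r = 1; r <= n; r++) {
        for (i = 1; i <= 109; i++) $i = "c" i
        $3 = f3[r]; $5 = f5[r]; $17 = f17[r]
        print
    }
}' > "$tmp/data.csv"

out=$(awk '
BEGIN {
    FS = OFS = ","
    p2 = "^(ABCDEF|GHIJKL|MNOPQR)$"
}
FNR == 1 { fileno++ }                 # a new input file starts
fileno <= 2 {                         # files 1 and 2 hold patterns
    if (NF) {
        alt[fileno] = alt[fileno] dl[fileno] $0
        dl[fileno] = "|"
    }
    next
}
FNR == 1 {                            # first record of the data file
    p1 = "^(" alt[1] ")$"
    p3 = "^(" alt[2] ")$"
}
{
    f1 = substr($17, 1, 10)
    f2 = substr($3, 1, 6)
    f3 = substr($5, 1, 3)             # hypothetical field for f3
}
f1 ~ p1 && f2 ~ p2 && f3 ~ p3 {
    print f2, $4, $5, $6, $8, $100, $101, $102, "4" $103, $104, $109
}
' "$tmp/patterns1.txt" "$tmp/patterns2.txt" "$tmp/data.csv")
printf '%s\n' "$out"

rm -rf "$tmp"
```

The structure mirrors RudiC's script: the pattern-collecting rule ends with `next`, so only records from the third file reach the matching rules.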
