Performance assessment of using single or combined pattern matching


 
# 1  
Old 07-08-2017

Hi,

I want to know which pattern-matching technique will give better performance and quicker results.

I will have the patterns in a file, and I want to read those patterns and search through a whole file of, say, 70 MB. One option is to build a single combined string while reading the pattern file, joining the patterns with "or" conditions into one string variable, and then use that in awk to search the 70 MB file:
Code:
nawk -F"," '{ if ((substr($17,1,10)==1234567890 || substr($17,1,10)==2345678901 || substr($17,1,10)==3456789012 || substr($17,1,10)==4567890123)  && (substr($3,1,6)=="ABCDEF" || substr($3,1,6)=="GHIJKL" || substr($3,1,6)=="MNOPQR")) print substr($3,1,6)","$4","$5","$6","$8","$100","$101","$102",4"$103","$104","$109}' /text16.txt

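The other option is to read the patterns one by one and search the whole file once for each pattern, like this:
Code:
 
while IFS= read -r line
do
    # same nawk with a single pattern in the "or" portion; the patterns after && stay the same and fixed
    nawk -F"," -v p="$line" 'substr($17,1,10)==p && (substr($3,1,6)=="ABCDEF" || substr($3,1,6)=="GHIJKL" || substr($3,1,6)=="MNOPQR") {print substr($3,1,6)","$4","$5","$6","$8","$100","$101","$102",4"$103","$104","$109}' /text16.txt
done < file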

Which approach will be faster? Also, kindly show sample techniques for building the string when the file contains more than one entry.

I only want to build the "or" portion of the above code by string concatenation; the "and" portion will stay fixed. The file can contain one pattern or multiple patterns.

Last edited by ananan; 07-08-2017 at 03:16 PM..
# 2  
Old 07-08-2017
The first should be fastest, because it reads the 70 MB file only once rather than once per pattern.
I think it could be replaced by something like this (not tested), which should save some operations:
Code:
nawk  '
  BEGIN {
    FS=OFS=","
    p1="^(1234567890|2345678901|3456789012|4567890123)$"
    p2="^(ABCDEF|GHIJKL|MNOPQR)$"
  }

  {
    f1=substr($17,1,10)
    f2=substr($3,1,6)
  }

  f1~p1 && f2~p2 {
    print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
  }
' /text16.txt
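
To check that the two approaches from post # 1 really select the same rows before timing them, something like the following sketch could be used. It is only an illustration: the three-field layout, the sample rows and the file names patterns.txt and data.csv are assumptions, not the 109-field file from the thread.

```shell
#!/bin/sh
# Sketch: confirm that one combined pass and a per-pattern loop pick the same rows.
# The three-field layout and the file names are assumptions made for this illustration.
tmp=$(mktemp -d)
printf '%s\n' 1234567890 2345678901 > "$tmp/patterns.txt"
printf '%s\n' \
  '1234567890xx,ABCDEF1,a' \
  '9999999999xx,ABCDEF2,b' \
  '2345678901yy,GHIJKL3,c' \
  '2345678901zz,ZZZZZZ4,d' > "$tmp/data.csv"

p2='^(ABCDEF|GHIJKL|MNOPQR)$'

# Approach 1: a single pass over the data with a combined alternation.
alt=$(paste -s -d '|' "$tmp/patterns.txt")
p1="^($alt)\$"
combined=$(awk -F, -v p1="$p1" -v p2="$p2" \
  'substr($1,1,10) ~ p1 && substr($2,1,6) ~ p2' "$tmp/data.csv" | sort)

# Approach 2: one full pass over the data for every pattern.
looped=$(while IFS= read -r pat; do
  awk -F, -v p="$pat" -v p2="$p2" \
    'substr($1,1,10) == p && substr($2,1,6) ~ p2' "$tmp/data.csv"
done < "$tmp/patterns.txt" | sort)

[ "$combined" = "$looped" ] && echo "same rows selected"
rm -rf "$tmp"
```

On the real file, each approach could be prefixed with time to measure it; the loop starts awk and rereads the data once per pattern, so its cost should grow with the number of patterns.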

# 3  
Old 07-08-2017
What Scrutinizer proposed is definitely faster than what you have in your post, but it has the patterns as string constants. You'll need to build those from the file, but how will you tell where to use the "or" operator and where the "and"? Please post a sample of your pattern file.
# 4  
Old 07-08-2017
You have understood my requirements correctly.
The "and" condition variable p2 will be constant. I only need to build the "or" condition variable p1.
Say the pattern file contains:
Code:
 
1234567890
2345678901
3456789012
4567890123
5678901234

The file can contain either an odd or an even number of entries (I mean the wc -l of the file).
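
For the string formation itself, a minimal sketch in the shell, assuming one pattern per line and possibly some empty lines. The file name patterns.txt is hypothetical; p1 is the variable name used in post # 2.

```shell
#!/bin/sh
# Hypothetical pattern file for illustration; the blank lines are there on purpose.
tmp=$(mktemp -d)
printf '\n1234567890\n2345678901\n3456789012\n\n' > "$tmp/patterns.txt"

# Drop empty lines, then join the remaining lines with "|".
alt=$(grep -v '^$' "$tmp/patterns.txt" | paste -s -d '|' -)

# Anchor the alternation so that only whole-string matches count.
p1="^($alt)\$"
echo "$p1"

rm -rf "$tmp"
```

The resulting string can then be handed to awk with -v p1="$p1". The number of entries, odd or even, makes no difference to the join.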

Last edited by ananan; 07-09-2017 at 12:04 AM..
# 5  
Old 07-09-2017
Do you really have empty lines in your pattern file? Anyhow, try this adaption of Scrutinizer's proposal:
Code:
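awk  '
BEGIN           {FS=OFS=","                             # comma-separated input and output
                 p2="^(ABCDEF|GHIJKL|MNOPQR)$"          # fixed "and" pattern
                }

# First file (patternfile): join the non-empty lines with "|"
NR == FNR       {if (NF)        {TMP = TMP DL $0
                                 DL = "|"
                                }
                 next
                }
# First line of the second file: wrap the alternation in anchors
FNR == 1        {p1 = "^(" TMP ")$"
                }
                {f1=substr($17,1,10)
                 f2=substr($3,1,6)
                }

f1~p1 && f2~p2  {print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
                }
' patternfile file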
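
Since the patterns here are fixed strings rather than real regular expressions, an exact lookup in an awk array is another option. This is a different technique from the alternation above, shown only as a sketch; the two-field layout, the sample rows and the array name want are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: exact-match lookup in an awk array instead of a regex alternation.
# File contents and the two-field layout are assumptions for illustration only.
tmp=$(mktemp -d)
printf '1234567890\n\n2345678901\n' > "$tmp/patternfile"
printf '%s\n' \
  '1234567890xx,ABCDEF1' \
  '1234567899xx,ABCDEF2' \
  '2345678901yy,QQQQQQ3' > "$tmp/file"

result=$(awk '
BEGIN     { FS = OFS = ","; p2 = "^(ABCDEF|GHIJKL|MNOPQR)$" }
NR == FNR { if (NF) want[$0]      # first file: remember each non-empty pattern
            next }
          { f1 = substr($1, 1, 10); f2 = substr($2, 1, 6) }
(f1 in want) && f2 ~ p2 { print f1, f2 }
' "$tmp/patternfile" "$tmp/file")
echo "$result"
rm -rf "$tmp"
```

Whether this is faster than the regex on a 70 MB file would have to be measured; it mainly avoids building and anchoring the pattern string.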

# 6  
Old 07-09-2017
Thank you, it's working fine. If you don't mind, how can I also read a second pattern file, form a variable p3 from it like p2, and match it against f3 in the same script?

Quote:
Originally Posted by RudiC
Do you really have empty lines in your pattern file? Anyhow, try this adaption of Scrutinizer's proposal:

# 7  
Old 07-10-2017
Why don't you give it a go yourself and post it here so we can discuss your approach?
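
As one possible starting point for such an attempt, here is a minimal sketch with two pattern files. The thread does not say which field f3 comes from, so using field 3 is purely hypothetical, as are the file names, the sample rows and the helper variables t1, t3, d1 and d3.

```shell
#!/bin/sh
# Sketch: build p1 from the first pattern file and p3 from the second one.
# Field positions, file names and sample rows are assumptions for illustration.
tmp=$(mktemp -d)
printf '1234567890\n2345678901\n' > "$tmp/patternfile1"
printf 'XX\nYY\n' > "$tmp/patternfile2"
printf '%s\n' \
  '1234567890aa,ABCDEF1,XX' \
  '1234567890bb,ABCDEF2,ZZ' \
  '2345678901cc,GHIJKL3,YY' > "$tmp/file"

result=$(awk '
BEGIN { FS = OFS = ","; p2 = "^(ABCDEF|GHIJKL|MNOPQR)$" }
# Lines of the first pattern file: collect the alternatives for p1
FILENAME == ARGV[1] { if (NF) { t1 = t1 d1 $0; d1 = "|" }
                      next }
# Lines of the second pattern file: collect the alternatives for p3
FILENAME == ARGV[2] { if (NF) { t3 = t3 d3 $0; d3 = "|" }
                      next }
# First line of the data file: anchor both alternations
FNR == 1 { p1 = "^(" t1 ")$"; p3 = "^(" t3 ")$" }
         { f1 = substr($1, 1, 10); f2 = substr($2, 1, 6); f3 = $3 }
f1 ~ p1 && f2 ~ p2 && f3 ~ p3 { print f1, f2, f3 }
' "$tmp/patternfile1" "$tmp/patternfile2" "$tmp/file")
echo "$result"
rm -rf "$tmp"
```

Comparing FILENAME with ARGV[1] and ARGV[2] is used here instead of NR == FNR because that test can only single out the first file; it assumes the three file names differ.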