Efficient way to search array in text file by awk

10-21-2015

I have an array SPLNO with approx. 10k numbers. I want to look up each subscriber number from the MDN.TXT file (approx. 1.5 lakh, i.e. 150,000, records) in that array. If the subscriber number is found in the array, the operation below is performed. My issue is that this takes too long, because for each number the whole array of 10k records is searched, so for 1.5 lakh records the loop runs about 1.5 lakh × 10k (1.5 billion) times. Please suggest more efficient ways.

Sample SPLNO.TXT:

Code:

918542054921|30|1|2
918542144944|12|1|2
854215595|12|1|2
918542166966|12|1|2
854225595|12|1|2
918542355955|12|1|2
918542455955|12|1|2
918542555955|12|1|2
918542955955|12|1|2

Sample MDN.TXT:

Code:

8542166966
8542355955
8542555955

The code is:

Code:

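awk -F"|" 'FNR==1 { ++counter }
# First file (SPLNO.TXT): store pulse, amount and maximum length per number.
counter==1 { SPLNOPULSE[$1]=$4; SPLNOAMT[$1]=$3; SPLNOMAXLEN[$1]=$2; next }
# Second file (MDN.TXT): compare each line against every stored number.
{
    for (mdn in SPLNOMAXLEN)
    {
        if ( ($1 ~ "^"mdn && length($1) <= SPLNOMAXLEN[mdn]) || ("91"$1 ~ "^"mdn && length("91"$1) <= SPLNOMAXLEN[mdn]) )
            print "found"
        else
            print "not found"
    }
}' SPLNO.TXT MDN.TXT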

siramitsharma

10-21-2015

If you want to find one of 10k items in a large file, there is no straightforward way to avoid comparing each item against each line in the file. What you can do is break out of the loop as soon as a match is found.

You could try a binary search algorithm.
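The break-when-found idea can be sketched roughly as follows. This is only an illustration under assumptions: the wrapper name search_with_break and the "number,found" output format are made up here, and the match condition is copied from the original script. It prints one result per MDN.TXT line instead of one per comparison.

```shell
# Hypothetical wrapper: search_with_break SPLNO_FILE MDN_FILE
search_with_break() {
    awk -F'|' '
    # First file: remember the maximum input length for each stored number.
    NR == FNR { maxlen[$1] = $2; next }
    # Second file: stop scanning the array as soon as one entry matches.
    {
        hit = 0
        for (p in maxlen) {
            if (($1 ~ "^" p && length($1) <= maxlen[p]) ||
                ("91" $1 ~ "^" p && length("91" $1) <= maxlen[p])) {
                hit = 1
                break
            }
        }
        print $1 "," (hit ? "found" : "not found")
    }' "$1" "$2"
}
```

Usage would be search_with_break SPLNO.TXT MDN.TXT. Numbers that are not in the array still cost a full pass over all 10k entries, so this only helps for lines that do match.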

RudiC

10-21-2015

You could also explain what the SPLNOMAXLEN[] values are trying to accomplish. (If you were trying to perform an exact match for one of two numbers ($1 or "91"$1) instead of a string match at the start followed by a string length comparison, it would be tremendously faster.)

And, please explain why printing found or not found 1.5 billion times with no indication of what was or was not found is going to be useful to anyone. This appears to be a useless exercise. And, in that case, why does the speed matter?

But, showing us the output you're hoping to produce from those sample input files might help us understand what you're trying to do.

Note that if 1/3 of your MDN.TXT lines are duplicates (as they are in your sample), you might speed things up considerably by getting rid of the duplicates before running the search loop. That alone would probably save you about 25% on your script's running time.
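To illustrate the exact-match remark above: if an exact match on $1 or "91"$1 were acceptable, the inner loop could be replaced by two array membership tests. The sketch below assumes exactly that (which is not what the original script does), and the function name exact_lookup, the "number,$3,$4" output layout and the "not found" marker are hypothetical.

```shell
# Hypothetical wrapper: exact_lookup SPLNO_FILE MDN_FILE
exact_lookup() {
    awk -F'|' -v OFS=',' '
    # First file: index amount ($3) and pulse ($4) by the number in $1.
    NR == FNR { amt[$1] = $3; pulse[$1] = $4; next }
    # Second file: two hash lookups per line instead of a loop over all entries.
    {
        with91 = "91" $1
        if ($1 in amt)          key = $1
        else if (with91 in amt) key = with91
        else { print $1, "not found"; next }
        print $1, amt[key], pulse[key]
    }' "$1" "$2"
}
```

Because the `in` test is a hash lookup, the cost per MDN.TXT line no longer depends on the number of entries in SPLNO.TXT. Duplicate lines in MDN.TXT are simply looked up again, which is cheap.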

Don Cragun

10-24-2015

Hi Don,
Please find below the answers to the queries raised.

SPLNOMAXLEN[] is for checking the maximum length of the input string, i.e. the number from MDN.TXT. I am not trying for an exact match on one of two numbers, because the input string may start with or without "91". That is why these conditions are used:

Code:

$1 ~ "^"mdn

and

Code:

"91"$1 ~ "^"mdn

I will carry out further steps depending on whether a number is found, for example populating fields from:

Code:

SPLNOPULSE[mdn] SPLNOAMT[mdn]

The found and not-found cases will each be handled accordingly.

So the output for found numbers will be:

Code:

8542355955,1,2
8542555955,1,2

MDN.TXT will definitely contain duplicate lines as per the requirement, and I cannot avoid that, but SPLNO.TXT will certainly not contain duplicates.

Please let me know in case processing time can be reduced.

siramitsharma

10-24-2015

Do computations whose result does not change inside the loop before the loop;
that might save a few CPU cycles.

Code:

len1=length($1)
lenL1=length(L1="91"$1)
for ( mdn in SPLNOMAXLEN)
        {
         if ( ($1 ~ "^"mdn && len1 <=SPLNOMAXLEN[mdn]) || (L1 ~ "^"mdn && lenL1 <=SPLNOMAXLEN[mdn]) )

MadeInGermany

10-26-2015

Hi siramitsharma,
Forum rules prohibit sending private messages asking people to respond to your posts.

Note that with the two lines in SPLNO.TXT:

Code:

918542054921|30|1|2
854215595|12|1|2

any of the following lines in MDN.TXT:

Code:

918542054921
91854205492
9185420549
918542054
91854205
9185420
918542
91854
9185
918
91
9
8542054921
854205492
85420549
8542054
854205
85420
8542
854
85
8

would match the 1st line. And any of the following lines in MDN.TXT:

Code:

would match the 2nd line.
So maybe you could pre-build a table of the values that could match entries from the first file. Then, instead of performing lots of matches and comparisons in a loop, you could just test if((mdn in table) || ("91"mdn in table)) instead of the four slower tests you are currently using to see if there is a match.

Since every entry in the input has the final two fields with the values 1 and 2, respectively, all we need to know is whether or not there is a match; not what values appear if there is a match. This is important because both entries above match lines in MDN.TXT containing the values:

Code:

You also need to explain why the 2nd field in the 1st line above has the value 30. Since field 1 is 12 characters (918542054921), the longest possible string that can be matched is 12 characters. And, on the 2nd line we have a 1st field (854215595) with length 9 and a second field containing 12. So, I repeat, what use is the 2nd field in SPLNO.TXT other than to give you two more tests to slow down your loop?
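One way to get table lookups while keeping the condition from the original script ($1 or "91"$1 starts with a stored number and is no longer than that number's second field) is to try each leading substring of the input as an array key. This is only a sketch under assumptions: the stored numbers are plain digit strings, the first (shortest) matching entry wins, and the names prefix_lookup and find_prefix, the "number,$3,$4" output and the "not found" marker are hypothetical.

```shell
# Hypothetical wrapper: prefix_lookup SPLNO_FILE MDN_FILE
prefix_lookup() {
    awk -F'|' -v OFS=',' '
    # First file: index maximum length, amount and pulse by the number in $1.
    NR == FNR { maxlen[$1] = $2; amt[$1] = $3; pulse[$1] = $4; next }

    # Return the shortest stored number that s starts with and whose
    # length limit s satisfies, or "" if there is none.
    function find_prefix(s,    n, i, p) {
        n = length(s)
        for (i = 1; i <= n; i++) {
            p = substr(s, 1, i)
            if ((p in maxlen) && n <= maxlen[p]) return p
        }
        return ""
    }

    # Second file: try the number as given, then with "91" in front.
    {
        key = find_prefix($1)
        if (key == "") key = find_prefix("91" $1)
        if (key != "") print $1, amt[key], pulse[key]
        else           print $1, "not found"
    }' "$1" "$2"
}
```

Each MDN.TXT line then costs at most a few dozen hash lookups (one per leading substring, for two candidate strings) rather than a pass over all 10k entries. Unlike the original loop, it reports only one matching entry per line.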

Don Cragun

10-26-2015

Thanks, Don.

siramitsharma
