Efficient way to search array in text file by awk


 
# 1  
Old 10-21-2015
I have an array SPLNO with approximately 10k numbers. I want to look up each subscriber number from the MDN.TXT file (approximately 1.5 lakh, i.e. 150,000, records) in that array. If the subscriber number is found in the array, the operation below is performed. My issue is that it takes a long time, because for each number it searches the whole array of 10k records; for 1.5 lakh records that is about 1.5 lakh * 10k iterations. Please suggest more efficient ways.

Sample SPLNO.TXT:
Code:
918542054921|30|1|2
918542144944|12|1|2
854215595|12|1|2
918542166966|12|1|2
854225595|12|1|2
918542355955|12|1|2
918542455955|12|1|2
918542555955|12|1|2
918542955955|12|1|2

Sample MDN.TXT:
Code:
8542166966
8542355955
8542555955

The code is:
Code:
awk -F"|" 'FNR==1 { ++counter }
# First file (SPLNO.TXT): store pulse, amount and maximum length per number
counter==1 { SPLNOPULSE[$1]=$4; SPLNOAMT[$1]=$3; SPLNOMAXLEN[$1]=$2; next }
# Second file (MDN.TXT): test each line against every stored number
{
    for (mdn in SPLNOMAXLEN)
    {
        if ( ($1 ~ "^"mdn && length($1) <= SPLNOMAXLEN[mdn]) || ("91"$1 ~ "^"mdn && length("91"$1) <= SPLNOMAXLEN[mdn]) )
            print "found"
        else
            print "not found"
    }
}' SPLNO.TXT MDN.TXT

# 2  
Old 10-21-2015
If you want to find one of 10k items in a large file, there's no straightforward, easy way to avoid comparing each item against each line in the file. What you can do is break out of the loop once a match is found.

You could try a binary search algorithm.
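
A minimal sketch of the "break when found" idea, reusing the sample data from post #1. The shortened array names, the temporary directory, the output format (number followed by fields 3 and 4) and the last MDN.TXT line (a made-up number, there only to show the not-found branch) are assumptions for illustration. The match condition is the one from post #1, unchanged.

```shell
# Recreate the sample files from post #1 in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/SPLNO.TXT" <<'EOF'
918542054921|30|1|2
918542144944|12|1|2
854215595|12|1|2
918542166966|12|1|2
854225595|12|1|2
918542355955|12|1|2
918542455955|12|1|2
918542555955|12|1|2
918542955955|12|1|2
EOF
cat > "$tmp/MDN.TXT" <<'EOF'
8542166966
8542355955
8542555955
7000000000
EOF

out=$(awk -F'|' '
# First file: remember max length, amount and pulse for each number.
NR == FNR { maxlen[$1] = $2; amt[$1] = $3; pulse[$1] = $4; next }
# Second file: scan the array, but stop at the first match.
{
    hit = ""
    with91 = "91" $1                 # computed once per line, not per key
    for (key in maxlen)
        if (($1 ~ "^" key && length($1) <= maxlen[key]) ||
            (with91 ~ "^" key && length(with91) <= maxlen[key])) {
            hit = key
            break                    # no need to test the remaining keys
        }
    if (hit != "") print $1 "," amt[hit] "," pulse[hit]
    else           print $1 ",not found"
}' "$tmp/SPLNO.TXT" "$tmp/MDN.TXT")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Breaking early only helps for numbers that do match; a number with no match still costs a full pass over the array, and each line now produces one result instead of one per array entry.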
# 3  
Old 10-21-2015
You could also explain what the SPLNOMAXLEN[] values are trying to accomplish. If you were trying to perform an exact match against one of two numbers ($1 or "91"$1), instead of a string match at the start followed by a string length comparison, it would be tremendously faster.

And, please explain why printing found or not found 1.5 billion times with no indication of what was or was not found is going to be useful to anyone. This appears to be a useless exercise. And, in that case, why does the speed matter?

But, showing us the output you're hoping to produce from those sample input files might help us understand what you're trying to do.

Note that if 1/3 of your MDN.TXT lines are duplicates (as they are in your sample), you might speed things up considerably by getting rid of the duplicates before running the search loop. That alone would probably save you about 25% on your script's running time.
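
A rough sketch of what the exact-match variant could look like, assuming an exact match on $1 or "91"$1 is acceptable (post #1 does not say it is). awk arrays are associative, so the `in` test is a single hash lookup rather than a loop over 10k keys. The `result` cache, the shortened array names, the repeated MDN.TXT line and the made-up number 7000000000 are illustrative additions, not part of the original files.

```shell
# Sample SPLNO.TXT from post #1; MDN.TXT gets one repeated and one made-up line.
tmp=$(mktemp -d)
cat > "$tmp/SPLNO.TXT" <<'EOF'
918542054921|30|1|2
918542144944|12|1|2
854215595|12|1|2
918542166966|12|1|2
854225595|12|1|2
918542355955|12|1|2
918542455955|12|1|2
918542555955|12|1|2
918542955955|12|1|2
EOF
cat > "$tmp/MDN.TXT" <<'EOF'
8542166966
8542355955
8542555955
8542355955
7000000000
EOF

out=$(awk -F'|' '
# First file: index amount and pulse by number.
NR == FNR { amt[$1] = $3; pulse[$1] = $4; next }
# Second file: one or two hash lookups per distinct number.
{
    if (!($1 in result)) {           # work is done once per distinct number
        if ($1 in amt)              key = $1
        else if (("91" $1) in amt)  key = "91" $1
        else                        key = ""
        if (key != "") result[$1] = amt[key] "," pulse[key]
        else           result[$1] = "not found"
    }
    print $1 "," result[$1]          # duplicates reuse the cached answer
}' "$tmp/SPLNO.TXT" "$tmp/MDN.TXT")
printf '%s\n' "$out"
rm -rf "$tmp"
```

The cache keeps duplicate lines in the output while computing each distinct number only once. With plain hash lookups the saving is small; it matters more when the per-number work is a loop.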
# 4  
Old 10-24-2015
Hi Don,
Please find below the answers to the queries raised:

SPLNOMAXLEN[] is for checking the maximum length of the input string, i.e. the number from MDN.TXT. I am not trying for an exact match against one of two numbers, because the input string may or may not begin with "91", so these conditions are used:
Code:
$1 ~ "^"mdn

and
Code:
"91"$1 ~ "^"mdn

I will be carrying out further steps depending on whether a number is found or not found, for example populating fields from
Code:
SPLNOPULSE[mdn] SPLNOAMT[mdn]

The found and not-found cases will each be handled accordingly.

So the output for found numbers will be:
Code:
8542355955,1,2
8542555955,1,2

MDN.TXT will definitely contain duplicate lines as per the requirement, and I cannot avoid that, but SPLNO.TXT will certainly not contain duplicates.

Please let me know whether the processing time can be reduced.
# 5  
Old 10-24-2015
Do the computations whose result does not change inside the loop before the loop;
that might save a few CPU cycles.
Code:
len1=length($1)
lenL1=length(L1="91"$1)
for ( mdn in SPLNOMAXLEN)
        {
         if ( ($1 ~ "^"mdn && len1 <=SPLNOMAXLEN[mdn]) || (L1 ~ "^"mdn && lenL1 <=SPLNOMAXLEN[mdn]) )

# 6  
Old 10-26-2015
Hi siramitsharma,
Forum rules prohibit sending private messages asking people to respond to your posts.

Note that with the two lines in SPLNO.TXT:
Code:
918542054921|30|1|2
854215595|12|1|2

any of the following lines in MDN.TXT:
Code:
918542054921
91854205492
9185420549
918542054
91854205
9185420
918542
91854
9185
918
91
9
8542054921
854205492
85420549
8542054
854205
85420
8542
854
85
8

would match the 1st line. And any of the following lines in MDN.TXT:
Code:
854215595
85421559
8542155
854215
85421
8542
854
85
8

would match the 2nd line.
So maybe you could pre-build a table of the values that could match entries from the first file. Then, instead of performing lots of matches and comparisons in a loop, you could just test if((mdn in table) || ("91"mdn in table)) instead of the four slower tests you are currently using to see if there is a match.
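
A hedged sketch of a table-lookup approach. It rests on one reading of the condition in post #1: taken literally, `$1 ~ "^"mdn` is true when the SPLNO.TXT number is a prefix of the MDN.TXT line, so this version looks up each leading substring of the input line (and of "91" plus the line) in the array and applies the same length check. That replaces about 10k regex matches per line with at most a few dozen hash lookups. If the intended matching runs the other way, the table would need to be built differently. The `lookup` function, the shortened array names and the last two MDN.TXT lines (made-up numbers of 12 and 13 digits) are assumptions for illustration.

```shell
# Sample SPLNO.TXT from post #1; the last two MDN.TXT lines are made up.
tmp=$(mktemp -d)
cat > "$tmp/SPLNO.TXT" <<'EOF'
918542054921|30|1|2
918542144944|12|1|2
854215595|12|1|2
918542166966|12|1|2
854225595|12|1|2
918542355955|12|1|2
918542455955|12|1|2
918542555955|12|1|2
918542955955|12|1|2
EOF
cat > "$tmp/MDN.TXT" <<'EOF'
8542166966
8542355955
8542555955
854215595123
8542155951234
EOF

out=$(awk -F'|' '
# Return the stored number that is a prefix of s and whose maximum
# length allows s, or "" if there is none. Longest prefix is tried first.
function lookup(s,    i, p) {
    for (i = length(s); i >= 1; i--) {
        p = substr(s, 1, i)
        if ((p in maxlen) && length(s) <= maxlen[p])
            return p
    }
    return ""
}
# First file: remember max length, amount and pulse for each number.
NR == FNR { maxlen[$1] = $2; amt[$1] = $3; pulse[$1] = $4; next }
# Second file: try the line as given, then with "91" in front.
{
    key = lookup($1)
    if (key == "") key = lookup("91" $1)
    if (key != "") print $1 "," amt[key] "," pulse[key]
    else           print $1 ",not found"
}' "$tmp/SPLNO.TXT" "$tmp/MDN.TXT")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Here 854215595123 is reported as found because it starts with 854215595 and has 12 digits, while the 13-digit 8542155951234 exceeds the stored maximum of 12. Unlike the regex test, the lookup compares literal strings, which gives the same result as long as the numbers contain only digits.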

Since every entry in the input has the final two fields with the values 1 and 2, respectively, all we need to know is whether or not there is a match; not what values appear if there is a match. This is important because both entries above match lines in MDN.TXT containing the values:
Code:
8542
854
85
8

You also need to explain why the 2nd field in the 1st line above has the value 30. Since field 1 is 12 characters (918542054921), the longest possible string that can be matched is 12 characters. And, on the 2nd line we have a 1st field (854215595) with length 9 and a second field containing 12. So, I repeat, what use is the 2nd field in SPLNO.TXT other than to give you two more tests to slow down your loop?
# 7  
Old 10-26-2015
Thanks Don