Faster Line by Line String/Date Comparison of 2 Files


 
# 1  
Old 07-09-2012

Hello,

I was wondering if anyone knows a faster way to search and compare strings and dates from 2 files?
I'm currently using a "for loop" approach, but it seems sluggish as I have to cycle through 10 directories with 10 files each, and every file contains thousands of lines.


Given:
Code:
- 10 directories
- 10 files per directory (tab-delimited)
- 1 lookup file
- 1 report file

Here's what I'm trying to achieve:

Code:
1. loop through the 10 dirs, each with 10 files
2. for every line read from a file
   a. get the column 1 string and grep it in the lookup file, writing the output to a result file
   b. get the column 3 date and store it as vardate
   c. for every line of the result file from 2.a
      c.I   get the column 2 date as varstart
      c.II  get the column 3 date as varend
      c.III get column 7 as lastcol
      c.IV  check if vardate from 2.b is between varstart and varend
      c.V   if it is, write the line from 2 + varstart + varend + lastcol to the report file

Here is my straightforward solution so far; though it's working fine, it's neither elegant nor fast:

Code:
for dir in "$BASE"/*                              # the 10 dirs ($BASE is a placeholder for the parent path)
do
   for file in "$dir"/*                           # the 10 files
   do
      while read -r LINE
      do
         key=$(echo "$LINE" | cut -f1)            # column 1
         vardate=$(echo "$LINE" | cut -f3)        # column 3
         grep "$key" lookup_file > result_file
         while read -r ROW
         do
            varstart=$(echo "$ROW" | cut -f2)     # column 2
            varend=$(echo "$ROW" | cut -f3)       # column 3
            lastcol=$(echo "$ROW" | cut -f7)      # column 7
            if [[ $varstart -le $vardate && $varend -ge $vardate ]]
            then
               printf "%s\t%s\t%s\t%s\n" "$LINE" "$varstart" "$varend" "$lastcol" >> report_file
            fi
         done < result_file
      done < "$file"
   done
done

I was trying to replace the if statement with the following (as I have learned through research, awk is much faster at line-by-line processing):

Code:
awk '{varstart=$2; varend=$3; lastcol=$7; getline;} varstart <= vardate && varend >= vardate {print l,varstart,varend,lastcol}' l=${LINE} vardate=column3 resultfile >> reportfile

But I can't seem to get it to work properly. Plus, I was wondering if I could also use awk for the parent while loop to be more efficient, but I have no idea whether it's even possible to use another awk within an awk statement.
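
For reference, here is roughly what I think the inner step should look like once corrected: the comparison moves into the awk pattern (no getline) and the shell values are passed in with -v. This is only an untested sketch, assuming tab-delimited fields and dates in a sortable YYYYMMDD-style format:

Code:
awk -F'\t' -v line="$LINE" -v vardate="$vardate" '
    $2 <= vardate && vardate <= $3 { print line "\t" $2 "\t" $3 "\t" $7 }
' result_file >> report_file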

My goal is just to make this script run faster. Any suggestions or alternative approaches are much appreciated.

Thank you very much, guys.

# 2  
Old 07-09-2012
How big is your lookup file?

Quote:
I have to cycle through 10 directories with 10 files each, and every file contains thousands of lines.
Can you give approximate line counts? How many thousands?


Also, tell us about the O/S and shell version.

# 3  
Old 07-09-2012
Quote:
Originally Posted by clx
How big is your lookup file?

Can you give approximate line counts? How many thousands?


Also, tell us about the O/S and shell version.
Hi clx, thanks for the interest and apologies for the late reply.
I'm using HP-UX, /bin/ksh.

For the files, we're averaging 300,000 lines per file, times 10 files (gzipped), times 10 dirs, hence the need for faster processing.

My lookup file is an isql output, and is around 30,000 records.
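
(Side note: since the data files are gzipped, the idea is to stream them through gunzip -c / zcat rather than unpack them to disk first. A minimal sketch, with a made-up path layout, just to show the plumbing:)

Code:
# made-up layout: 10 dirs, each holding 10 gzipped, tab-delimited data files
for f in /some/parent/*/*.gz
do
   # stream the compressed file so the awk/grep processing reads from stdin
   gunzip -c "$f" | awk -F'\t' '{ print $1, $3 }'   # placeholder: key and date columns only
done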

I'm currently developing/testing an alternative: instead of embedding one while loop inside another, I will have just one and will sequentially do the following:

1. while loop to match the records from the file against the lookup.
   a. for every matched record, write the record from the file + the matching record from the lookup to a result file.

2. use an awk script to check whether vardate falls between varstart and varend (much easier, since awk will only be accessing one file).

But then again, other, faster alternatives are welcome.

Thanks.
# 4  
Old 07-10-2012
Try ..

Code:
 awk 'NR==FNR {r[$1]=$0;d_start[$1]=$2;d_end[$1]=$3;last_col[$1]=$7;next} ($1 in r) && d_start[$1] <= $3 && d_end[$1] >= $3 { print $0, d_start[$1], d_end[$1], last_col[$1]}' lookup_file *.files

I used multiple arrays, which could be avoided, but since your lookup file is not that big, it's worth trying.
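
For illustration, a single-array variant could look like this (untested; it relies on the default field splitting, so split() on the stored lookup line recovers the start date, end date and column 7):

Code:
# single array: store the whole lookup line, split it again only when a key matches
awk 'NR==FNR { r[$1] = $0; next }
     ($1 in r) { split(r[$1], f); if (f[2] <= $3 && f[3] >= $3) print $0, f[2], f[3], f[7] }' lookup_file *.files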

I see your files are not all in one place, so they need to be found first.

Code:
find /parent/path -type f | xargs awk 'NR==FNR {r[$1]=$0;d_start[$1]=$2;d_end[$1]=$3;last_col[$1]=$7;next} ($1 in r) && d_start[$1] <= $3 && d_end[$1] >= $3 { print $0, d_start[$1], d_end[$1], last_col[$1]}' lookup_file


I don't have a real system with me, so I can't test the performance on big files.

However, there could surely be other, more efficient approaches, which might come up soon.
# 5  
Old 07-10-2012
Quote:
Originally Posted by clx
Try ..

Code:
 awk 'NR==FNR {r[$1]=$0;d_start[$1]=$2;d_end[$1]=$3;last_col[$1]=$7;next} ($1 in r) && d_start[$1] <= $3 && d_end[$1] >= $3 { print $0, d_start[$1], d_end[$1], last_col[$1]}' lookup_file *.files

I used multiple arrays, which could be avoided, but since your lookup file is not that big, it's worth trying.

I see your files are not all in one place, so they need to be found first.

Code:
find /parent/path -type f | xargs awk 'NR==FNR {r[$1]=$0;d_start[$1]=$2;d_end[$1]=$3;last_col[$1]=$7;next} ($1 in r) && d_start[$1] <= $3 && d_end[$1] >= $3 { print $0, d_start[$1], d_end[$1], last_col[$1]}' lookup_file


I don't have a real system with me, so I can't test the performance on big files.

However, there could surely be other, more efficient approaches, which might come up soon.

Hi clx, thank you very much for taking the time to help with this.

I have developed similar awk code to yours as well. Posting it here, as it might help others too:

Code:
for file in "$BASE"/*/*               # placeholder: the 10 dirs with 10 data files each
do
   ### join: match column 1 of the data file against lookup_file and append the matching lookup record (a normal join), store to a temp file
   awk 'FNR==NR{a[$1]=$0} NR>FNR && ($1 in a){ print $0, a[$1] }' lookup_file "$file" > tempfile
   ### compare the lookup date (column 3 of the data file) with the start/end dates (columns 2 and 3 of the lookup, which become columns 9 and 10 after the join) and write to the report
   awk '{lookup=$3; start=$9; end=$10} start <= lookup && end >= lookup { print $0 }' tempfile >> reportfile
done

What the code does is join each record from the data file to the matching record from the lookup.
Columns 2/3 of the lookup therefore become columns 9/10 of the joined file, and since all the dates I need are now in one file, it is much easier for awk to compare them. (I also took away the step that appends the last column, since it is already available once the records are joined.)
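
To make the column positions concrete, here is a tiny made-up example (hypothetical values, tab-delimited) showing where the lookup's start/end dates land after the join:

Code:
# hypothetical sample data, purely to illustrate the field positions after the join
printf 'KEY123\tA\t20120705\tB\tC\tD\tLAST\n' > sample_file     # 7-column data line, date in $3
printf 'KEY123\t20120701\t20120731\n' > sample_lookup           # key, start date, end date
awk 'FNR==NR{a[$1]=$0} NR>FNR && ($1 in a){ print $0, a[$1] }' sample_lookup sample_file
# output (whitespace-separated view): KEY123 A 20120705 B C D LAST KEY123 20120701 20120731
#   so the lookup's start/end dates are now $9 and $10 for the second awk pass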

This took a big chunk of code out of my script, and it was able to breeze through 30 million lines in around 10 minutes, which is already tolerable.

Again, thanks!
