Can awk do lookups to other files and process results

10-23-2008


I know that 'brute-force' scripting could accomplish this with lots of cat/echo/cut/grep and more. But because my real file has 800k records and the matching files have 10-20k records, that approach would be far too slow to be practical.
I have input file:

Code:

> cat file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F   
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235        
Matt  242424  48 Speedway Dr4343    F   
Kerrin180118  99 Skaters Way2012    P  *

(You can ignore the first line; it is just a column ruler, since this is a fixed-width record file.)
(tail -n +2 file_in skips over this line during testing.)

Begin by reviewing only the records where position 40 is blank, meaning they still need to be processed.

I want to see those records that cannot be processed because (a) the data in columns 7-12 does not exist in the following file:

Code:

> cat file_cd1
040419
101362
180118
242424
789012
967539
988012

I know Joe does not match (123456 is not in file_cd1), so ideally I would like to put a "1" in position 39 to show that the record failed the first test.

A second test (b) is to process only records that are "abc", based on a lookup of columns 29-32 in the following file:

Code:

> cat file_cd2
0101 abc
1234 abc 
1235 ghi
2012 ghi
4343 ghi
9012 abc

Linda & Matt should then have a "2" put in position 39.

So, my start would be

Code:

awk 'substr($0,40,1)==" " {print}' file_in >file_out

This creates an output file containing only the records I want to consider, i.e. those not yet marked as processed. So yes, I intend to start with 6 records and end up with a file of 5 records. I now need to add those two codes at position 39 when appropriate.

joeyg


10-23-2008


Edit: actually you should jump to the second example. The first assumes that position 39 is always empty.

The code below also sets 1 for Linda, because her code (123456) is not present in the example file_cd1:

Code:

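awk 'NR == FNR { cd1[$1]; next }    # file_cd1: remember each valid code
f { cd2[$1] = $2; next }            # file_cd2 (f=1): map code to class
!f && / $/ {                        # file_in (f=0): records ending in a blank
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ")    # test (a) failed
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
    sub(/  $/, "2 ")                # test (b) failed and position 39 is still blank
  }1' file_cd1 f=1 file_cd2 f=0 file_in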

An example:

Code:

% awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ {
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ")
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc")
    sub(/  $/, "2 ")
  }1' file_cd1 f=1 file_cd2 f=0 file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F 1 
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235      1 
Matt  242424  48 Speedway Dr4343    F 2 
Kerrin180118  99 Skaters Way2012    P  *
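In case it is not obvious how one awk program tells the three inputs apart, here is a minimal sketch (not part of the command above) that only prints which rule each input line reaches. The tiny sample files, the temporary directory and the routing variable are made up for illustration.

```shell
# Hypothetical scratch files, just to show the routing
tmp=$(mktemp -d)
printf '%s\n' 040419 101362 > "$tmp/file_cd1"
printf '%s\n' '0101 abc' '1235 ghi' > "$tmp/file_cd2"
printf '%s\n' 'record A' 'record B' > "$tmp/file_in"

routing=$(awk '
  NR == FNR { print "cd1:  " $0; next }   # first file only: NR and FNR still agree
  f         { print "cd2:  " $0; next }   # f=1 is assigned just before file_cd2 is read
            { print "data: " $0 }         # f=0 is assigned just before file_in is read
' "$tmp/file_cd1" f=1 "$tmp/file_cd2" f=0 "$tmp/file_in")
printf '%s\n' "$routing"
```

The var=value operands between the file names are processed in order, at the moment awk reaches them, which is why f can act as a switch between the second and third file.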

If you want the second test to have precedence:

Code:

awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ { 
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ") 
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
    sub(/. $/, "2 ") 
  }1' file_cd1 f=1 file_cd2 f=0 file_in

For example:

Code:

% awk 'NR == FNR { cd1[$1]; next }
quote> f { cd2[$1] = $2; next }
quote> !f && / $/ { 
quote>   if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ") 
quote>   if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
quote>     sub(/. $/, "2 ") 
quote>   }1' file_cd1 f=1 file_cd2 f=0 file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F 1 
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235      2 
Matt  242424  48 Speedway Dr4343    F 2 
Kerrin180118  99 Skaters Way2012    P  *
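The only difference between the two variants is the pattern in the second sub(). A small sketch of that difference, using a made-up string that already carries a "1" in the flag position (the variable names a, b and out are only for the demo):

```shell
out=$(awk 'BEGIN {
  a = "Linda 1 "; b = a
  sub(/  $/, "2 ", a)   # needs two trailing blanks, so an existing "1" is kept
  sub(/. $/, "2 ", b)   # any character followed by a blank, so the "1" is overwritten
  print "[" a "]"
  print "[" b "]"
}')
printf '%s\n' "$out"
```

With the two-blank pattern the first test wins; with the dot pattern the second test wins.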

Last edited by radoulov; 10-24-2008 at 06:20 AM.. Reason: correction

radoulov


10-23-2008


Okay. Use associative arrays. This gives you three intermediate files (one.txt with a "1", two.txt with a "2", and three.txt with the ruler line and the records already marked in column 40) and then bad.txt, which holds the records that are still blank in columns 39 and 40.

Code:

awk ' FILENAME=="file_cd1" { cd1[$0]=$0}      # codes allowed in columns 7-12
      FILENAME=="file_cd2" { cd2[$1]=$2}      # columns 29-32 code -> class
      FILENAME=="inputfile" {
         if(FNR > 1 && substr($0,40,1)==" ")
         {
            if ( !(substr($0,7,6) in cd1) )    # test (a) failed
            {
                $0=substr($0,1,38) "1 "
                print $0 > "one.txt"
                next
            }
            else
            {
                if( cd2[substr($0, 29, 4)]!="abc")   # test (b) failed
                  { $0=substr($0,1,38) "2 "
                     print $0 > "two.txt"
                     next
                  }
            }
             print $0 > "bad.txt"; next
         }
         print $0 > "three.txt"

      } '  file_cd1 file_cd2 inputfile

Last edited by jim mcnamara; 10-23-2008 at 05:24 PM..
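One detail worth knowing when comparing the two approaches in this thread: testing cd2[key] != "abc" directly treats a code that is missing from file_cd2 as "not abc", and the mere reference creates an empty element for that key, whereas a (key in cd2) test does neither. A minimal sketch of that awk behaviour, with a made-up key "9999" and a hypothetical helper function nkeys:

```shell
counts=$(awk '
  function nkeys(arr,   k, n) { n = 0; for (k in arr) n++; return n }
  BEGIN {
    cd2["0101"] = "abc"
    if ("9999" in cd2) hit = 1            # membership test only: no new element
    before = nkeys(cd2)
    if (cd2["9999"] != "abc") miss = 1    # plain reference: creates cd2["9999"] = ""
    after = nkeys(cd2)
    print before, after
  }')
printf '%s\n' "$counts"
```

Which behaviour you want depends on whether a code that is absent from file_cd2 should count as failing the second test.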

jim mcnamara


10-23-2008


After re-reading your post and Jim's comments, I'm not sure whether you prefer to generate multiple files (good and bad records) or an output like the one I posted.

radoulov


10-23-2008


Thanks for the feedback

I would prefer all data, good and bad records alike, to be stored in one file.
While reading through my 'sed & awk' book, the idea of arrays did jump out at me. I am going to have to sit down and read through the examples to understand how they work.
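For reference, a minimal sketch of the single-file variant, under a few assumptions that go beyond what was posted: the first failing test wins, a code missing from file_cd2 counts as "not abc", and the fileno counter, the temporary directory and the file_out name are only illustrative. The two arrays are simply lookup tables keyed by the code.

```shell
tmp=$(mktemp -d)
printf '%s\n' 040419 101362 180118 242424 789012 967539 988012 > "$tmp/file_cd1"
printf '%s\n' '0101 abc' '1234 abc' '1235 ghi' '2012 ghi' '4343 ghi' '9012 abc' > "$tmp/file_cd2"
# %-40s pads every record with blanks to the full 40-column width
printf '%-40s\n' \
  '1234567890123456789012345678901234567890' \
  'Joe   123456  30 Main St    1234    F' \
  'Jim   101362  1492 Hugh     0101    P' \
  'Kerry 040419  6091 Lost St  0101    F' \
  'Linda 123456  50 High Way   1235' \
  'Matt  242424  48 Speedway Dr4343    F' \
  'Kerrin180118  99 Skaters Way2012    P  *' > "$tmp/file_in"

awk '
  FNR == 1 { fileno++ }                    # count which input file is being read
  fileno == 1 { cd1[$1]; next }            # lookup table 1: valid codes (keys only)
  fileno == 2 { cd2[$1] = $2; next }       # lookup table 2: code -> class
  FNR > 1 && substr($0, 40, 1) == " " {    # skip the ruler; unprocessed records only
    flag = " "
    if (!(substr($0, 7, 6) in cd1))           flag = "1"
    else if (cd2[substr($0, 29, 4)] != "abc") flag = "2"
    $0 = substr($0, 1, 38) flag substr($0, 40)   # splice the flag into position 39
  }
  { print }                                # every record ends up in the one output file
' "$tmp/file_cd1" "$tmp/file_cd2" "$tmp/file_in" > "$tmp/file_out"
cat "$tmp/file_out"
```

Each lookup file is read once into memory, so every data record costs only two array lookups, which is what makes this workable for 800k records against 10-20k codes.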

joeyg
