Can awk do lookups to other files and process results


 
# 1  
Old 10-23-2008

I know that 'brute-force' scripting could accomplish this with lots of cat/echo/cut/grep and more. But because my real file has 800k records and the matching files have 10-20k records, that approach would be far too slow to be practical.
I have input file:
Code:
> cat file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F   
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235        
Matt  242424  48 Speedway Dr4343    F   
Kerrin180118  99 Skaters Way2012    P  *

(you can ignore the first line - it is just a column ruler, since this is a fixed-width record file)
(tail +2 file_in, or tail -n +2 file_in in POSIX syntax, skips over this line during testing)

Begin by reviewing only records where position 40 is blank, meaning they still need to be processed.

I want to see those records that cannot be processed because (a) the data in columns 7-12 does not exist in the following file:
Code:
> cat file_cd1
040419
101362
180118
242424
789012
967539
988012

I know Joe does not match, so ideally I would like to put a "1" in position 39 telling me the record failed the first test.

A second test (b) is to only process records that are "abc" based on lookup of columns 29-32 into the following file:
Code:
> cat file_cd2
0101 abc
1234 abc 
1235 ghi
2012 ghi
4343 ghi
9012 abc

Linda & Matt should then have a "2" put in position 39.

So, my start would be
Code:
awk 'substr($0,40,1)==" " {print}' file_in >file_out

which would create an output file containing only the records that are not yet marked as processed. So yes, I intend to start with 6 records and produce a file of 5 records. I now need to add those two codes at position 39 when appropriate.
# 2  
Old 10-23-2008
Edit: actually you should jump to the second example. The first assumes that position 39 is always empty.

The code below also sets 1 for Linda, because her code (123456) is not present in the example file_cd1:

Code:
awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ { 
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ") 
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
    sub(/  $/, "2 ") 
  }1' file_cd1 f=1 file_cd2 f=0 file_in

An example:

Code:
% awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ {
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ")
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc")
    sub(/  $/, "2 ")
  }1' file_cd1 f=1 file_cd2 f=0 file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F 1 
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235      1 
Matt  242424  48 Speedway Dr4343    F 2 
Kerrin180118  99 Skaters Way2012    P  *
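
The command above packs several idioms into a few lines. As a rough, commented sketch of the same logic: the scratch directory, the printf-based recreation of the sample files from post #1, and the shell variable out are assumptions added purely so the snippet runs on its own.

```shell
# Recreate the sample files from post #1 in a scratch directory (illustrative only).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 040419 101362 180118 242424 789012 967539 988012 > file_cd1
printf '%s\n' '0101 abc' '1234 abc ' '1235 ghi' '2012 ghi' '4343 ghi' '9012 abc' > file_cd2
# %-40s pads every record with blanks to the fixed width of 40 characters.
printf '%-40s\n' \
  '1234567890123456789012345678901234567890' \
  'Joe   123456  30 Main St    1234    F' \
  'Jim   101362  1492 Hugh     0101    P' \
  'Kerry 040419  6091 Lost St  0101    F' \
  'Linda 123456  50 High Way   1235' \
  'Matt  242424  48 Speedway Dr4343    F' \
  'Kerrin180118  99 Skaters Way2012    P  *' > file_in

out=$(awk '
  NR == FNR { cd1[$1]; next }        # first file only: remember each valid code
  f         { cd2[$1] = $2; next }   # f=1 is assigned just before file_cd2 is read
  / $/ {                             # f=0 by now: records with a blank column 40
    if (!(substr($0, 7, 6) in cd1))
      sub(/. $/, "1 ")               # test (a) failed: overwrite columns 39-40
    if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc")
      sub(/  $/, "2 ")               # test (b) failed: applies only if column 39 is still blank
  }
  1                                  # print every record, changed or not
' file_cd1 f=1 file_cd2 f=0 file_in)
printf '%s\n' "$out"
```

The f=1 and f=0 operands are ordinary awk command-line variable assignments, which take effect when awk reaches them in the argument list. That is how the same program can tell the second lookup file apart from the data file.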

If you want the second test to have precedence:

Code:
awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ { 
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ") 
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
    sub(/. $/, "2 ") 
  }1' file_cd1 f=1 file_cd2 f=0 file_in

For example:

Code:
% awk 'NR == FNR { cd1[$1]; next }
f { cd2[$1] = $2; next }
!f && / $/ { 
  if (!(substr($0, 7, 6) in cd1)) sub(/. $/, "1 ") 
  if ((substr($0, 29, 4) in cd2) && cd2[substr($0, 29, 4)] != "abc") 
    sub(/. $/, "2 ") 
  }1' file_cd1 f=1 file_cd2 f=0 file_in
1234567890123456789012345678901234567890
Joe   123456  30 Main St    1234    F 1 
Jim   101362  1492 Hugh     0101    P   
Kerry 040419  6091 Lost St  0101    F   
Linda 123456  50 High Way   1235      2 
Matt  242424  48 Speedway Dr4343    F 2 
Kerrin180118  99 Skaters Way2012    P  *
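
Both versions rely on sub() matching the last two characters of the record, so they depend on each record ending exactly at column 40. A possible alternative, sketched here under the assumption that every record is at least 40 characters wide, writes the flag into column 39 by position instead. The helper name mark and the variables tmp and marked are hypothetical; the sample files are recreated only so the snippet is self-contained. As in the second example, test (b) takes precedence.

```shell
# Recreate the sample files from post #1 in a scratch directory (illustrative only).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 040419 101362 180118 242424 789012 967539 988012 > file_cd1
printf '%s\n' '0101 abc' '1234 abc ' '1235 ghi' '2012 ghi' '4343 ghi' '9012 abc' > file_cd2
printf '%-40s\n' \
  '1234567890123456789012345678901234567890' \
  'Joe   123456  30 Main St    1234    F' \
  'Jim   101362  1492 Hugh     0101    P' \
  'Kerry 040419  6091 Lost St  0101    F' \
  'Linda 123456  50 High Way   1235' \
  'Matt  242424  48 Speedway Dr4343    F' \
  'Kerrin180118  99 Skaters Way2012    P  *' > file_in

marked=$(awk '
  # Hypothetical helper: return rec with column 39 replaced by code.
  function mark(rec, code) { return substr(rec, 1, 38) code substr(rec, 40) }

  NR == FNR { cd1[$1]; next }
  f         { cd2[$1] = $2; next }
  substr($0, 40, 1) == " " {                       # only unprocessed records
    if (!(substr($0, 7, 6) in cd1)) $0 = mark($0, "1")
    key = substr($0, 29, 4)
    if ((key in cd2) && cd2[key] != "abc") $0 = mark($0, "2")   # overrides a "1"
  }
  1
' file_cd1 f=1 file_cd2 f=0 file_in)
printf '%s\n' "$marked"
```

With the sample data this should flag Joe with 1 and both Linda and Matt with 2, the same result as the second example.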


Last edited by radoulov; 10-24-2008 at 06:20 AM.. Reason: correction
# 3  
Old 10-23-2008
Okay. Use associative arrays. This writes four files: one.txt (records marked "1"), two.txt (records marked "2"), three.txt (the header line and records that were already processed), and bad.txt (records that are still blank in columns 39 and 40).


Code:
awk ' FILENAME=="file_cd1" { cd1[$1]; next }        # valid codes for columns 7-12
      FILENAME=="file_cd2" { cd2[$1]=$2; next }     # columns 29-32 -> category
      FILENAME=="file_in" {
         if (FNR > 1 && substr($0,40,1)==" ")
         {
            if (!(substr($0,7,6) in cd1))           # test (a) failed
            {
                $0=substr($0,1,38) "1 "
                print $0 > "one.txt"
                next
            }
            if (cd2[substr($0,29,4)]!="abc")        # test (b) failed, or key not in file_cd2
            {
                $0=substr($0,1,38) "2 "
                print $0 > "two.txt"
                next
            }
            print $0 > "bad.txt"                    # still blank in columns 39-40
            next
         }
         print $0 > "three.txt"                     # header line or already processed
      } '  file_cd1 file_cd2 file_in


Last edited by jim mcnamara; 10-23-2008 at 05:24 PM..
# 4  
Old 10-23-2008
After re-reading your post and Jim's comments I'm not sure if you prefer to generate multiple files (good - bad records) or an output like the one I posted.
# 5  
Old 10-23-2008
Thanks for the feedback

I would prefer all data - good and bad records - stored to one file.
While reading through my 'sed & awk' book, the idea of arrays did jump out at me. I am going to have to sit down and read through the examples to understand how they work.
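
For anyone else working through the array idea, here is a minimal sketch of the two operations the replies above depend on: storing a value under a string key, and testing membership with in. The two sample lines are taken from file_cd2; the names kind, n and result are illustrative only.

```shell
result=$(printf '%s\n' '0101 abc' '1235 ghi' | awk '
  { kind[$1] = $2 }                  # key: first field, value: second field
  END {
    if ("0101" in kind)    print "0101 ->", kind["0101"]
    if (!("9999" in kind)) print "9999 not found"
    # Testing kind["9999"] == "" instead would silently create an empty entry;
    # the in operator checks for the key without adding it.
    n = 0
    for (k in kind) n++
    print n, "keys stored"
  }')
printf '%s\n' "$result"
```

This should print "0101 -> abc", "9999 not found" and "2 keys stored". Because each lookup file is loaded into memory once and every check is a single array lookup, the large file only has to be read one time.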