Awk: Performing "for" loop within text block with two files

01-22-2019


I am hoping to pull multiple strings from one file and use them to search within a block of text within another file.

File 1

Code:

PS001,001 HLK
PS002,004 MWQ
PS004,002 RXM
PS004,006 DBX
PS004,006 SBR
PS005,007 ML
PS005,009 DBR
PS005,011 MR
PS005,012 SBR
PS006,003 RXM
PS006,003 >SJ
PS006,010 QBL

File 2

Code:

 PS001,001 [VWB-WHJ <Su>] [L-GBR> <PC>]
 Lexeme     VWB HJ==     # L GBR       #
 PhraseType  2(2.1,7) 5(5,2.3)
 PhraseLab  502[0]         521[0]
 ClauseType NmCl

 PS001,001 [D-<Re>] [B->WRX> D-<WL> <Co>] [L> <Ng>] [HLK <Pr>]
 Lexeme     D      # B >WRX D <WL        # L>      # HLK      #
 PhraseType  6(6) 5(5,2.3,5,2.3) 11(11) 1(1:2)
 PhraseLab  519[0]   504[0]                510[0]    501[0]
 ClauseType xQt0 
 
 PS002,004 [W-<Cj>] [MRJ> <Su>] [NMJQ <Pr>] [B-HWN <Co>]
 Lexeme     W      # MRJ>      # MWQ       # B HWN=     #
 PhraseType  6(6) 3(3.2) 1(1:1) 5(5,7)
 PhraseLab  509[0]   502[0]      501[0]      504[0]
 ClauseType WXYq

 PS002,005 [HJ DJN <Mo>] [NMLL <Pr>] [<LJ-HWN <Co>] [B-RWGZ-H <Aj>]
 Lexeme     HJ= DJN=    # ML        # <L HWN=      # B RWGZ H      #
 PhraseType  4(8,4) 1(1:1) 5(5,7) 5(5,2.1,7)
 PhraseLab  508[0]        501[0]      504[0]         505[0]
 ClauseType xYq0

 PS005,012 [D-<Re>] [MSBRJN <PC>] [B-K <Co>]
 Lexeme     D      # SBR         # B K      #
 PhraseType  6(6) 1(1:6.2) 5(5,7)
 PhraseLab  519[0]   521[0]        504[0]
 ClauseType Ptcp

 PS005,012 [W-<Cj>] [L-<LM <Ti>] [NCBXWN-<Pr>] [K <Ob>]
 Lexeme     W      # L <LM      # CBX         # K      #
 PhraseType  6(6) 5(5,2.2) 1(1:1) 7(7)
 PhraseLab  509[0]   506[0]       501[0]        503[0]
 ClauseType WxY0 PS005,013 [>JK SKR> MQBLT> <Aj>] [T<VP-<Pr>] [NJ <Ob>]
 Lexeme     >JK SKR QBL          # <VP       # NJ      #
 PhraseType  5(5,2.3,13:62.3) 1(1:1) 7(7)
 PhraseLab  505[0]                 501[0]      503[0]
 ClauseType xYq0

 PS006,002 [MRJ> <Vo>]
 Lexeme     MRJ>      #
 PhraseType  3(3.2)
 PhraseLab  562[0]
 ClauseType Voct

 PS006,002 [L> <Ng>] [B-RWGZ-K <Aj>] [TKS-<Pr>] [NJ <Ob>]
 Lexeme     L>      # B RWGZ K      # KS       # NJ      #
 PhraseType  11(11) 5(5,2.1,7) 1(1:1) 7(7)
 PhraseLab  510[0]    505[0]          501[0]     503[0]
 ClauseType xYq0

My hope was that when $1 of File 1 matches $1 in File 2, $0 in File 2 contains the string "<Co>", and $2 of File 1 matches a string *exactly* in File 2 on a line beginning with the word "Lexeme," then print.

Thus, my desired output would look like this:

Code:

 PS001,001 [D-<Re>] [B->WRX> D-<WL> <Co>] [L> <Ng>] [HLK <Pr>]
 Lexeme     D      # B >WRX D <WL        # L>      # HLK      #
 PhraseType  6(6) 5(5,2.3,5,2.3) 11(11) 1(1:2)
 PhraseLab  519[0]   504[0]                510[0]    501[0]
 ClauseType xQt0 

 PS002,004 [W-<Cj>] [MRJ> <Su>] [NMJQ <Pr>] [B-HWN <Co>]
 Lexeme     W      # MRJ>      # MWQ       # B HWN=     #
 PhraseType  6(6) 3(3.2) 1(1:1) 5(5,7)
 PhraseLab  509[0]   502[0]      501[0]      504[0]
 ClauseType WXYq

 PS005,012 [D-<Re>] [MSBRJN <PC>] [B-K <Co>]
 Lexeme     D      # SBR         # B K      #
 PhraseType  6(6) 1(1:6.2) 5(5,7)
 PhraseLab  519[0]   521[0]        504[0]
 ClauseType Ptcp

With the following code I am able to meet two of the three criteria listed above: I can match $1 of File 1 with $1 of File 2, and I can detect when $0 of File 2 contains the string "<Co>". However, I am having difficulty with the last criterion, viz., matching $2 of File 1 exactly against a string in File 2 when the line begins with "Lexeme."

Code:

NR==FNR   {A[$1]
           B[$2]
           next
          }
/^ Cl/   {if (PR1 && PR2 && PR3)   {print"\n" BUF
                                    print
                             }
           PR1 = PR2 = PR3 = 0
           BUF = ""
           next
          }

          {BUF = BUF (BUF?ORS:_) $0
           if ($1 in A) PR1 = 1
           if ($0 ~/\<Co\>/) PR2 = 1
           for (b in B) if($0 ~ b) PR3 = 1
          }

I have also tried:

Code:

NR==FNR   {A[$1]
           B[$2]
           next
          }
/^ Cl/   {if (PR1 && PR2 && PR3)   {print"\n" BUF
                                    print
                             }
           PR1 = PR2 = PR3 = 0
           BUF = ""
           next
          }

          {BUF = BUF (BUF?ORS:_) $0
           if ($1 in A) PR1 = 1
           if ($0 ~/\<Co\>/) PR2 = 1
          if ($1 ~/ ^L/ && $0 in B) PR3 = 1
          }

I think there might be something wrong with the way that I'm defining the "B" array with $2 of File 1 or defining the "for" loop in the script. Thank you so much in advance for your help.

Last edited by jvoot; 01-22-2019 at 10:45 AM.. Reason: Highlighted problem areas in script example.

jvoot


01-22-2019


Since you have multiple lines in File 1 with the same $1 values and different $2 values, there are a lot of similarities between the awk code needed to solve your problem and the problem that cmccabe presented to us three days ago: awk to add text to each line of matching id

Have you looked at the awk code that was provided to help cmccabe in that thread to see if it might help you solve your problem?


Don Cragun


01-22-2019


Thank you so much Don. I'll take a look.

--- Post updated 01-22-19 at 07:06 AM ---

I read through that thread, Don, and while I must admit that I do not follow all of it, it seems to be a bit different from what I am asking here. My problem is not so much dealing with repeated values, but rather getting that "for" loop to work in my awk script.

To isolate the problem further: getting $2 in File 1 to match a string in File 2 exactly, and only on lines that begin with "Lexeme" (~/ ^L/), is giving me fits. I have highlighted the relevant line in each of the two example *.awk scripts in red font. In principle something similar to awk 'FNR==NR{B[$2]; next}{for (b in B) if ($0 ~b) print} ' File[12] should do the trick, but relative to this particular issue: (A) I'm having difficulty getting something like that to work in my script; and (B) it does not produce exact matches, but rather treats the values of $2 in File 1 as patterns that may match anywhere in a line (rather than, say, $2 ~/^pattern$/). Thus for the line in File 1 that reads PS005,007 ML I'll get matches such as MLL> MLT> TQML, etc.

Hopefully that homes in on the particular impasse I'm experiencing.

jvoot


01-22-2019


You mean loop through the fields and look each one up in the array, like this?

Code:

  if ($1 == "Lexeme") {
    for (b=2; b<=NF; b++) if ($b in B) PR3 = 1
  }
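To illustrate the difference between the regex test in the original script and the field lookup above, here is a small sketch. The Lexeme line is made up for the demonstration (it is not taken from File 2), and B holds only the File 1 value ML:

```shell
# Made-up Lexeme line: it contains MLL> and TQML but not the word ML itself.
line=' Lexeme     W      # MLL>      # TQML      #'

# Regex test, as in the original script: ML is found inside MLL> and TQML.
regex=$(printf '%s\n' "$line" | awk '
    BEGIN { B["ML"] }
    {
        for (b in B)
            if ($0 ~ b) { print "match"; next }
        print "no match"
    }')

# Field lookup: a whitespace-separated word must equal an index of B.
exact=$(printf '%s\n' "$line" | awk '
    BEGIN { B["ML"] }
    $1 == "Lexeme" {
        for (i = 2; i <= NF; i++)
            if ($i in B) { print "match"; next }
    }
    { print "no match" }')

echo "regex: $regex"    # regex: match
echo "exact: $exact"    # exact: no match
```

The `in` operator compares whole strings, so no anchoring with ^ and $ is needed, and characters in the File 1 values are never interpreted as regex operators.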

Also, do you want to search for <Co> in all the records, or only in the records whose $1 matched? In the latter case it should be

Code:

  if (($1 in A) && /<Co>/) PR1 = 1

or (again) loop through the fields and compare each field:

Code:

  if ($1 in A) {
    # with default field splitting the tag arrives as "<Co>]", closing bracket included
    for (b=2; b<=NF; b++) if ($b == "<Co>]") PR1 = 1
  }


MadeInGermany


01-22-2019


The suggestions MadeInGermany provided you will work for your stated requirements.

But when I look at your examples, it seems that your requirements might be more stringent than what you have stated. If we look at your sample File 1:

Code:

PS001,001 HLK
PS002,004 MWQ
PS004,002 RXM
PS004,006 DBX
PS004,006 SBR
PS005,007 ML
PS005,009 DBR
PS005,011 MR
PS005,012 SBR
PS006,003 RXM
PS006,003 >SJ
PS006,010 QBL

and the output you say you are trying to produce:

Code:

PS001,001 [D-<Re>] [B->WRX> D-<WL> <Co>] [L> <Ng>] [HLK <Pr>]
 Lexeme     D      # B >WRX D <WL        # L>      # HLK      #
 PhraseType  6(6) 5(5,2.3,5,2.3) 11(11) 1(1:2)
 PhraseLab  519[0]   504[0]                510[0]    501[0]
 ClauseType xQt0 

 PS002,004 [W-<Cj>] [MRJ> <Su>] [NMJQ <Pr>] [B-HWN <Co>]
 Lexeme     W      # MRJ>      # MWQ       # B HWN=     #
 PhraseType  6(6) 3(3.2) 1(1:1) 5(5,7)
 PhraseLab  509[0]   502[0]      501[0]      504[0]
 ClauseType WXYq

 PS005,012 [D-<Re>] [MSBRJN <PC>] [B-K <Co>]
 Lexeme     D      # SBR         # B K      #
 PhraseType  6(6) 1(1:6.2) 5(5,7)
 PhraseLab  519[0]   521[0]        504[0]
 ClauseType Ptcp

I note that each of the selected output line groups does not just have a line 1 $1 value that is in your A[] array and a word in a line that starts with Lexeme that is in your B[] array; it has a matched pair where the word matched in the Lexeme line had to be from a line in File 1 that had a $1 value that matched $1 in that 1st line.

The requirements you stated do not require that both of those values found in a group of lines in File 2 come from a single line in File 1. But, in each of your sample output groups of lines, both values came from the same line in File 1.
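A small sketch of that difference, using two lines from File 1. The combination tested (the reference from one line with the word from the other) is made up for the demonstration, and the variable names are only illustrative:

```shell
# Two lines from File 1 in this thread.
keys='PS005,007 ML
PS005,012 SBR'

# Made-up combination: reference from the first line, word from the second.
ref='PS005,007'
word='SBR'

# Two independent arrays: each half is found, so the pair is accepted.
separate=$(printf '%s\n' "$keys" | awk -v r="$ref" -v w="$word" '
    { A[$1]; B[$2] }
    END { if ((r in A) && (w in B)) print "selected"; else print "rejected" }')

# One array indexed by both fields: the pair must come from a single line.
paired=$(printf '%s\n' "$keys" | awk -v r="$ref" -v w="$word" '
    { Keys[$1, $2] }
    END { if ((r, w) in Keys) print "selected"; else print "rejected" }')

echo "separate arrays: $separate"    # separate arrays: selected
echo "paired key:      $paired"      # paired key:      rejected
```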

Am I reading too much into your example? Or are your requirements more stringent than what you stated?

If you do want the more stringent requirements (and I am correct in assuming that the $1 value from File 1 is supposed to appear only as $1 in the first line of a group of lines in File 2, and that the $2 from that same line in File 1 is required to appear in a line of that group starting with Lexeme), then you might want something more like:

Code:

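awk '
# First file (File 1): remember each ($1, $2) pair as a single key.
NR == FNR {
	Keys[$1, $2]
	next
}
# Print the buffered group if it was selected, then reset the per-group state.
function PrintGroup() {
	if(PrintThisGroup) {
		if(GroupsPrinted++)
			print ""	# empty line between printed groups
		printf("%s",  Group)
	}
	Group = KeyField1 = ""
	LinesInGroup = PrintThisGroup = 0
}
# An empty line ends the current group.
NF == 0 {
	PrintGroup()
	next
}
# Only a first line containing <Co> supplies a key to look up.
++LinesInGroup == 1 && /<Co>/ {
	KeyField1 = $1
}
# Buffer every line of the group.
{	Group = Group $0 "\n"
}
# Select the group if any Lexeme word pairs with the key from its first line.
$1 == "Lexeme" {
	for(i = 2; i <= NF; i++)
		if((KeyField1, $i) in Keys) {
			PrintThisGroup = 1
			next
		}
}
# The last group may not be followed by an empty line.
END {	PrintGroup()
}' "File 1" "File 2"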


Don Cragun


01-22-2019


Reply deleted.

Last edited by jvoot; 01-22-2019 at 06:15 PM.. Reason: Significant error in description.

jvoot


01-22-2019


Quote:

Originally Posted by jvoot

... ... ...
Don, while your code seems to address the thrust of what I'm going for, I must admit that I have to study it much more to "get it." I have attempted to add a counter to the opening array with input from File 1, but to no avail. I would appreciate any help one could offer to address this latest wrinkle.

Hi jvoot,
The code in the other thread I pointed to in post #2 in this thread had to keep a counter of the number of lines in a file that had the same keys. As long as you treat both fields of a line from File 1 as a single key (the two subscripts I used in the Keys[] array), it should correctly deal with the problem you were having in your code. Your problem seems to be caused by the fact that you're using two arrays (A[], and B[]) instead of just one array to keep track of the fact that you need an entry from $1 in File 1 and an entry from $2 in File 1 that come from the same line.

Did you try running the code I suggested in post #5? Didn't it give you the output you want? If you tried the different portion of File 2 you showed us in post #6, didn't the code I suggested in post #5 still give you the output you wanted?

Since you have changed the code you originally showed us (based on MadeInGermany's suggestions) without showing us your modified code, I can't comment on what else might need to be done to make it work with your updated requirements.

Note that since your original sample File 2 contents included the "record":

Code:

 PS005,012 [W-<Cj>] [L-<LM <Ti>] [NCBXWN-<Pr>] [K <Ob>]
 Lexeme     W      # L <LM      # CBX         # K      #
 PhraseType  6(6) 5(5,2.2) 1(1:1) 7(7)
 PhraseLab  509[0]   506[0]       501[0]        503[0]
 ClauseType WxY0 PS005,013 [>JK SKR> MQBLT> <Aj>] [T<VP-<Pr>] [NJ <Ob>]
 Lexeme     >JK SKR QBL          # <VP       # NJ      #
 PhraseType  5(5,2.3,13:62.3) 1(1:1) 7(7)
 PhraseLab  505[0]                 501[0]      503[0]
 ClauseType xYq0

which contains two lines starting with Cl and two lines matching $1=="Lexeme", I chose to make my code trigger off of the empty line between records instead of off of the ClauseType keyword that appears on multiple lines in some of your records. That may have made my code more difficult to understand and it led me to create a function to print selected output "record"s since I assume that there is no empty line after the last "record" in File 2. Calling that function when NF==0 happens on the empty line between records and calling that function in the END clause handles the case of the last record in an input file not being followed by an empty line.
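The flush-on-empty-line idea can be seen in isolation in the following sketch. The input is made up (two short records, with no empty line after the last one), and the function name Flush is only illustrative:

```shell
# Made-up input: two blank-line-separated records, no empty line after the last.
input='a1
a2

b1
b2'

flushed=$(printf '%s\n' "$input" | awk '
    # Print the buffered record, if any, and start a new one.
    function Flush() { if (Group != "") print "[" Group "]"; Group = "" }
    NF == 0 { Flush(); next }    # empty line: the record is complete
    { Group = Group (Group == "" ? "" : " ") $0 }
    END { Flush() }              # the last record has no empty line after it
')

echo "$flushed"
# [a1 a2]
# [b1 b2]
```

Without the call in the END clause, the second record would be buffered but never printed.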

If my code does what you want with your new sample input file, I'll be happy to try to explain how it works. If it doesn't work for you, then I still don't understand what you're trying to do and I won't be able to help you until I can understand the requirements (i.e. specifications) for the software you're trying to write.

We could also look at the code Scrutinizer suggested, but we'd have to make it a little more complex to be sure that it looks for "<Co>" on the first line (and only on the first line) of each multi-line record and to be sure that it only looks for a match for field 2 from File 1 on lines that have "Lexeme" as the first word on a line (his code is currently looking for a match on any word in the entire multi-line record; not just on the "Lexeme" line).

Please understand that we know you have a non-computer science background and we respect that. But to write software that does what you want it to, a computer expects extremely picky details about the data it is asked to process to be clearly specified. If you can't do that, a computer will be very happy giving you results that match what you said you wanted instead of results that match what you really want. I tried to point out a couple of cases where what you said you wanted might not be what you really want, but only you can tell us whether some fields can appear anywhere or only in certain places, whether lines containing "Lexeme" can appear more than once (as in the example from your sample input above) or can only appear as the 2nd line of a multi-line "record" in your real data, etc., etc., etc.

P.S. If you'd like to play around with genes to find a cure for Type 1 diabetes, I'll be happy to write awk scripts for you to help you reach your goal.


Don Cragun

