Awk: Performing "for" loop within text block with two files


 
# 8  
Old 01-22-2019
Scrutinizer! That did it (or at least seems to have done it)! Your solution was so much simpler than what I was attempting. Thank you so very much!
# 9  
Old 01-22-2019
Quote:
Originally Posted by jvoot
... ... ...
Don, while your code seems to address the thrust of what I'm going for, I must admit that I have to study it much more to "get it." I have attempted to add a counter to the opening array with input from File 1, but to no avail. I would appreciate any help one could offer to address this latest wrinkle.
Hi jvoot,
The code in the other thread I pointed to in post #2 in this thread had to keep a counter of the number of lines in a file that had the same keys. As long as you treat both fields of a line from File 1 as a single key (the two subscripts I used in the Keys[] array), it should correctly deal with the problem you were having in your code. Your problem seems to be caused by the fact that you're using two arrays (A[] and B[]) instead of just one array to keep track of the fact that you need an entry from $1 in File 1 and an entry from $2 in File 1 that come from the same line.
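
To make the single-key point concrete, here is a minimal sketch. The sample pairs and the file name file1.sample are made up for illustration; they are not taken from your real File 1, and the two helper functions are hypothetical names used only to compare the two approaches.

```shell
# Hypothetical File 1: each line is a pair that must match together.
printf '%s\n' 'PS005,012 CBX' 'PS005,013 NJ' > file1.sample

# Two separate arrays forget which $1 went with which $2,
# so the unlisted pair "PS005,012 NJ" is wrongly accepted.
two_arrays() {
  awk 'NR==FNR { A[$1]; B[$2]; next }
       ($1 in A) && ($2 in B) { print "match:", $1, $2 }' file1.sample -
}

# One array with a combined subscript only accepts pairs
# that appeared together on one line of File 1.
one_array() {
  awk 'NR==FNR { Keys[$1,$2]; next }
       ($1,$2) in Keys { print "match:", $1, $2 }' file1.sample -
}

echo 'PS005,012 NJ' | two_arrays   # prints a (false) match
echo 'PS005,012 NJ' | one_array    # prints nothing
echo 'PS005,013 NJ' | one_array    # prints a match
```

The combined subscript is what keeps the two values tied to the line they came from.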

Did you try running the code I suggested in post #5? Didn't it give you the output you want? If you tried the different portion of File 2 you showed us in post #6, didn't the code I suggested in post #5 still give you the output you wanted?

Since you have changed the code you originally showed us (based on MadeInGermany's suggestions) without showing us your modified code, I can't comment on what else might need to be done to make it work with your updated requirements.

Note that since your original sample File 2 contents included the "record":
Code:
 PS005,012 [W-<Cj>] [L-<LM <Ti>] [NCBXWN-<Pr>] [K <Ob>]
 Lexeme     W      # L <LM      # CBX         # K      #
 PhraseType  6(6) 5(5,2.2) 1(1:1) 7(7)
 PhraseLab  509[0]   506[0]       501[0]        503[0]
 ClauseType WxY0 PS005,013 [>JK SKR> MQBLT> <Aj>] [T<VP-<Pr>] [NJ <Ob>]
 Lexeme     >JK SKR QBL          # <VP       # NJ      #
 PhraseType  5(5,2.3,13:62.3) 1(1:1) 7(7)
 PhraseLab  505[0]                 501[0]      503[0]
 ClauseType xYq0

which contains two lines starting with Cl and two lines matching $1=="Lexeme", I chose to make my code trigger off of the empty line between records instead of off of the ClauseType keyword that appears on multiple lines in some of your records. That may have made my code more difficult to understand and it led me to create a function to print selected output "record"s since I assume that there is no empty line after the last "record" in File 2. Calling that function when NF==0 happens on the empty line between records and calling that function in the END clause handles the case of the last record in an input file not being followed by an empty line.
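
As a rough illustration of that structure (a sketch with made-up data, a hypothetical function name print_group, and a hypothetical shell wrapper group_records; it is not the code from post #5), a line-based script can buffer what it needs and flush it both on an empty line and in the END clause:

```shell
# Hypothetical input: two records separated by one empty line,
# with no empty line after the last record.
printf '%s\n' 'rec1 line1' 'rec1 line2' '' 'rec2 line1' > records.sample

group_records() {
  awk '
    # Report the buffered record (if any) and reset the counter.
    function print_group() {
      if (n > 0)
        printf "record with %d line(s): %s\n", n, first
      n = 0
    }
    NF == 0 { print_group(); next }   # empty line: the record just ended
    { if (++n == 1) first = $0 }      # remember line 1, count every line
    END { print_group() }             # last record has no empty line after it
  ' "$@"
}

group_records records.sample
```

With the default field separator, a line holding only spaces or TABs also has NF==0, so this style treats such lines as record boundaries too.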

If my code does what you want with your new sample input file, I'll be happy to try to explain how it works. If it doesn't work for you, then I still don't understand what you're trying to do and I won't be able to help you until I can understand requirements (i.e. specifications) for the software you're trying to write.

We could also look at the code Scrutinizer suggested, but we'd have to make it a little more complex to be sure that it looks for "<Co>" on the first line (and only on the first line) of each multi-line record and to be sure that it only looks for a match for field 2 from File 1 on lines that have "Lexeme" as the first word on a line (his code is currently looking for a match on any word in the entire multi-line record; not just on the "Lexeme" line).

Please understand that we know you have a non-computer science background and we respect that. But to write software that does what you want it to, a computer expects extremely picky details about the data it is asked to process to be clearly specified. If you can't do that, a computer will be very happy giving you results that match what you said you wanted instead of results that match what you really want. I tried to point out a couple of cases where what you said you wanted might not be what you really want, but only you can tell us whether some fields can appear anywhere or only in certain places, whether lines containing "Lexeme" can appear more than once (as in the example from your sample input above) or can only appear as the 2nd line of a multi-line "record" in your real data, etc.

P.S. If you'd like to play around with genes to find a cure for Type 1 diabetes, I'll be happy to write awk scripts for you to help you reach your goal. :)
# 10  
Old 01-22-2019
I apologize for the misleading information, Don. I have been trying to delete that last update that I had put up because I saw the error. Unfortunately, I do not see a way to delete a reply, and you beat me to it. Again, I am very sorry for that.
# 11  
Old 01-23-2019
You are welcome, jvoot.

Don is right, of course, and it all depends on what you require.

To illustrate, to add the additional requirements you could try these adjustments, which make the code more precise, but more complex.
Code:
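awk '
  # File 1: remember each (field 1, field 2) pair as one combined key.
  NR==FNR {
    Keys[$1,$2]
    next
  }
  # File 2: each record is a block of lines and each field is one line.
  /<Co>/ {
    for(line=1; line<=NF; line++) {
      split($line,Fields," ")
      if(line==1)
        subkey1=Fields[1]          # first word of the first line
      if(Fields[1]=="Lexeme") {    # only search "Lexeme" lines
        for(i in Fields)
          if((subkey1,Fields[i]) in Keys) {
            print
            next
          }
      }
    }
  }
' file1 FS='\n' RS= ORS='\n\n' file2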

How much precision you need depends on the variability of your input files. So to understand the limitations, you need to understand both your data and your code and, of course, you need to test.

This approach uses RS=, which is a special case in which empty lines (two or more consecutive newlines) act as the record separator. The additional requirements meant that the field separator FS needed to be changed to a newline, so that each field is one line of a record in file2. Each of these lines then needed to be split into smaller subfields using the split() function.
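
A small sketch of what those settings do, using a made-up two-record file (file2.sample and the wrapper name show_records are assumptions for the demo, not part of the solution above):

```shell
# Hypothetical File 2 fragment: two records separated by one empty line.
printf '%s\n' 'PS005,012 [W-<Cj>]' 'Lexeme W # L' '' \
              'PS005,013 [NJ <Ob>]' 'Lexeme NJ #' 'ClauseType xYq0' > file2.sample

show_records() {
  awk '{
    # With RS= and FS=newline, $0 is a whole block and $1..$NF are its lines.
    nwords = split($2, Words, " ")   # break line 2 into words
    print "record " NR ": " NF " lines; line 2 has " nwords " words, first is " Words[1]
  }' FS='\n' RS= "$@"
}

show_records file2.sample
# record 1: 2 lines; line 2 has 4 words, first is Lexeme
# record 2: 3 lines; line 2 has 3 words, first is Lexeme
```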

Each approach has its pros and cons.

A pro of this approach may be that the code can be simpler, so it may be easier to understand.

The cons are:
  • Adding more precision can sometimes lead to more complexity than a line-based approach.
  • If there is so much as a space or TAB character on any of the empty lines, it may break the solution, because such a line may no longer be treated as a record separator and the neighboring records then get merged.

You need to weigh these considerations when choosing your approach. It all depends.
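
If stray whitespace on the separator lines is a concern, one possible workaround (a sketch only; the sample file and the helper name count_records are made up) is to normalize those lines before awk sees them:

```shell
# Hypothetical input whose "empty" separator line holds a space and a TAB.
printf 'rec1 line1\n \t\nrec2 line1\n' > spaced.sample

count_records() {
  # Turn lines that hold only whitespace into truly empty lines,
  # then let paragraph mode (RS=) split on them.
  sed 's/^[[:space:]]*$//' "$@" | awk 'END { print NR }' RS= -
}

count_records spaced.sample   # prints 2
```

Whether this is worth the extra process depends on how clean your real data is.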

Last edited by Scrutinizer; 01-23-2019 at 12:42 AM..
# 12  
Old 01-23-2019
Quote:
Originally Posted by Scrutinizer
... ... ...
Hi Scrutinizer,
You and I interpreted the requirement from post #1:
Quote:
My hope was that when $1 of File 1 matches $1 in File 2, $0 in File 2 contains the string "<Co>", and $2 of File 1 matches a string *exactly* in File 2 on a line beginning with the word "Lexeme," then print.
differently. You interpreted "when $1 of File 1 matches $1 in File 2, $0 in File 2 contains the string "<Co>"" to mean that "<Co>" has to appear somewhere in the multi-line record while I interpreted it to mean that "<Co>" has to appear in the line where $1 in File 1 matches $1 in File 2.

If we change the line in your code that is:
Code:
  /<Co>/ {

to be:
Code:
  $1 ~ /<Co>/ {

then I believe that your code and my code would produce the same results.
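
A quick way to see the difference between the two patterns, assuming the same FS and RS settings and a made-up record in which "<Co>" appears only on the second line (co.sample and the two helper names are hypothetical):

```shell
# Hypothetical record: "<Co>" is on line 2 only, not on line 1.
printf '%s\n' 'PS005,012 [W-<Cj>]' 'Lexeme W <Co>' > co.sample

# Matches "<Co>" anywhere in the multi-line record.
anywhere()   { awk '/<Co>/      { print "selected" }' FS='\n' RS= "$@"; }
# Matches "<Co>" only in the first line of the record.
first_line() { awk '$1 ~ /<Co>/ { print "selected" }' FS='\n' RS= "$@"; }

anywhere co.sample     # prints "selected"
first_line co.sample   # prints nothing
```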

I also believe that changing the two lines in my code:
Code:
	if(PrintThisGroup) {
	... ... ...
++LinesInGroup == 1 && /<Co>/ {

to be:
Code:
	if(PrintThisGroup && Group ~ /<Co>/) {
	... ... ...
++LinesInGroup == 1 {

would make my code produce the same output as your code. I'm afraid jvoot will have to clarify the requirements for us to determine which one of us guessed correctly at the desired behavior.

Until I saw the sample record containing two "Lexeme" lines and two "ClauseType" lines, I had started trying a different approach (assuming that line 2 of a multi-line record in File 2 had to be the one and only "Lexeme" line in a record) that was closer to jvoot's original code. That method might have been easier for jvoot to follow, but since the given sample didn't match what I was expecting, I abandoned that approach.

It isn't surprising that we chose different approaches to solving the problem based on our different interpretations of the requirements. Clearly, there are lots of ways to address this problem.

Cheers,
Don