Post 303028012 by jvoot on Saturday 29th of December 2018 04:10:40 PM
(g)awk: Matching strings from one file in another file between two strings

Hello all, I can get close to what I am looking for but cannot hit it exactly, and I was wondering if I could get your help.

I have the following sample from a text file with many thousands of lines (File 1):
Code:
PS001,001 HLK
PS002,004 L<G
PS004,002 XNN
PS004,006 BVX
PS004,006 ZBX=
PS005,007 DBR=
PS005,011 MRH
PS005,012 XSH
PS006,003 RP>
PS006,003 XNN
PS006,010 LQX
PS007,002 XSH
PS009,011 BVX

I have another large text file with many lines such as these (File 2):
Code:
           * 0 1 55 0 0 .\ 1 LineNr 4 ClauseNr 1: 1: 2: 104: 505 11 SentenceNr 1 TxtType: Q Pargr: 2 ClType:InfC
 PS004,002 <NH                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -1 55 1 103 2 123 3 200 0 0 .N 0 LineNr 5 ClauseNr 2: 1: 2: 133: 0 0 SentenceNr 1 TxtType: Q Pargr: 2 ClType:ZIm0
           * 0 -2 123 0 0 .. 1 LineNr 7 ClauseNr 1: 1: 3: 132: 0 0 SentenceNr 2 TxtType: Q Pargr: 2 ClType:xQt0
 PS004,002 XNN                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0
 PS004,006 ZBX=                0   1  1  0  7 -1 -1    3  2  3  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,006 ZBX                 0   2 -1 -1 -1  5 -1   -1 -1  3  2     1   2   0  -1       2      -1      -1   -1   -1    -1
 PS004,006 YDQ                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     2   2   2   1  -10002      -1      -1    0  503     0
           * 0 -3 200 1 201 0 0 .. 5 LineNr 24 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 14 TxtType: Q Pargr: 2.1 ClType:ZIm0
          * 0 -2 523 1 122 0 0 .. 3 LineNr 32 ClauseNr 1: 1: 4: 142: 0 0 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:xQtX
 PS006,010 CM<                 0   1  0  0  1 -1 -1    2  3  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS006,010 JHWH                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2      -1      -1      -1    0  502     0
 PS006,010 TXNH                0   2 -1 -1 -1  3 -1   -1 -1  1  1     1   2   0  -1      -1      -1      -1   -1   -1    -1
 PS006,010 J                  -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   2   2      -1      -1      -1    0  503     0
           * 0 -1 122 1 112 0 0 .. 4 LineNr 33 ClauseNr 2: 1: 3: 112: -6 -11 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:ZQtX
           * 0 -1 122 1 112 0 0 .. 4 LineNr 33 ClauseNr 2: 1: 3: 112: -6 -11 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:ZQtX
 PS006,010 JHWH                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2      -1      -1      -1    0  502     0
 PS006,010 TPLH                0   2 -1 -1 -1  3 -1   -1 -1  1  1     1   2   0  -1      -1      -1      -1   -1   -1    -1
 PS006,010 J                  -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   2   2      -1      -1      -1    0  503     0
 PS006,010 LQX                 0   1  2  0  1 -1 -1    1  3  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
           * 0 -1 112 0 0 .. 5 LineNr 34 ClauseNr 3: 1: 3: 121: -6 -11 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:XYqt

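What I want: when $1 and $2 of a line in File 1 match $1 and $2 of a line in File 2, that line sits in a group of lines delimited by lines beginning with "*" (after leading whitespace), and some line in that same group has $22=="503", then print the whole group, including the "*" lines that open and close it. So: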
Code:
            * 0 -2 123 0 0 .. 1 LineNr 7 ClauseNr 1: 1: 3: 132: 0 0 SentenceNr 2 TxtType: Q Pargr: 2 ClType:xQt0
 PS004,002 XNN                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0
           * 0 -1 103 0 0 m. 7 LineNr 23 ClauseNr 1: 1: 1: 304: 0 0 SentenceNr 13 TxtType: Q Pargr: 2.1 ClType:MSyn
 PS004,006 ZBX=                0   1  1  0  7 -1 -1    3  2  3  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,006 ZBX                 0   2 -1 -1 -1  5 -1   -1 -1  3  2     1   2   0  -1       2      -1      -1   -1   -1    -1
 PS004,006 YDQ                 0   2 -1 -1 -1  1 -1   -1 -1  1  2     2   2   2   1  -10002      -1      -1    0  503     0
           * 0 -3 200 1 201 0 0 .. 5 LineNr 24 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 14 TxtType: Q Pargr: 2.1 ClType:ZIm0
           * 0 -1 122 1 112 0 0 .. 4 LineNr 33 ClauseNr 2: 1: 3: 112: -6 -11 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:ZQtX
 PS006,010 JHWH                0   3 -1 -1 -1  1 -1   -1 -1  1  2     2   3   3   2      -1      -1      -1    0  502     0
 PS006,010 TPLH                0   2 -1 -1 -1  3 -1   -1 -1  1  1     1   2   0  -1      -1      -1      -1   -1   -1    -1
 PS006,010 J                  -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   2   2      -1      -1      -1    0  503     0
 PS006,010 LQX                 0   1  2  0  1 -1 -1    1  3  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
           * 0 -1 112 0 0 .. 5 LineNr 34 ClauseNr 3: 1: 3: 121: -6 -11 SentenceNr 17 TxtType: Q Pargr: 2.1 ClType:XYqt

My current tactic is to first take File 2 and print only the groups between "*" lines that contain a line with $22=="503":
Code:
gawk '{BUF = BUF ORS $0} $22=="503"{PRT=1} /^ *\*/{if(PRT) print BUF; BUF=$0; PRT=""}' File2

Then I take File 1 and filter the previous output against it to find matches:
Code:
gawk 'FNR==NR{a[$1]; next} ($1) in a || $0 ~ /\*/' File1 <(gawk '{BUF = BUF ORS $0} $22=="503"{PRT=1} /^ *\*/{if(PRT) print BUF; BUF=$0; PRT=""}' File2)

However, this method produces many false matches, because the search criterion ($1 of File 1) is too ambiguous to single out the specific groups I need. If I add the other field ($2) to the search criteria from File 1, the match becomes too specific and no longer includes the surrounding lines of the group.

So for example, given a hypothetical:
File 1a
Code:
PS004,002 XNN

File 2a
Code:
 * 0 1 55 0 0 .\ 1 LineNr 4 ClauseNr 1: 1: 2: 104: 505 11 SentenceNr 1 TxtType: Q Pargr: 2 ClType:InfC
 PS004,002 <NH                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -1 55 1 103 2 123 3 200 0 0 .N 0 LineNr 5 ClauseNr 2: 1: 2: 133: 0 0 SentenceNr 1 TxtType: Q Pargr: 2 ClType:ZIm0
           * 0 -2 123 0 0 .. 1 LineNr 7 ClauseNr 1: 1: 3: 132: 0 0 SentenceNr 2 TxtType: Q Pargr: 2 ClType:xQt0
 PS004,002 XNN                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0

My sample code gives:
Code:
 * 0 1 55 0 0 .\ 1 LineNr 4 ClauseNr 1: 1: 2: 104: 505 11 SentenceNr 1 TxtType: Q Pargr: 2 ClType:InfC
 PS004,002 <NH                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -1 55 1 103 2 123 3 200 0 0 .N 0 LineNr 5 ClauseNr 2: 1: 2: 133: 0 0 SentenceNr 1 TxtType: Q Pargr: 2 ClType:ZIm0
           * 0 -2 123 0 0 .. 1 LineNr 7 ClauseNr 1: 1: 3: 132: 0 0 SentenceNr 2 TxtType: Q Pargr: 2 ClType:xQt0
 PS004,002 XNN                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0

Rather than the desired:
Code:
           * 0 -2 123 0 0 .. 1 LineNr 7 ClauseNr 1: 1: 3: 132: 0 0 SentenceNr 2 TxtType: Q Pargr: 2 ClType:xQt0
 PS004,002 XNN                 0   1  1  0  1 -1 -1    3  2  1  2    -1   1   1  -1      -1      -1      -1    0  501     0
 PS004,002 NJ                 -1   7 -1 -1 -1 -1 -1   -1  1  1 -1    -1   7   7   2      -1      -1      -1    0  503     0
           * 0 -3 200 1 201 2 103 18 163 22 123 0 0 .. 0 LineNr 8 ClauseNr 1: 1: 2: 103: 0 0 SentenceNr 3 TxtType: Q Pargr: 2 ClType:ZIm0
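
A single pass over both files might avoid both problems. The following is only a sketch under assumptions, not a verified solution for the real data: it keys on $1 and $2 together, buffers each group between "*" lines, and prints a group only when it contains both a line listed in File 1 and a line with $22=="503". The function name print_matching_groups is made up for illustration. The sketch assumes that every group is closed by a "*" line and that File 1 is not empty. It uses only POSIX awk features, so gawk should accept it as well.

```shell
# Hypothetical helper: $1 = key file (File 1), $2 = data file (File 2).
print_matching_groups() {
    awk '
    # First file: remember every ($1, $2) pair as a composite key.
    NR == FNR { want[$1, $2]; next }

    # A "*" line closes the current group and opens the next one.
    /^[ \t]*\*/ {
        if (hit && has503) print buf ORS $0
        buf = $0
        hit = has503 = 0
        next
    }

    # Data line: append it to the group and record both conditions.
    {
        buf = (buf == "" ? $0 : buf ORS $0)
        if (($1, $2) in want) hit = 1
        if ($22 == "503") has503 = 1
    }
    ' "$1" "$2"
}
```

Called as print_matching_groups File1 File2, this should print only the second group for the File 1a / File 2a example, because the first group has a 503 line but no PS004,002 XNN line. A final group that is not followed by a "*" line would not be printed. An END rule could cover that case if the real data needs it.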

Thanks so much and sorry for the lengthy post. Hopefully I have described this accurately.

Last edited by RudiC; 12-29-2018 at 05:30 PM.. Reason: corrected ICODE --> CODE
 
