awk Associative Array and/or Referring to Field by String (Nonconstant String Value)

02-01-2019

One could still try:

Code:

awk '$1 ~ /^PS/ {for(i=3; i<=NF; i++) if($i == "<Ob>]"){print $1,substr($(i-1), 2); next}}' file

without needing to use split() (unless I misunderstood and you changed your input file format to remove the <space> before the <Ob>]).

Don Cragun

02-01-2019

I'm so sorry, Scrutinizer, but because my input is many thousands of lines long, I did not notice a complicating issue that I hope you can help me address. There are times when the desired string between the initial "[" and "<Ob>]" contains a space.

So for example, given:

Code:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
 Lexeme     KJ      # CM<      # QWL TXNWN J      #
 PhraseType  6(6) 1(1:2) 2(2.1,2.1,7)
 PhraseLab  509[0]    501[0]     503[0]
 ClauseType xQt0

Which I would pare down with INPUT | awk '$1 ~ /^PS/' to get:

Code:

PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]

In this case, the desired output would be:

Code:

PS028,006 QWL TXNWNJ-

Code:

PS028,006 [QWL TXNWNJ- <Ob>]

The code you helped me with only gives:

Code:

PS028,006 QWL

Again, I apologize that I did not see the possibility of a space within the desired string until I double-checked the output against INPUT | sed -e 's/.* \[\(.*\) <Ob>\].*/\1/', which gives me the desired string but not $1 for the lines where $1 ~ /^PS/.

Would you be able to help me iron this out?

--- Post updated at 10:02 PM ---

Quote:

Originally Posted by Don Cragun

One could still try:

Code:

awk '$1 ~ /^PS/ {for(i=3; i<=NF; i++) if($i == "<Ob>]"){print $1,substr($(i-1), 2); next}}' file

without needing to use split() (unless I misunderstood and you changed your input file format to remove the <space> before the <Ob>]).

This works well, Don, except that I represented the desired output strings as "ABC" and "XYZ", which it seems you took to be three-character strings. I should have been more specific and said that "ABC" and "XYZ" represent strings of any length. Thus something like ["some amount of text" <Ob>].

jvoot

02-01-2019

OK... One final attempt...

Based on the single sample in your latest input, the following seems to do what you want and will at least show you any lines it wasn't able to match:

Code:

awk '
$1 ~ /^PS/ {
	if(match($0, /[[][^[]* <Ob>[]]/))
		print $1, substr($0, RSTART + 1, RLENGTH - 7)
	else
		print "No Match Found on line " NR, $0
}' file
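
To see where the + 1 and the - 7 in the substr() call come from, here is a small sketch that prints the intermediate values for the sample line. The match covers the opening "[" (1 character) and the trailing " <Ob>]" (6 characters), so 7 characters are dropped in total. The function name show_match is made up for this illustration and is not part of the original code.

```shell
# show_match is a hypothetical helper name used only for this illustration.
show_match() {
  awk '
  $1 ~ /^PS/ && match($0, /[[][^[]* <Ob>[]]/) {
      # Whole matched text, from the opening bracket through "<Ob>]".
      print "matched: " substr($0, RSTART, RLENGTH)
      # Skip "[" (1 char) and drop " <Ob>]" (6 chars): 1 + 6 = 7.
      print "wanted:  " substr($0, RSTART + 1, RLENGTH - 7)
  }'
}

printf '%s\n' ' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]' | show_match
# matched: [QWL TXNWNJ- <Ob>]
# wanted:  QWL TXNWNJ-
```

Because [^[]* cannot cross another "[", the match can only start at the bracket that opens the <Ob> phrase, which is why embedded spaces are no longer a problem.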

Don Cragun

02-01-2019

Try also

Code:

awk -F"[][]" '/^ *PS.*<Ob>/ {sub(/ *<Ob>.*$/, ""); print $1, $NF}' file
 PS028,005  ABC 
 PS028,005  XYZ 
 PS028,006  QWL TXNWNJ-
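
For anyone wondering why $NF ends up holding the wanted text, the following sketch traces the fields after the sub() call. It assumes the sample line posted earlier in the thread; the function name trace_fields is hypothetical. The sub() call cuts the record at " <Ob>", awk then re-splits the shortened record on "[" and "]", and the text after the last remaining bracket becomes the last field.

```shell
# trace_fields is a hypothetical name; it prints every field after the sub() call.
trace_fields() {
  awk -F'[][]' '
  /^ *PS.*<Ob>/ {
      sub(/ *<Ob>.*$/, "")          # cut the record at " <Ob>"; awk re-splits $0
      for (i = 1; i <= NF; i++)
          printf "%d: |%s|\n", i, $i
  }'
}

printf '%s\n' ' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]' | trace_fields
# 1: | PS028,006 |
# 2: |KJ <Cj>|
# 3: | |
# 4: |CM< <Pr>|
# 5: | |
# 6: |QWL TXNWNJ-|
```

The bars make the surrounding spaces in field 1 visible, which is where the extra blanks in the printed output come from.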

RudiC

02-01-2019

Quote:

Originally Posted by RudiC

Try also

Code:

awk -F"[][]" '/^ *PS.*<Ob>/ {sub(/ *<Ob>.*$/, ""); print $1, $NF}' file
 PS028,005  ABC 
 PS028,005  XYZ 
 PS028,006  QWL TXNWNJ-

That did it, RudiC! Such a simple and elegant way to accomplish it! Thanks so much also to Scrutinizer and Don Cragun for your help!

If I may, could I please ask two questions about how this code works? The first is about the field separator value. The awk man page seems only to imply, rather than state explicitly, that using square brackets when setting the field separator on the command line tells awk to interpret what is between them as a regex rather than as a fixed string, which would otherwise be indicated by "...". Is this correct?

Secondly, since the value for FS has been set to "][", how come {print $1} does not print from the beginning of the line to the first instance of "][", but rather prints what would be $1 when FS is set to whitespace? In other words, given:

Code:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
 Lexeme     KJ      # CM<      # QWL TXNWN J      #
 PhraseType  6(6) 1(1:2) 2(2.1,2.1,7)
 PhraseLab  509[0]    501[0]     503[0]
 ClauseType xQt0

Why does RudiC's code not give PS028,006 [KJ <Cj> for {print $1} if FS is set to "]["?

Instead, it gives the (desired) first field as if FS were at its default: PS028,006.

Thanks again!

jvoot

02-01-2019

Hi jvoot,
The standards clearly state that the value of the awk FS variable is an extended regular expression and it doesn't matter whether it is set using the -F option, using the -v option, using an assignment statement between pathname operands, or using an assignment statement in the awk script itself. When the ERE is set to [][] that is a bracket expression that specifies that the <open-square-bracket> character ([) and the <close-square-bracket> character (]) are each to be treated as separate field separators.
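
As a quick check of that point, the sketch below sets the same bracket expression in three of the ways mentioned and prints NF for the sample line from this thread. Each command should report the same field count. The variable name line is only for illustration.

```shell
line=' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]'

# Three ways of setting the same FS; each prints the number of fields.
printf '%s\n' "$line" | awk -F'[][]' '{print NF}'               # -F option
printf '%s\n' "$line" | awk -v 'FS=[][]' '{print NF}'           # -v option
printf '%s\n' "$line" | awk 'BEGIN {FS = "[][]"} {print NF}'    # assignment in the script
# Each prints 7: six bracket characters split the line into seven fields,
# the last of which is empty.
```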

With the FS value RudiC used, field 1 is everything that appears in the record before the 1st open or close square bracket character (including the leading and trailing <space>). I chose to use the default FS value because I didn't think you wanted the leading and trailing <space> characters at the start of lines in your input data to be included in your output.

Hope this helps,
Don

Don Cragun

02-02-2019

Quote:

Originally Posted by jvoot

...
The first is about the field separator value. The man AWK page seems to only imply rather than being explicit that the use of the square brackets when setting the field separator from the command line tells AWK to interpret what is between them as a regex rather than simply a fixed string which would otherwise be indicated by "..."? Is this correct?

You are partly right: the field separator string is always interpreted as a regex. In Scrutinizer's proposal (from which I shamelessly stole), he uses the bracket expression [][].
man regex:

Quote:

A bracket expression is a list of characters enclosed in "[]". It normally matches any single character from the list.

So awk splits the input line at any occurrence of either [ or ] .

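BTW, awk's default FS is a single space, which awk treats specially: it splits on runs of blanks and newlines, much like the bracket-expression regex /[ \t\n]+/, but it also ignores leading and trailing blanks.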
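
A small sketch of how default splitting compares with an explicit regex FS on a line that starts with a blank. With the default FS, leading blanks are skipped; with the regex, the leading blank counts as a separator and $1 comes out empty. The sample line is cut down from the thread's input, and the variable name line is only for illustration.

```shell
line=' PS028,006 [KJ <Cj>]'

# Default FS: leading blanks are skipped, so $1 is the first word.
printf '%s\n' "$line" | awk '{print NF ": |" $1 "|"}'
# 3: |PS028,006|

# Explicit regex FS: the leading blank is a separator, so $1 is empty.
printf '%s\n' "$line" | awk -F'[ \t\n]+' '{print NF ": |" $1 "|"}'
# 4: ||
```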

Quote:

Secondly, since the value for FS has been set to "][", how come {print $1} does not print from the beginning of the line to the first instance of "][", but rather prints what would be $1 when FS is set to whitespace?

It does. Please apply what has been said to the respective line:

Code:

 PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]
^          ^       ^ ^        ^ ^                ^--- last separator; $NF is empty
|          +-------+-+--------+-+-------------------- all FS
+---------------------------------------------------- field 1

Is that clearer now? If you want to remove the leading space from field 1, additional measures must be taken.
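
One possible "additional measure", offered as a sketch rather than the only way to do it: copy field 1 into a variable and strip its leading and trailing spaces with gsub() before printing. The function name extract_trimmed and the variable id are made up for this example.

```shell
# extract_trimmed is a hypothetical name for this sketch.
extract_trimmed() {
  awk -F'[][]' '
  /^ *PS.*<Ob>/ {
      sub(/ *<Ob>.*$/, "")         # drop " <Ob>" and everything after it
      id = $1                      # field 1 still has the surrounding spaces
      gsub(/^ +| +$/, "", id)      # strip leading and trailing spaces
      print id, $NF
  }'
}

printf '%s\n' ' PS028,006 [KJ <Cj>] [CM< <Pr>] [QWL TXNWNJ- <Ob>]' | extract_trimmed
# PS028,006 QWL TXNWNJ-
```

Working on a copy rather than on $1 itself avoids awk rebuilding $0 with OFS, so the rest of the record is left untouched.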

RudiC
