Awk: Dealing with whitespace in associative array indices

04-30-2015


Is there a reliable way to deal with whitespace in array indices?

I am trying to annotate fails in a database using a table of known fails.

In a BEGIN block I have code like this:

Code:

# Read in Known Fail List
getline < "'"$failListFile"'"; getline < "'"$failListFile"'"; getline < "'"$failListFile"'" # Header Rows
while (getline < "'"$failListFile"'") { split( $0, a, ","); failMessage[a[1]a[2]a[3]a[4]a[5]]=a[8] }
close("'"$failListFile"'")

And in the main part, code like this:

Code:

if ( $10 > limit) { $12 = "Fail"; if ( $2$3$4$5$9 in failMessage)
                                      { $12 = "Known Fail"; if ($7 == "\"\"") gsub ( "\"$", "Known Fail: "failMessage[$2$3$4$5$9]"\"", $7 )
                                                            else gsub ( "\"$", "|Known Fail: "failMessage[$2$3$4$5$9]"\"", $7 ) } }
else $12 = "Pass"

$10 is a test value, $12 is pass/fail, $7 is a comment, $2-$5 are building, room, position in room, etc. Later in my code I translate "|" characters to new lines. Everything works fine except when some of my room names have whitespace.

Code:

$ echo | awk 'BEGIN { a[1]="abc"; if (1 in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[1]="abc"; if ("1" in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a["1"]="abc"; if ("1" in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains whitespace in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

$ echo a b c | awk 'NR==1 { a[$1$2$3]="abc"; if ( "abc" in a ) { print "index exists"} }'
index exists

$ echo a b " c"| awk 'NR==1 { a[$1$2$3]="abc"; if ( "ab c" in a ) { print "index exists"} }'

$ echo a b " c"| awk 'NR==1 { a["$1$2$3"]="abc"; if ( "ab c" in a ) { print "index exists"} }'

Mike

Last edited by Michael Stora; 04-30-2015 at 11:30 PM..

Michael Stora


05-01-2015


Code:

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

This is not a false positive. Unquoted, contains, whitespace and purplespace are unset variables, so in both cases you are concatenating two null values and producing a null index. In other words:
contains whitespace == contains purplespace

Consider:

Code:

echo | awk 'BEGIN { whitespace="ws"; purplespace="ps"; a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
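A minimal sketch of the same point, assuming any POSIX awk: words left unquoted inside the brackets are variable names, so two unset variables concatenate to the empty string, while quoted strings are literal keys. The shell variable name `out` is only for illustration.

```shell
out=$(awk 'BEGIN {
    # contains, whitespace and purplespace are unset variables here,
    # so both subscripts evaluate to the empty string "".
    a[contains whitespace] = "abc"
    if (contains purplespace in a) print "unquoted: match on empty key"

    # Quoted strings are literal keys; the space is part of the key.
    b["contains whitespace"] = "abc"
    if ("contains whitespace" in b) print "quoted: exact key found"
    if (!("contains purplespace" in b)) print "quoted: other key absent"
}')
printf '%s\n' "$out"
```

The first test succeeds only because both sides reduce to the same empty key, not because awk compared the two names.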

Chubler_XL


05-01-2015


Quote:

Originally Posted by Chubler_XL

Code:

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

This is not a false positive. Unquoted, contains, whitespace and purplespace are unset variables, so in both cases you are concatenating two null values and producing a null index. In other words:
contains whitespace == contains purplespace

Consider:

Code:

echo | awk 'BEGIN { whitespace="ws"; purplespace="ps"; a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'

I meant false positive in the sense of what I am trying to do, not in the sense of "hey, I found a bug in AWK!". I was just showing that I had tried all possible combinations with and without quoting. I realized that it was parsing them as variables.

The problem is the existence of whitespace in the variables, and I'm asking what the most elegant way is to deal with whitespace in an array index.

The best I can come up with is to remove the whitespace before the array assignment and before the membership check:

Code:

$ echo x," contains whitespace ",z | awk -F, 'NR==1 { i=$1$2$3; print i; gsub ( " ", "", i ); print i }'
x contains whitespace z
xcontainswhitespacez
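One alternative to stripping spaces, sketched here on the assumption that the key parts never contain awk's SUBSEP character (by default "\034"): comma-separated subscripts join the parts with SUBSEP, so whitespace is kept and adjacent parts cannot run together. The sample rows and the variable name `out` are invented for illustration.

```shell
out=$(printf '%s\n' 'B1,Room 12,A' 'B1,Room 1,2A' |
awk -F, '
    { seen[$1, $2, $3] = NR }      # key is $1 SUBSEP $2 SUBSEP $3
    END {
        if (("B1", "Room 12", "A") in seen) print "found: Room 12 / A"
        if (("B1", "Room 1", "2A") in seen) print "found: Room 1 / 2A"
        if (!(("B1", "Room12", "A") in seen)) print "absent: Room12 / A"
        n = 0; for (k in seen) n++
        print n " distinct keys"
    }')
printf '%s\n' "$out"
```

With plain concatenation both sample rows would produce the same key, B1Room 12A, which is the kind of collision that separator-joined subscripts avoid.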

Mike

Last edited by Michael Stora; 05-01-2015 at 05:02 AM..

Michael Stora


05-01-2015


... Why is whitespace a problem? It seems part of your problem is that you're embedding shell variables in awk code, you're not quoting properly, and you don't fully understand how awk splits records.

With the default FS, awk splits at runs of whitespace, however many characters there are. You then build the index without any spaces, because the separators are not part of $1 or $2, so $1$2 has them stripped out. Using $1 " " $2 would squeeze multiple spaces into single ones.

Code:

echo 'ab c' | awk 'NR==1 { a[$0]="abc"; if ( "ab c" in a ) { print "index exists"} }'
index exists

Treating the entire line as the index, you see spaces work fine.
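To see where the spaces go, here is a small sketch (the sample input and the names `k1` and `k2` are made up) comparing the default field separator with -F,. With the default FS the blanks are consumed as separators and never reach the fields; with a comma FS they stay inside the field and therefore inside the key.

```shell
# Default FS: runs of blanks separate fields, so the key has no spaces.
k1=$(echo 'a b  c' | awk '{ print "[" $1 $2 $3 "]" }')

# Comma FS: blanks are ordinary characters and survive in the key.
k2=$(echo 'a, b  ,c' | awk -F, '{ print "[" $1 $2 $3 "]" }')

printf '%s\n%s\n' "$k1" "$k2"
```

The brackets are only there to make leading and trailing blanks visible in the output.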

Please reword the issue if I'm not understanding you...

edit: I re-read the problem. What is the FS you're using? I see the split() uses ,.

neutronscott


05-01-2015


My attempts to demonstrate what I thought was happening with simpler examples on the command line were flawed and introduced additional errors (I did duplicate the issue using awk -F, but did not post those examples), but I don't believe my "real" code has the same flaws.

I believe that I am quoting the shell variables correctly (or else the getline calls in my code below would fail), and since I specify an alternate delimiter in my awk statement, the issues with whitespace in array indices are not coming from either splitting or my understanding of splitting.

First I am reading in a CSV file that may have non-separating commas, whitespace, and other potentially problematic characters inside double-quoted fields. So I start off with a double-quote-counting parser and an alternative delimiter (the old ASCII unit separator from punch card/paper tape days).

Code:

 
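delim=$'\037' # ASCII Unit Separator (US)

awk '{ quote=0; n=length($0)
       for(i=1;i<=n;i++)
                   { ch=substr($0, i, 1)
                     if ( ch == "\"" ) quote=( ++quote % 2)   # toggle the inside-quotes flag
                         else if ( quote == 0 && ch == ",") ch="'"$delim"'"   # only unquoted commas become the delimiter
                     printf "%s", ch }   # "%s" keeps a literal % in the data from being read as a format
       print ""
     }' "$scratchDir""My_Input_File" |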
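The same quote-counting idea can be sketched with the delimiter passed in through -v instead of being spliced into the program text, which avoids the nested quoting. The function name `csv_to_us` and the sample line are illustrative only; the sketch assumes double quotes are balanced within each line and does not handle fields that span lines.

```shell
csv_to_us() {
    awk -v d="$(printf '\037')" '{
        inq = 0; out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            if (ch == "\"") inq = !inq              # toggle on every double quote
            else if (!inq && ch == ",") ch = d      # unquoted comma becomes the delimiter
            out = out ch
        }
        print out
    }'
}

# Show the result with the unit separator made visible as "|".
shown=$(printf '%s\n' 'x,"a, b",z' | csv_to_us | tr '\037' '|')
printf '%s\n' "$shown"
```

The comma inside the quoted field is left alone, so the downstream awk can split on the unit separator without touching the field contents.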

Next I have a rather cryptic and very long awk routine that transposes data and then stacks 21 variable columns into two columns (a variable name column and a variable value column) and adds some other values from shell variables to the pipe. It is very long and very complicated (as well as cryptic), so I will omit that part (it is working exactly as expected).

The next part of my code manipulates the data based on values in three different configuration files. I think you will find that I am quoting the external file names from bash correctly. If I don't get the nesting of single and double quotes exactly right, the getline calls fail.

The limit file contains a list of spec limits for different parameters (in the first column), for both areas considered sensitive and areas considered insensitive (two different columns). Then there is a file with a list of sensitive areas. Finally, there is a file with a list of known issues that I wish to substitute.
Edit: In response to the question in your edit, these config files I am reading in are comma-separated values. They do not have the issue of commas inside quotes, but one of them does have spaces inside some rows of a few columns.

Code:

# Look up and apply limits.
awk -F "$delim" 'BEGIN { OFS=FS
                 # Read in Vibration Limits
                 getline < "'"$limitFile"'" # Header Row
                 while (getline < "'"$limitFile"'") { split( $0, a, ","); vLim[a[1]]=a[2]; vLim[a[1]"Sen"]=a[3] }
                 close("'"$limitFile"'")
                 # Read in Sensitive Bay List 
                 getline < "'"$bayListFile"'" # Header Row
                 while (getline < "'"$bayListFile"'") { split( $0, a, ","); sBL[a[1]a[2]]="Sen" }
                 close("'"$bayListFile"'") 
                 # Read in Known Fail List
                 getline < "'"$failListFile"'"; getline < "'"$failListFile"'"; getline < "'"$failListFile"'" # Header Rows
                 while (getline < "'"$failListFile"'") { split( $0, a, ","); i=a[1]a[2]a[3]a[4]a[5]; gsub ( " ", "", i ); failMessage[i]=a[8]
                     fs=a[6]; sub ( "^$", "0000 01 01 00 00 00", fs ); failStart[i]=fs
                     fe=a[7]; sub ( "^$", "9999 12 31 23 59 59", fe ); failEnd[i]=fe }
                 close("'"$failListFile"'") }    
         NR == 1 { print $0 } 
         NR > 1 { if ( $9 ~ /Hz/ ) { limit=vLim[$9sBL[$2$3]]; $11=limit # works great since $9 $2 and $3 never have whitespace
                      if ( $10 > limit) { $12 = "Fail"; i=$2$3$4$5$9; gsub ( " ", "", i ) # sometimes $3, $4, or $5 have spaces.  My code now works with the gsub removing spaces from the index but my purpose in posting was to better understand how Awk handles whitespace in indexes ("help me understand more" not "help me write a script")
                         if ( i in failMessage ) { now = mktime(year" "month" "day" "hour" "min" "sec) # I am still writing this time-aware part and it is not part of my question
                        now = mktime("2015 01 01 00 00 00") # debug; mktime() takes one "YYYY MM DD HH MM SS" string
                            if ( now >= mktime(failStart[i]) && now <= mktime(failEnd[i]) ) {
                                $12 = "Known Fail"; if ($7 == "\"\"") gsub ( "\"$", "Known Fail: "failMessage[i]"\"", $7 ) #column 7 is a double quoted sting.  When "null" it is actually ""
                                                    else gsub ( "\"$", "|Known Fail: "failMessage[i]"\"", $7 ) } } } #I convert "|" into DOS style new lines at a later time in my code.  When there is already a comment in $7 I want a new line between my known failure mode message
                      else $12 = "Pass"; print $0 }
                }' |
# Remove alternative delimiter
sed -e 's/'"$delim"'/,/g' > "$scratchDir""My Output File"

Of course, a ton of code exists before and after what I included, but it is outside the context of my question.

Just as a reminder of the context of my question: the code works now that I am removing spaces from the index (the variable i) using the gsub commands. My question is educational: what does awk do with whitespace in an index assignment?
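As a sketch of the general behavior in any POSIX awk: an array subscript is stored as the exact string value of the subscript expression, so every space (leading, trailing, or internal) is significant and nothing is trimmed or collapsed. The keys below are made up for illustration.

```shell
out=$(awk 'BEGIN {
    a["d e f"]  = 1     # internal spaces
    a[" d e f"] = 2     # leading space: a different key
    a["def"]    = 3     # no spaces: a third key
    n = 0; for (k in a) n++
    print n " keys"
    print (("d e f" in a)  ? "yes" : "no")   # exact match
    print (("d  e f" in a) ? "yes" : "no")   # two spaces: no match
}')
printf '%s\n' "$out"
```

If two indexes that look alike fail to match, the difference is therefore in the strings themselves, for example a stray blank, tab, or carriage return in one of the inputs.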

Also, in general, my questions on UNIX.com are not even about getting something to work but about getting something to work efficiently when parsing huge files. In this particular project I am dealing with about 10,000 files that end up in a ~200k line database, but in some other projects I am dealing with more than 50 million lines of data. In the case of this particular project I got the running time down from 70 min for a year of data to less than 5 min for 2 years of data. This involved moving a lot of BASH code to AWK, eliminating utility calls (using only built-ins in the interest of speed, even if more complex) and file interactions in any kind of loop or iterative part, and timing alternate versions of portions of my code in AWK, PERL, SED, BASH etc. and picking the best performing one. I apologize if my questions about understanding more about how something works, or whether there are alternatives, are being misinterpreted as "this is broken, show me an implementation" types of questions. Generally when I ask these questions I have something working but I have a suspicion that there is a better/faster way.

Mike

Last edited by Michael Stora; 05-01-2015 at 04:01 PM..

Michael Stora


05-01-2015


Well, I won't pick apart your code too much more, then.

To answer the question on whitespace: awk shouldn't be messing with it, aside from the default FS being equivalent to [[:space:]][[:space:]]*.

With that in mind, your first call to awk uses the default FS, which may lead to your data having leading/trailing whitespace removed.

Print the indexes upon assignment and before the gsub in the latter awk to stderr and comb through it. It's probably what's happening.
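A sketch of that debugging step, assuming an awk that can write to "/dev/stderr" (gawk and mawk treat the name specially, and on Linux it also exists in the filesystem). Bracketing the key and printing its length makes stray leading or trailing blanks visible. The sample row and the name `dbg` are invented.

```shell
# Keys go to stderr (captured here into $dbg); normal output is discarded.
dbg=$(echo 'B1, Room 12 ,A' |
      awk -F, '{ key = $1 $2 $3
                 print "key=[" key "] len=" length(key) > "/dev/stderr"
                 print }' 2>&1 >/dev/null)
printf '%s\n' "$dbg"
```

Running the same print at both the assignment and the lookup, then comparing the two lists, should show whether the keys really differ only by whitespace.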


neutronscott


05-01-2015


Trying to be more rigorous with attempts to demonstrate the question in simple code examples results in me being unable to duplicate the problem . . .

Now I am left to ponder whether the problem I thought I fixed with the gsub commands was even a problem, and whether I somehow accidentally fixed something else (entropy notwithstanding) . . .

Code:

 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( a[1]a[2]a[3] in c) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( "abcd e fxyz" in c) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( "something else" in c ) print "index found" }'
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( $1$2$3 in c ) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[$1$2$3]="value"; if ( a[1]a[2]a[3] in c ) print "index found" }'
index found

Mike

---------- Post updated at 12:31 PM ---------- Previous update was at 12:20 PM ----------

Quote:

Originally Posted by neutronscott

Well, I won't pick apart your code too much more, then.

Thanks for the suggestion, but I think the error is creeping in through another source, since my first awk invocation uses only $0, which appears never to get parsed, and my second (omitted) awk statement uses the alternative delimiter.

Code:

$ echo " this is a line with beginning and end spaces plus several internal spaces ( ) " | awk 'NR==1 {print "x"$0"x"}'
x this is a line with beginning and end spaces plus several internal spaces ( ) x

Actually, awk doesn't collapse the repeated spaces but UNIX.com does, so you'll have to take my word for it.

However, you may be right about the error creeping in from something else in my input file. I will look in that direction since I have exhausted other avenues.

BTW: I have already kicked myself several times for not starting with tab-separated values from the very beginning of the project.

Tabs never exist in my fields.

Mike

Last edited by Michael Stora; 05-01-2015 at 04:36 PM..

Michael Stora

