Awk: Dealing with whitespace in associative array indices


 
# 1  
Old 04-30-2015
Awk: Dealing with whitespace in associative array indices

Is there a reliable way to deal with whitespace in array indices?

I am trying to annotate failures in a database using a table of known failures.

In a BEGIN block I have code like this:
Code:
# Read in Known Fail List
getline < "'"$failListFile"'"; getline < "'"$failListFile"'"; getline < "'"$failListFile"'" # Header Rows
while (getline < "'"$failListFile"'") { split( $0, a, ","); failMessage[a[1]a[2]a[3]a[4]a[5]]=a[8] }
close("'"$failListFile"'")

And in the main part, code like this:
Code:
if ( $10 > limit) { $12 = "Fail"; if ( $2$3$4$5$9 in failMessage)
                                      { $12 = "Known Fail"; if ($7 == "\"\"") gsub ( "\"$", "Known Fail: "failMessage[$2$3$4$5$9]"\"", $7 )
                                                            else gsub ( "\"$", "|Known Fail: "failMessage[$2$3$4$5$9]"\"", $7 ) } }
else $12 = "Pass"

$10 is a test value, $12 is pass/fail, $7 is a comment, $2-$5 are building, room, position in room, etc. Later in my code I translate "|" characters to new lines. Everything works fine except when some of my room names have whitespace.

Code:
$ echo | awk 'BEGIN { a[1]="abc"; if (1 in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[1]="abc"; if ("1" in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a["1"]="abc"; if ("1" in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains whitespace in a) print "index exists"}'
index exists

$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

$ echo a b c | awk 'NR==1 { a[$1$2$3]="abc"; if ( "abc" in a ) { print "index exists"} }'
index exists

$ echo a b " c"| awk 'NR==1 { a[$1$2$3]="abc"; if ( "ab c" in a ) { print "index exists"} }'

$ echo a b " c"| awk 'NR==1 { a["$1$2$3"]="abc"; if ( "ab c" in a ) { print "index exists"} }'

Mike

Last edited by Michael Stora; 04-30-2015 at 11:30 PM..
# 2  
Old 05-01-2015
Code:
$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

This is not a false positive: you are concatenating two null variables, which produces a null value, so that:
contains whitespace == contains purplespace

Consider:

Code:
echo | awk 'BEGIN { whitespace="ws"; purplespace="ps"; a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
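
To make the point concrete: a quoted string is a literal index, spaces included, whereas bare words are (empty) variables that get concatenated. A minimal sketch; the shell variable `out` and the array name `b` exist only for this illustration.

```shell
out=$(awk 'BEGIN {
    # Quoted strings are literal indices; the embedded space is part of the key.
    a["contains whitespace"] = "abc"
    if ("contains whitespace" in a)     print "literal key found"
    if (!("contains purplespace" in a)) print "other key not found"
    # Bare words are uninitialized variables: both expressions are "".
    b[contains whitespace] = "abc"
    if (contains purplespace in b)      print "empty key matches"
}')
echo "$out"
```

All three messages should be printed, the last one because both unquoted expressions evaluate to the empty string.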

# 3  
Old 05-01-2015
Quote:
Originally Posted by Chubler_XL
Code:
$ echo | awk 'BEGIN { a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'
index exists FALSE POSITIVE

This is not a false positive: you are concatenating two null variables, which produces a null value, so that:
contains whitespace == contains purplespace

Consider:

Code:
echo | awk 'BEGIN { whitespace="ws"; purplespace="ps"; a[contains whitespace]="abc"; if (contains purplespace in a) print "index exists"}'

I meant false positive in the sense of what I am trying to do, not in the sense of "hey, I found a bug in AWK!". I was just showing that I had tried all the combinations, with and without quoting. I realized that it was parsing them as variables.

The problem is the existence of whitespace in the variables, and I'm asking what the most elegant way is to deal with whitespace in an array index.

The best I can come up with is to remove the whitespace before both the array assignment and the membership check:
Code:
$ echo x," contains whitespace ",z | awk -F, 'NR==1 { i=$1$2$3; print i; gsub ( " ", "", i ); print i }'
x contains whitespace z
xcontainswhitespacez
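
A possible alternative to deleting every space, offered as a sketch rather than as the method used in this thread: trim only the leading and trailing whitespace of each key part and join the parts with awk's built-in SUBSEP (the comma form of a subscript), so that "ab","c" and "a","bc" remain distinct keys. The helper name `trim`, the array name `seen`, and the sample input are made up for illustration.

```shell
out=$(printf '%s\n' 'x, contains whitespace ,z' | awk -F, '
# trim() is a hypothetical helper: strip leading/trailing blanks and tabs.
function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
{
    # seen[p, q, r] is shorthand for seen[p SUBSEP q SUBSEP r].
    seen[trim($1), trim($2), trim($3)] = "abc"
    if (("x", "contains whitespace", "z") in seen) print "index exists"
}')
echo "$out"
```

The internal space in the middle field survives; only the padding around it is removed before the lookup.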

Mike

Last edited by Michael Stora; 05-01-2015 at 05:02 AM..
# 4  
Old 05-01-2015
... Why is whitespace a problem? It seems part of your problem is that you're embedding shell variables in awk code, you're not quoting properly, and you've misunderstood how awk splits records into fields.

awk splits at spaces, however many there are. Then you put them in the index without any spaces because $1$2 has them stripped out. Using $1 " " $2 would squeeze multiple spaces into single ones.

Code:
echo 'ab c' | awk 'NR==1 { a[$0]="abc"; if ( "ab c" in a ) { print "index exists"} }'
index exists

Treating the entire line as the index, you see spaces work fine.
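
The difference between the two splitting modes can be seen by bracketing a field. A small sketch with made-up input; the shell variable names are only for illustration.

```shell
# Default FS: runs of blanks separate fields, so no field can contain a space.
default_fs=$(printf '%s\n' 'a   b   c' | awk '{ print "[" $2 "]" }')
# FS="," : only commas separate fields, so surrounding spaces stay in the field.
comma_fs=$(printf '%s\n' 'a,  b  ,c' | awk -F, '{ print "[" $2 "]" }')
echo "$default_fs"   # [b]
echo "$comma_fs"     # [  b  ]
```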

Please reword the issue if I'm not understanding you...

edit: I re-read the problem. What is the FS you're using? I see the split() uses ,.
# 5  
Old 05-01-2015
My attempts to demonstrate what I thought was happening with simpler examples on the command line were flawed and introduced additional errors (I did reproduce the issue using awk -F, but did not post those examples), but I don't believe my "real" code has the same flaws.

I believe that I am quoting the shell variables correctly (otherwise the getline reads in my code below would fail), and because I specify an alternative delimiter in my awk statement, the issues with whitespace in array indices are not coming from either splitting or my understanding of splitting.

First I am reading in a CSV file that may have non-separating commas, whitespace, and other potentially problematic characters inside double-quoted fields. So I start off with a double-quote-counting parser and an alternative delimiter (the old ASCII unit separator from the punch card/paper tape days).

Code:
 
delim=$'\037' # ASCII Unit Separator (US)
 
awk '{ quote=0; for(i=1;i<=length;i++)
                   { ch=substr($0, i, 1)
                     if ( ch == "\"" ) quote=( ++quote % 2)
                         else if ( quote == 0 && ch == ",") ch="'"$delim"'"
                     printf "%s", ch }
       print ""
     }' "$scratchDir""My_Input_File" |

Next I have a rather cryptic and very long awk routine that transposes the data and then stacks 21 variable columns into two columns (a variable-name column and a variable-value column) and adds some other values from shell variables to the pipe. It is very long and very complicated (as well as cryptic), so I will omit that part (it is working exactly as expected).

The next part of the code manipulates the data based on values in three different configuration files. I think you will find that I am quoting the external file names from BASH correctly. If I don't get the nesting of single and double quotes exactly right, the reads fail.
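
For comparison, a sketch of the more common way to hand a shell value to awk: pass it with -v instead of splicing it into the program text, so the quoting of the awk script no longer depends on the file name. The temporary file, its contents, and the awk variable names `list` and `line` are invented for illustration; note that -v interprets backslash escapes in the value.

```shell
failListFile=$(mktemp)                       # stand-in for the real list
printf '%s\n' 'header' 'B1,Room 2,pos,x,50Hz,,,bad mount' > "$failListFile"

out=$(awk -v list="$failListFile" 'BEGIN {
    getline line < list                      # skip one header row
    while ((getline line < list) > 0) {      # > 0 stops on EOF and on errors
        split(line, a, ",")
        failMessage[a[1] a[2] a[3] a[4] a[5]] = a[8]
    }
    close(list)
    print failMessage["B1Room 2posx50Hz"]    # the space in "Room 2" is kept
}')
rm -f "$failListFile"
echo "$out"
```

Testing `(getline ...) > 0` also avoids an endless loop when the file cannot be read, because getline returns -1 in that case and a bare `while (getline < file)` treats that as true.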

The limit file contains a list of spec limits for different parameters (in the first column), for both areas considered sensitive and areas considered insensitive (two different columns). Then there is a file with a list of sensitive areas. Finally, there is a file with a list of known issues that I wish to substitute.
Edit: In response to the question in your edit, these config files I am reading in are comma-separated values. They do not have the issue of commas inside quotes, but one of them does have spaces in some rows of a few columns.

Code:
# Look up and apply limits.
awk -F "$delim" 'BEGIN { OFS=FS
                 # Read in Vibration Limits
                 getline < "'"$limitFile"'" # Header Row
                 while (getline < "'"$limitFile"'") { split( $0, a, ","); vLim[a[1]]=a[2]; vLim[a[1]"Sen"]=a[3] }
                 close("'"$limitFile"'")
                 # Read in Sensitive Bay List 
                 getline < "'"$bayListFile"'" # Header Row
                 while (getline < "'"$bayListFile"'") { split( $0, a, ","); sBL[a[1]a[2]]="Sen" }
                 close("'"$bayListFile"'") 
                 # Read in Known Fail List
                 getline < "'"$failListFile"'"; getline < "'"$failListFile"'"; getline < "'"$failListFile"'" # Header Rows
                 while (getline < "'"$failListFile"'") { split( $0, a, ","); i=a[1]a[2]a[3]a[4]a[5]; gsub ( " ", "", i ); failMessage[i]=a[8]
                     fs=a[6]; sub ( "^$", "0000 01 01 00 00 00", fs ); failStart[i]=fs
                     fe=a[7]; sub ( "^$", "9999 12 31 23 59 59", fe ); failEnd[i]=fe }
                 close("'"$failListFile"'") }    
         NR == 1 { print $0 } 
         NR > 1 { if ( $9 ~ /Hz/ ) { limit=vLim[$9sBL[$2$3]]; $11=limit # works great since $9 $2 and $3 never have whitespace
                      if ( $10 > limit) { $12 = "Fail"; i=$2$3$4$5$9; gsub ( " ", "", i ) # sometimes $3, $4, or $5 have spaces.  My code now works with the gsub removing spaces from the index but my purpose in posting was to better understand how Awk handles whitespace in indexes ("help me understand more" not "help me write a script")
                         if ( i in failMessage ) { now = mktime(year" "month" "day" "hour" "min" "sec) #I'm still writing this time aware part and it is not part of my question
                        now = mktime("2015 01 01 00 00 00") # debug
                            if ( now >= mktime(failStart[i]) && now <= mktime(failEnd[i]) ) {
                                $12 = "Known Fail"; if ($7 == "\"\"") gsub ( "\"$", "Known Fail: "failMessage[i]"\"", $7 ) #column 7 is a double quoted string.  When "null" it is actually ""
                                                    else gsub ( "\"$", "|Known Fail: "failMessage[i]"\"", $7 ) } } } #I convert "|" into DOS style new lines at a later time in my code.  When there is already a comment in $7 I want a new line between my known failure mode message
                      else $12 = "Pass"; print $0 }
                }' |
# Remove alternative delimiter
sed -e 's/'"$delim"'/,/g' > "$scratchDir""My Output File"

Of course a ton of code exists before and after what I included, but it is outside the context of my question.

Just as a reminder of the context of my question: the code works now that I am removing spaces from the index with the gsub commands (the parts highlighted in pink in the original post). My question is educational: what does awk do with whitespace in an index assignment?

Also, in general my questions on UNIX.com are not even about getting something to work but about getting something to work efficiently when parsing huge files. In this particular project I am dealing with about 10,000 files that end up in a ~200k line database, but in some other projects I am dealing with more than 50 million lines of data. In the case of this particular project I got the running time down from 70 min for a year of data to less than 5 min for 2 years of data. This involved moving a lot of BASH code to AWK, eliminating utility calls (using only built-ins in the interest of speed, even if more complex) and file interactions in any kind of loop or iterative part, and timing alternate versions of portions of my code in AWK, PERL, SED, BASH etc. and picking the best-performing one. I apologise if my questions about understanding how something works, or whether there are alternatives, are being misinterpreted as "this is broken, show me an implementation" types of questions. Generally when I ask these questions I have something working but I have a suspicion that there is a better/faster way.

Mike

Last edited by Michael Stora; 05-01-2015 at 04:01 PM..
# 6  
Old 05-01-2015
Well, I won't pick your code apart much more then. :)

To answer the question on whitespace, awk shouldn't be messing with them aside from the default FS being equivalent to [[:space:]][[:space:]]*.

With that in mind, your first call to awk uses the default FS which may lead to your data having trailing/leading whitespaces removed.

Print the indexes upon assignment and before the gsub in the latter awk to stderr and comb through it. It's probably what's happening.
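
One way to follow this suggestion is to print each key between brackets, together with its length, so that stray leading or trailing blanks become visible. A sketch with invented sample data; it assumes the system (or the awk in use) provides "/dev/stderr", and the `2>&1` is only there to capture the diagnostic in `out`.

```shell
out=$(printf '%s\n' 'B1,Room 2 ,pos' | awk -F, '{
    i = $1 $2 $3
    # Brackets expose whitespace that is otherwise invisible in the output.
    printf("key=[%s] length=%d\n", i, length(i)) > "/dev/stderr"
}' 2>&1)
echo "$out"
```

Here the trailing blank after "Room 2" shows up both inside the brackets and in the length of 12.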
# 7  
Old 05-01-2015
Trying to be more rigorous with attempts to demonstrate the question in simple code examples results in me being unable to duplicate the problem . . .

Now I am left to ponder whether the problem I thought I fixed with the gsub commands was even a problem, and whether I somehow accidentally fixed something else (entropy notwithstanding) . . .

Code:
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( a[1]a[2]a[3] in c) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( "abcd e fxyz" in c) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( "something else" in c ) print "index found" }'
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[a[1]a[2]a[3]]="value"; if ( $1$2$3 in c ) print "index found" }'
index found
 
$ echo "abc,d e f,xyz" | awk -F, 'NR==1 { a[1]=$1; a[2]=$2; a[3]=$3; c[$1$2$3]="value"; if ( a[1]a[2]a[3] in c ) print "index found" }'
index found

Mike

---------- Post updated at 12:31 PM ---------- Previous update was at 12:20 PM ----------

Quote:
Originally Posted by neutronscott
Well, I won't pick your code apart much more then. :)

To answer the question on whitespace, awk shouldn't be messing with them aside from the default FS being equivalent to [[:space:]][[:space:]]*.

With that in mind, your first call to awk uses the default FS which may lead to your data having trailing/leading whitespaces removed.

Print the indexes upon assignment and before the gsub in the latter awk to stderr and comb through it. It's probably what's happening.
Thanks for the suggestion, but I think the error is creeping in through another source, since my first awk invocation uses only $0, which appears never to get parsed, and my second (omitted) awk statement uses the alternative delimiter.
Code:
$ echo " this is a line with beginning and end spaces plus several internal spaces ( ) " | awk 'NR==1 {print "x"$0"x"}'
x this is a line with beginning and end spaces plus several internal spaces ( ) x

Actually, AWK doesn't collapse the spaces, but UNIX.com's rendering of this post does, so you'll have to take my word for it :)
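
The behaviour of $0 can also be checked with brackets instead of relying on how the forum renders spaces: awk leaves $0 untouched until a field is assigned, at which point the record is rebuilt from the fields joined by OFS. A minimal sketch with made-up input:

```shell
out=$(printf '%s\n' '  a   b  ' | awk '{
    print "[" $0 "]"     # record exactly as read
    $1 = $1              # assigning any field rebuilds $0 using OFS
    print "[" $0 "]"     # leading, trailing and repeated blanks are gone
}')
echo "$out"
```

The first line printed is `[  a   b  ]` and the second is `[a b]`, so a script that only reads $0 and never assigns a field keeps the original spacing.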

However, you may be right about the error creeping in from something else in my input file. I will look in that direction since I have exhausted other avenues.

BTW: I have already kicked myself several times for not starting with tab-separated values from the very beginning of the project :) Tabs never exist in my fields.

Mike

Last edited by Michael Stora; 05-01-2015 at 04:36 PM..