Fill in missing rows with zero to have uniform table

03-23-2015

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts

No, it is not a bug, but I did not explain it well enough at the beginning.
Still, the block before NR==FNR is hard for me. When and how is this block executed at reading the stream of the two files file1 & file2?

Code:

{  
    i=$1    set i to the first field ($1) 
    $1=""   empty the first field     
    C[i]    Create an index i in array C (which will contain all possible $1 values in the END part)   
}

I think I understand the syntax each row, but not the orders as compared with read in A[i] and B[i] after NR==FNR.
Thanks a lot again!

yifangt

View Public Profile for yifangt

Find all posts by yifangt

03-23-2015

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Hi, this block is executed for every line, for both file1 and file2...

This User Gave Thanks to Scrutinizer For This Post:

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

03-23-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Your stated requirements and your examples don't match!

You talk about wanting to create a list for S01 through S96, but your samples only show data for S01 through S12.

As Scrutinizer said, your sample file1 contains the line:

Code:

S05    1290    41074    135810

but your file1_padded contains the line:

Code:

S05    0    41074    135810

and your file_joined contains the line:

Code:

S05    0    41074    135810           S05    43427    68368    109344

I don't understand why the 1290 in your input file became 0 in your output files???

I can almost produce the output you said you want with:

Code:

awk '
FNR == 1 {
	f++
}
{	d[f, $1] = $0
	if($1 > m) m = $1
}
END {	for(i = 1; i <= 96; i++) {
		id = sprintf("S%02d", i)
		printf("%-37s %s\n",
			((1, id) in d) ? d[1,id] : id "    0    0    0",
			((2, id) in d) ? d[2,id] : id "    0    0    0")
		if(id == m) exit
	}
}' file[12]

I say almost because the output from the above is with the difference noted above marked in red:

Code:

S01    36407    53706    88540        S01    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S03    50133    59721    107923       S03    0    0    0
S04    0    0    0                    S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0                    S08    16257    41558    45595
S09    0    0    0                    S09    0    0    0
S10    11298    30165    40361        S10    1290    41074    135810
S11    18311    37266    135798       S11    0    0    0
S12    14567    35958    62691        S12    30731    60117    77166

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

If you want output for S01 through S096 instead of limiting output to the highest SXX value found in your input files, delete the two lines of code in the awk script shown in red.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-23-2015

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts

This is related to my understanding how awk works on the streams. I have difficulty to catch the order of the blocks executed when reading file1 & file2.

Code:

BEGIN{}
{i=$1 
$1=""         #To empty $1
}
NR==FNR{A[i]=$0; next}
{B[i]=$0}
END{}

The BEGIN{} and END{} blocks are obvious.

Code:

NR==FNR {A[i]=$0; next} # Only true when reading file1
{B[i]=$0}               # Only true when reading file2

Is the second block (i.e. i=$1; $1="") saved in memory when going through the two files?
$1="" is to used to empty $1, if not, the IDs in A[i] and B[i] will be concatenated IDs when that ID exists in both files.
But I do not get how $1="" works here.

Code:

awk -v n=12 '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
}
  {
    i=$1
#    print i"->"$1, FILENAME
#    $1=""                     #Commented on purpose for testing
  }
  NR==FNR {
    A[i]=$0
    next
  }
  {
    B[i]=$0
  }
  END {
    for(j=1; j<=n; j++) {
      i=sprintf("S%02d",j)
       print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
    } 
} ' file1 file2

Output is:

Code:

S01S01    36407    53706    88540        S01    0    0    0
S02S02    69343    87098    87316        S02S02    14644    37964    70990
S03S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04S04    52922    84177    87225
S05S05    1290    41074    135810        S05S05    43427    68368    109344
S06S06    11285    30164    40361        S06S06    4212    15654    15664
S07S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08S08    16257    41558    45595
S09    0    0    0        S09    0    0    0
S10S10    11298    30165    40361        S10S10    1290    41074    135810
S11S11    18311    37266    135798        S11    0    0    0
S12S12    14567    35958    62691        S12S12    30731    60117    77166

1) How does $1="" works behind in awk?
2) It is confusing to me with i in A?A[i]:zero which seems to me as comparing the value with the key of the associate array? What did I miss?
Can you explain for me? Thanks!
----------------------------------------------------------------------------------------------------------------
Thanks Don!
Your reply was posted while I was preparing mine!
There are 96 rows in my real file, but I only used 12 for the post to cut the space.
My apology for the S05 confusion, due to many copy-paste.

Last edited by yifangt; 03-23-2015 at 06:13 PM..

yifangt

View Public Profile for yifangt

Find all posts by yifangt

03-23-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

If we take Scrutinizer's code:

Code:

awk '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
  }
  {
    i=$1
    $1=""
    C[i]
  }
  NR==FNR {
    A[i]=$0
    next
  } 
  {
    B[i]=$0
  }
  END {
    for(i in C)
      print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
  }
' file1 file2

and duplicate the code you seem to be having problems understanding into a longer script that does exactly the same thing:

Code:

awk '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
  }
  NR==FNR {
    i=$1
    $1=""
    C[i]
    A[i]=$0
    next
  } 
  {
    i=$1
    $1=""
    C[i]
    B[i]=$0
  }
  END {
    for(i in C)
      print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
  }
' file1 file2

does it make it any easier for you to understand?

If the first field is not cleared before $0 is saved in A[i] and B[i], printing i and A[i] or B[i] in the END clause prints the contents of field one from the input lines twice (as you see in your output when you removed the following line:

Code:

$1=""

).

The expression print i (i in A?A[i]:zero) prints the contents of i followed by:

the contents of A[i] if i is an index in A[], or
the contents of the variable zero if i is NOT an index in A[].

Factoring out the common code (as Scrutinizer did) makes it easier to make updates. If one of those three lines needs to be updated at some point in the future, you don't have to notice that the same code has to be changed twice if you only have one copy of the code in your script in the first place.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-23-2015

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts

Thanks Don!
Now the topic has deviated from the original title!
I was trying to catch the two points of awk in more fundamental details these days, not only merely remember the syntax, then swallow and use it;
Point 1: how awk process the two files line by line in this very example? The block

Code:

{
  i=$1
  $1=""
  C[i]
}

gave me a very good example to understand it, especially when I tried your expanded code from Scruntinizer's, the output is not sorted as I am expecting as my two original files are sorted in ascending order respectively:

Code:

S10    11298    30165    40361        S10    1290    41074    135810
S01    36407    53706    88540        S01    0    0    0
S11    18311    37266    135798        S11    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S12    14567    35958    62691        S12    30731    60117    77166
S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08    16257    41558    45595

It seems to me stack/heap is related, but not quite sure here.
Point 2: I understand the ternary comparison i in A?A[i]:zero), but not quite well with why the associate array in awk, i in A means "if A has an index i", especially when you compare with similar perl hash, python dictionary, or C-style array. There must be a good reason for this.

Last edited by yifangt; 03-23-2015 at 07:18 PM..

yifangt

View Public Profile for yifangt

Find all posts by yifangt

03-24-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by yifangt

Thanks Don!
Now the topic has deviated from the original title!
I was trying to catch the two points of awk in more fundamental details these days, not only merely remember the syntax, then swallow and use it;
Point 1: how awk process the two files line by line in this very example? The block

Code:

{
  i=$1
  $1=""
  C[i]
}

gave me a very good example to understand it, especially when I tried your expanded code from Scruntinizer's, the output is not sorted as I am expecting as my two original files are sorted in ascending order respectively:

Code:

S10    11298    30165    40361        S10    1290    41074    135810
S01    36407    53706    88540        S01    0    0    0
S11    18311    37266    135798        S11    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S12    14567    35958    62691        S12    30731    60117    77166
S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08    16257    41558    45595

It seems to me stack/heap is related, but not quite sure here.
Point 2: I understand the ternary comparison i in A?A[i]:zero), but not quite well with why the associate array in awk, i in A means "if A has an index i", especially when you compare with similar perl hash, python dictionary, or C-style array. There must be a good reason for this.

Did you try the code I suggested? If so, did it do what you want? Do you understand how it works?

I wrote it explicitly because Scrutinizer's code does not guarantee ordered output (which you did not explicitly, originally request), does not produce any output for SXX values that are not present in either of your input files (which was not clear in your original request), and uses tabs in places where your input files use spaces . (I find the output with the tabs used by Scrutinzer easier to read, but that is not what you asked for; I tried to produce the output you said you wanted.)

I will claim to be a reasonably proficient writer of programs using C, shell, awk, and most other POSIX standard utilities. I used to also be a capable writer of FORTRAN, COBOL, ALGOL, Lisp, Snobol, Java, PL/I, and a few dozen various assembler languages; but if I wanted to use these languages again now, I'd need to take some time to familiarize myself with them again. I don't claim to have ever been a passable author of perl or python code. (So I won't attempt to compare C arrays nor awk arrays to a perl hash nor a python library.)

In C you declare an array and its dimensions before you use it. Your C compiler will determine how many dimensions an array can have. In C, an index into an array is always an integral type and indices (for each dimension in the array) start at 0 and run up to one less than the declared size for that dimension.

In awk, you never declare an array. Whenever you reference an array element, an array is created (if the array didn't already exist) and the index used in that reference is added to the array. Array indices in awk can be integral values, floating point values (although floating point types are really dangerous unless you really know what you're doing), or strings. Integral array indices in awk need not be contiguous. The awk index in array operator returns true (1) if there is the element array[index] and returns false (0) otherwise. In awk, there is no inherent order in array elements and the awk for loop that walks through all of the elements of an array:

Code:

for(index in array) {
	printf("array[%s] is set to %s\n", index, array[index])
}

does so in what should be assumed to be a random order. (Although with any awk I've used, if you walk through the elements of an array twice without adding or removing elements to or from the array between those walks, you'll get elements listed in the same order on both walks.) Nonetheless, if you use contiguous integral values when you create array elements, you can walk through that array in order using a normal awk for loop:

Code:

for(i = low_index; i <= high_index; i++) {
	printf("array[%d] is set to %s\n", i, array[i])
}

Note also that awk arrays only have one dimension. You can simulate multidimensional arrays in awk by separating dimension indices with a comma (or, equivalently, the awk variable SUBSEP). (Note that SUBSEP has a default value assigned by awk, but the value used varies from system to system. On many systems, the SUBSEP string is defined to be the ASCII FS character (octal value 034).) And, whereas the commands:

Code:

print $1, $2, $3

and:

Code:

print $1 OFS $2 OFS $3

are equivalent, the array index specified by:

Code:

array[$1, $2, $3]

and by:

Code:

array[$1 SUBSEP $2 SUBSEP $3]

are also equivalent.

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

Shell Programming and Scripting

Fill in missing rows with zero to have uniform table

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Fill in missing values

Discussion started by: theanh0508

2. UNIX for Beginners Questions & Answers

Fill in missing hours and interpolate values using awk.

Discussion started by: emirzaei

3. Shell Programming and Scripting

How to get the missing date and day in a table?

Discussion started by: Venkatesh1

4. Shell Programming and Scripting

Fill in missing Data

Discussion started by: Nolph

5. Shell Programming and Scripting

Fill missing values with 2

Discussion started by: shoaibjameel123

6. Shell Programming and Scripting

Fill missing numbers in second column with zeros

Discussion started by: shoaibjameel123

7. Shell Programming and Scripting

Compare columns and rows with template, and fill empty slots.

Discussion started by: hizzle

8. Shell Programming and Scripting

fill in missing columns

Discussion started by: avvk

9. Shell Programming and Scripting

Fill in missing numbers in range

Discussion started by: bistru