Fill in missing rows with zero to have uniform table


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Fill in missing rows with zero to have uniform table
# 8  
Old 03-23-2015
No, it is not a bug, but I did not explain it well enough at the beginning.
Still, the block before NR==FNR is hard for me. When and how is this block executed at reading the stream of the two files file1 & file2?
Code:
{  
    i=$1    set i to the first field ($1) 
    $1=""   empty the first field     
    C[i]    Create an index i in array C (which will contain all possible $1 values in the END part)   
}

I think I understand the syntax each row, but not the orders as compared with read in A[i] and B[i] after NR==FNR.
Thanks a lot again!
# 9  
Old 03-23-2015
Hi, this block is executed for every line, for both file1 and file2...
This User Gave Thanks to Scrutinizer For This Post:
# 10  
Old 03-23-2015
Your stated requirements and your examples don't match!

You talk about wanting to create a list for S01 through S96, but your samples only show data for S01 through S12.

As Scrutinizer said, your sample file1 contains the line:
Code:
S05    1290    41074    135810

but your file1_padded contains the line:
Code:
S05    0    41074    135810

and your file_joined contains the line:
Code:
S05    0    41074    135810           S05    43427    68368    109344

I don't understand why the 1290 in your input file became 0 in your output files???

I can almost produce the output you said you want with:
Code:
awk '
FNR == 1 {
	f++
}
{	d[f, $1] = $0
	if($1 > m) m = $1
}
END {	for(i = 1; i <= 96; i++) {
		id = sprintf("S%02d", i)
		printf("%-37s %s\n",
			((1, id) in d) ? d[1,id] : id "    0    0    0",
			((2, id) in d) ? d[2,id] : id "    0    0    0")
		if(id == m) exit
	}
}' file[12]

I say almost because the output from the above is with the difference noted above marked in red:
Code:
S01    36407    53706    88540        S01    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S03    50133    59721    107923       S03    0    0    0
S04    0    0    0                    S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0                    S08    16257    41558    45595
S09    0    0    0                    S09    0    0    0
S10    11298    30165    40361        S10    1290    41074    135810
S11    18311    37266    135798       S11    0    0    0
S12    14567    35958    62691        S12    30731    60117    77166

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

If you want output for S01 through S096 instead of limiting output to the highest SXX value found in your input files, delete the two lines of code in the awk script shown in red.
This User Gave Thanks to Don Cragun For This Post:
# 11  
Old 03-23-2015
This is related to my understanding how awk works on the streams. I have difficulty to catch the order of the blocks executed when reading file1 & file2.
Code:
BEGIN{}
{i=$1 
$1=""         #To empty $1
}
NR==FNR{A[i]=$0; next}
{B[i]=$0}
END{}

The BEGIN{} and END{} blocks are obvious.
Code:
NR==FNR {A[i]=$0; next} # Only true when reading file1
{B[i]=$0}               # Only true when reading file2

Is the second block (i.e. i=$1; $1="") saved in memory when going through the two files?
$1="" is to used to empty $1, if not, the IDs in A[i] and B[i] will be concatenated IDs when that ID exists in both files.
But I do not get how $1="" works here.
Code:
awk -v n=12 '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
}
  {
    i=$1
#    print i"->"$1, FILENAME
#    $1=""                     #Commented on purpose for testing
  }
  NR==FNR {
    A[i]=$0
    next
  }
  {
    B[i]=$0
  }
  END {
    for(j=1; j<=n; j++) {
      i=sprintf("S%02d",j)
       print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
    } 
} ' file1 file2

Output is:
Code:
S01S01    36407    53706    88540        S01    0    0    0
S02S02    69343    87098    87316        S02S02    14644    37964    70990
S03S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04S04    52922    84177    87225
S05S05    1290    41074    135810        S05S05    43427    68368    109344
S06S06    11285    30164    40361        S06S06    4212    15654    15664
S07S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08S08    16257    41558    45595
S09    0    0    0        S09    0    0    0
S10S10    11298    30165    40361        S10S10    1290    41074    135810
S11S11    18311    37266    135798        S11    0    0    0
S12S12    14567    35958    62691        S12S12    30731    60117    77166

1) How does $1="" works behind in awk?
2) It is confusing to me with i in A?A[i]:zero which seems to me as comparing the value with the key of the associate array? What did I miss?
Can you explain for me? Thanks!
----------------------------------------------------------------------------------------------------------------
Thanks Don!
Your reply was posted while I was preparing mine!
There are 96 rows in my real file, but I only used 12 for the post to cut the space.
My apology for the S05 confusion, due to many copy-paste.

Last edited by yifangt; 03-23-2015 at 06:13 PM..
# 12  
Old 03-23-2015
If we take Scrutinizer's code:
Code:
awk '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
  }
  {
    i=$1
    $1=""
    C[i]
  }
  NR==FNR {
    A[i]=$0
    next
  } 
  {
    B[i]=$0
  }
  END {
    for(i in C)
      print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
  }
' file1 file2

and duplicate the code you seem to be having problems understanding into a longer script that does exactly the same thing:
Code:
awk '
  BEGIN {
    OFS="\t"
    zero=OFS 0 OFS 0 OFS 0
  }
  NR==FNR {
    i=$1
    $1=""
    C[i]
    A[i]=$0
    next
  } 
  {
    i=$1
    $1=""
    C[i]
    B[i]=$0
  }
  END {
    for(i in C)
      print i (i in A?A[i]:zero), "", i (i in B?B[i]:zero)
  }
' file1 file2

does it make it any easier for you to understand?

If the first field is not cleared before $0 is saved in A[i] and B[i], printing i and A[i] or B[i] in the END clause prints the contents of field one from the input lines twice (as you see in your output when you removed the following line:
Code:
$1=""

).

The expression print i (i in A?A[i]:zero) prints the contents of i followed by:
  • the contents of A[i] if i is an index in A[], or
  • the contents of the variable zero if i is NOT an index in A[].

Factoring out the common code (as Scrutinizer did) makes it easier to make updates. If one of those three lines needs to be updated at some point in the future, you don't have to notice that the same code has to be changed twice if you only have one copy of the code in your script in the first place.
This User Gave Thanks to Don Cragun For This Post:
# 13  
Old 03-23-2015
Thanks Don!
Now the topic has deviated from the original title!
I was trying to catch the two points of awk in more fundamental details these days, not only merely remember the syntax, then swallow and use it;
Point 1: how awk process the two files line by line in this very example? The block
Code:
{
  i=$1
  $1=""
  C[i]
}

gave me a very good example to understand it, especially when I tried your expanded code from Scruntinizer's, the output is not sorted as I am expecting as my two original files are sorted in ascending order respectively:
Code:
S10    11298    30165    40361        S10    1290    41074    135810
S01    36407    53706    88540        S01    0    0    0
S11    18311    37266    135798        S11    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S12    14567    35958    62691        S12    30731    60117    77166
S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08    16257    41558    45595

It seems to me stack/heap is related, but not quite sure here.
Point 2: I understand the ternary comparison i in A?A[i]:zero), but not quite well with why the associate array in awk, i in A means "if A has an index i", especially when you compare with similar perl hash, python dictionary, or C-style array. There must be a good reason for this.

Last edited by yifangt; 03-23-2015 at 07:18 PM..
# 14  
Old 03-24-2015
Quote:
Originally Posted by yifangt
Thanks Don!
Now the topic has deviated from the original title!
I was trying to catch the two points of awk in more fundamental details these days, not only merely remember the syntax, then swallow and use it;
Point 1: how awk process the two files line by line in this very example? The block
Code:
{
  i=$1
  $1=""
  C[i]
}

gave me a very good example to understand it, especially when I tried your expanded code from Scruntinizer's, the output is not sorted as I am expecting as my two original files are sorted in ascending order respectively:
Code:
S10    11298    30165    40361        S10    1290    41074    135810
S01    36407    53706    88540        S01    0    0    0
S11    18311    37266    135798        S11    0    0    0
S02    69343    87098    87316        S02    14644    37964    70990
S12    14567    35958    62691        S12    30731    60117    77166
S03    50133    59721    107923        S03    0    0    0
S04    0    0    0        S04    52922    84177    87225
S05    1290    41074    135810        S05    43427    68368    109344
S06    11285    30164    40361        S06    4212    15654    15664
S07    11285    30164    40361        S07    0    0    0
S08    0    0    0        S08    16257    41558    45595

It seems to me stack/heap is related, but not quite sure here.
Point 2: I understand the ternary comparison i in A?A[i]:zero), but not quite well with why the associate array in awk, i in A means "if A has an index i", especially when you compare with similar perl hash, python dictionary, or C-style array. There must be a good reason for this.
Did you try the code I suggested? If so, did it do what you want? Do you understand how it works?

I wrote it explicitly because Scrutinizer's code does not guarantee ordered output (which you did not explicitly, originally request), does not produce any output for SXX values that are not present in either of your input files (which was not clear in your original request), and uses tabs in places where your input files use spaces . (I find the output with the tabs used by Scrutinzer easier to read, but that is not what you asked for; I tried to produce the output you said you wanted.)

I will claim to be a reasonably proficient writer of programs using C, shell, awk, and most other POSIX standard utilities. I used to also be a capable writer of FORTRAN, COBOL, ALGOL, Lisp, Snobol, Java, PL/I, and a few dozen various assembler languages; but if I wanted to use these languages again now, I'd need to take some time to familiarize myself with them again. I don't claim to have ever been a passable author of perl or python code. (So I won't attempt to compare C arrays nor awk arrays to a perl hash nor a python library.)

In C you declare an array and its dimensions before you use it. Your C compiler will determine how many dimensions an array can have. In C, an index into an array is always an integral type and indices (for each dimension in the array) start at 0 and run up to one less than the declared size for that dimension.

In awk, you never declare an array. Whenever you reference an array element, an array is created (if the array didn't already exist) and the index used in that reference is added to the array. Array indices in awk can be integral values, floating point values (although floating point types are really dangerous unless you really know what you're doing), or strings. Integral array indices in awk need not be contiguous. The awk index in array operator returns true (1) if there is the element array[index] and returns false (0) otherwise. In awk, there is no inherent order in array elements and the awk for loop that walks through all of the elements of an array:
Code:
for(index in array) {
	printf("array[%s] is set to %s\n", index, array[index])
}

does so in what should be assumed to be a random order. (Although with any awk I've used, if you walk through the elements of an array twice without adding or removing elements to or from the array between those walks, you'll get elements listed in the same order on both walks.) Nonetheless, if you use contiguous integral values when you create array elements, you can walk through that array in order using a normal awk for loop:
Code:
for(i = low_index; i <= high_index; i++) {
	printf("array[%d] is set to %s\n", i, array[i])
}

Note also that awk arrays only have one dimension. You can simulate multidimensional arrays in awk by separating dimension indices with a comma (or, equivalently, the awk variable SUBSEP). (Note that SUBSEP has a default value assigned by awk, but the value used varies from system to system. On many systems, the SUBSEP string is defined to be the ASCII FS character (octal value 034).) And, whereas the commands:
Code:
print $1, $2, $3

and:
Code:
print $1 OFS $2 OFS $3

are equivalent, the array index specified by:
Code:
array[$1, $2, $3]

and by:
Code:
array[$1 SUBSEP $2 SUBSEP $3]

are also equivalent.
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Fill in missing values

Hi, I have a data sample as shown below. I want to fill in the left column so that the line will be continuous. For example, between 1 and 5 should be 2,3,4. And corresponding values in the right column will be 0. Thus the expected data should look like that: 1 1 1 10 1 2 1 3 1 5 1 6 2 0... (6 Replies)
Discussion started by: theanh0508
6 Replies

2. UNIX for Beginners Questions & Answers

Fill in missing hours and interpolate values using awk.

I have a time series data like this 40754,35.6931,51.3092,201610160700,21.0 40754,35.6931,51.3092,201610160800,23.0 40754,35.6931,51.3092,201610160900,24.0 40754,35.6931,51.3092,201610161000,24.0 40754,35.6931,51.3092,201610161300,25.0 40754,35.6931,51.3092,201610161400,23.0... (6 Replies)
Discussion started by: emirzaei
6 Replies

3. Shell Programming and Scripting

How to get the missing date and day in a table?

Hi Am using unix Aix Ksh Have Created table called vv and i have inserted two date Select * from vv; Output :- New_date 21/02/2013 24/02/2013 I have tried Using One query but Unsuccessful so far.. SELECT l.new_date + '1 day' as miss from vv as l (7 Replies)
Discussion started by: Venkatesh1
7 Replies

4. Shell Programming and Scripting

Fill in missing Data

hello everyone, I have a task to input missing data into a file. example of my data below: Wed Feb 01 09:00:02 EST 2012,,,0.4,0.3,,0.3,,0.3,,0.5,,0.3,,,0.4,0.3, Wed Feb 01 09:00:11 EST 2012,,,,,,,0.2,,,,,,,,,, Wed Feb 01 09:00:22 EST... (23 Replies)
Discussion started by: Nolph
23 Replies

5. Shell Programming and Scripting

Fill missing values with 2

Hi All, I have 100 .txt files which look like this: 3 4 5 6 7 Now, some files have some numbers missing in them and they look like this: 4 5 6 (6 Replies)
Discussion started by: shoaibjameel123
6 Replies

6. Shell Programming and Scripting

Fill missing numbers in second column with zeros

Hi All, I have 100 files with names like this: 1.dat, 2.dat, 3.dat until 100.dat. My dat files look like this: 42323 0 438939 1 434 0 0.9383 3434 120.23 3 234 As you can see in the second column, some numbers are missing. I want to fill those missing places with 0's in all... (3 Replies)
Discussion started by: shoaibjameel123
3 Replies

7. Shell Programming and Scripting

Compare columns and rows with template, and fill empty slots.

Hi, I'm working on a script that will take the contents of a file, that is in a row and column format, and compare it to a arrangment file. Such that if there is any or all blanks in my content file, the blank will be filled with a flag and will retain the row and column configuration. Ex. ... (2 Replies)
Discussion started by: hizzle
2 Replies

8. Shell Programming and Scripting

fill in missing columns

It can't be that hard, but I just can't figure it out: I have file like: File Sub-brick M_1 S_1 M_2 S_2 M_4 S_4 ... xxx 1 214 731 228 621 132 578 ... and would like to get 0 0 where M_3 S_3 is missing xxx 1 214 731 228 621 0 0 132 578 ... I wrote following script, but can't figure out... (3 Replies)
Discussion started by: avvk
3 Replies

9. Shell Programming and Scripting

Fill in missing numbers in range

I need to edit a list of numbers on the following form: 1 1.0 2 1.4 5 2.1 7 1.9 I want: 1 1.0 2 1.4 3 0.0 4 0.0 5 2.1 6 0.0 7 1.9 (i want to add the missing number in column 1 together with 0.0 in column 2). I guess it is rather trivial but i didn't even manage to read column... (5 Replies)
Discussion started by: bistru
5 Replies
Login or Register to Ask a Question