Sum elements of 2 arrays excluding labels


 
# 8  
Old 03-29-2018
Hi.

I like awk solutions. However, I also like packaged solutions. In this case GNU datamash can do the grouping and summing with:
Code:
datamash -g 1 sum 2,3

which will sum fields 2 and 3 for items in groups of field 1.
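For readers without datamash installed, here is a minimal awk sketch of the same group-and-sum idea. It is an alternative for comparison, not the datamash method described in this post; the function name group_sum is hypothetical, and the demo rows are taken from the sample data shown further down.

```shell
#!/usr/bin/env bash
# Sketch only: an awk alternative to "datamash -g 1 sum 2,3".
# group_sum is a hypothetical helper name, not part of the original script.
# Input: whitespace-delimited lines of the form  key value value
group_sum() {
  awk '
    !($1 in seen) { seen[$1] = 1; order[++n] = $1 }   # remember first-seen order of keys
    { d[$1] += $2; a[$1] += $3 }                      # accumulate fields 2 and 3 per key
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%d\n", order[i], d[order[i]], a[order[i]]
    }
  ' "$@"
}

# Demo with a few rows from the sample data in this thread.
printf '%s\n' \
  'Sample4 9880 4472' \
  'Sample14 -1 -1' \
  'Sample4 13674 678' \
  'Sample14 -3 -3' \
  'Sample14 4 4' |
group_sum
```

Because awk keeps the sums in arrays keyed by field 1, the input does not need to be sorted first; the output follows the order in which each key was first seen.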

Simple as this appears, there are additional complexities. First, datamash, like many standard utilities, expects TAB-delimited input by default. Although datamash can ignore headers itself, a single sed operation can both delete the header line and replace runs of spaces with a TAB. We can then append all the modified input files to a single input file, which is also what datamash likes.
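The preparation step can be sketched on its own as follows. This version uses tail and tr rather than sed, so it does not depend on sed understanding \t; the function name prep_files and the temporary file are hypothetical, used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: drop each file's header line and turn runs of spaces into one TAB.
# prep_files is a hypothetical helper name, not part of the original script.
prep_files() {
  local f
  for f in "$@"; do
    # tail -n +2 starts output at line 2, skipping the header.
    # tr -s translates spaces to TABs, then squeezes each run of TABs to one.
    tail -n +2 "$f" | tr -s ' ' '\t'
  done
}

# Demo on a throwaway file that mimics the layout of data1.
demo=$(mktemp)
printf '%s\n' 'SAMPLE    DERIVED    ANCESTRAL' 'Sample4    9880    4472' > "$demo"
prep_files "$demo"
rm -f "$demo"
```

Redirecting the function's output, for example prep_files data* > all-data, would give the single combined file. Note that a line ending in spaces would end up with a trailing TAB under this approach.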

As you can imagine, grouping is best and easiest when the lines for each group are collected together. There is a datamash option for such sorting, but your group names mix alphabetic and numeric characters -- perhaps called a hybrid string -- which a plain lexical sort would order as Sample1, Sample10, Sample11, ..., Sample2. A program that can handle that is msort.
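If msort is not available, the standard sort utility can produce the same ordering for this particular data. This is a sketch under a loud assumption: every key starts with the six-character prefix "Sample", so the numeric part begins at character 7 of field 1. The function name sort_by_suffix is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: order TAB-delimited lines by the number that follows "Sample".
# sort_by_suffix is a hypothetical helper name, not part of the original script.
# Assumption: every key is "Sample" followed by digits (prefix length 6).
sort_by_suffix() {
  # -t TAB    : fields are TAB-delimited
  # -k1.7,1n  : key runs from character 7 of field 1 to the end of field 1,
  #             and is compared numerically
  LC_ALL=C sort -t "$(printf '\t')" -k1.7,1n "$@"
}

# Demo: lexical order would put Sample10 before Sample2.
printf 'Sample10\t1\t1\nSample2\t2\t2\nSample1\t3\t3\n' | sort_by_suffix
```

This is less general than msort's hybrid comparison, since it breaks as soon as the prefix length varies.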

This data preparation can be combined into a loop that can handle a number of data files. Here we have added 3 additional data files as an illustration. The script uses as input all file names that begin with the string data -- data1, data2, etc.

Then we can run the command as noted above.

If we want to make the output pretty, we can add a header and use a simple Perl script called align, which aligns fields automatically but can also be directed to align each column left, center, right, etc.
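For anyone without the align script, a rough stand-in can be written in awk. This is a sketch, not a reimplementation of align: it assumes exactly three TAB-delimited columns, left-aligns the first and right-aligns the other two, and the function name align_lrr is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: align three TAB-delimited columns (left, right, right).
# align_lrr is a hypothetical helper name; it is not the align script itself.
align_lrr() {
  awk -F '\t' '
    {
      # Store every cell and track the widest entry in each column.
      for (i = 1; i <= 3; i++) {
        cell[NR, i] = $i
        if (length($i) > w[i]) w[i] = length($i)
      }
    }
    END {
      # Build the format from the measured widths, e.g. "%-8s %13s %15s\n".
      fmt = "%-" w[1] "s %" w[2] "s %" w[3] "s\n"
      for (r = 1; r <= NR; r++)
        printf fmt, cell[r, 1], cell[r, 2], cell[r, 3]
    }
  ' "$@"
}

# Demo: a header plus one result row.
printf 'SAMPLE\tTOTAL DERIVED\tTOTAL ANCESTRAL\nSample14\t0\t0\n' | align_lrr
```

The whole input is held in memory so the widths can be measured before anything is printed, which is fine for small result tables like this one.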

With all that in mind, here is a script that shows these operations and the results:
Code:
#!/usr/bin/env bash

# @(#) s2       Demonstrate grouping, summing fields, many files, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen datamash msort align

FILES=${1-data*}
E=expected-output

# Start from a clean collection file; -f avoids an error on the first run.
rm -f all-data
pl " Input data files $FILES:"
head -n 5  $FILES

pl " Sample of file collection, TABBED, stripped header, etc.:"
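# Loop over the same files that were listed above, not a hard-coded pattern.
for file in $FILES
do
  # Delete the header, then replace runs of spaces with a TAB
  # (\t in the replacement is a GNU sed extension).
  sed '1d;2,$s/  */\t/g' "$file" >> all-data
done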
specimen 4:4:4 all-data

pl " Expected output:"
cat $E

pl " Results:"
echo "SAMPLE    TOTAL DERIVED   TOTAL ANCESTRAL" > f1
msort -j -q -l -n 1,1 -c hybrid all-data |
datamash -g 1 sum 2,3 |
tee -a f1

pl " Beautify results:"
align -alrr f1

exit 0

producing:
Code:
$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-5-amd64, x86_64
Distribution        : Debian 8.9 (jessie) 
bash GNU bash 4.3.30
specimen (local) 1.17
datamash (GNU datamash) 1.2
msort 8.53
align 1.7.0

-----
 Input data files data*:
==> data1 <==
SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    9880    4472

==> data2 <==
SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    13674    678

==> data3 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        -1      -1

==> data4 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        -3      -3

==> data5 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        4       4

-----
 Sample of file collection, TABBED, stripped header, etc.:
Edges: 4:4:4 of 29 lines in file "all-data"
Sample1 14352   0
Sample2 14352   0
Sample3 14352   0
Sample4 9880    4472
   ---
Sample1 14352   0
Sample2 14352   0
Sample3 14352   0
Sample4 13674   678
   ---
Sample13        13713   639
Sample14        -1      -1
Sample14        -3      -3
Sample14        4       4

-----
 Expected output:
SAMPLE    TOTAL DERIVED    TOTAL ANCESTRAL
Sample1    28704    0
Sample2    28704    0
Sample3    28704    0
Sample4    23554    5150
Sample5    23535    5169
Sample6    23547    5157
Sample7    23469    5235
Sample8    23477    5227
Sample9    23448    5256
Sample10    23434    5270
Sample11    23333    5371
Sample12    23477    5227
Sample13        23453   5251
Sample14        0       0

-----
 Results:
Sample1 28704   0
Sample2 28704   0
Sample3 28704   0
Sample4 23554   5150
Sample5 23535   5169
Sample6 23547   5157
Sample7 23469   5235
Sample8 23477   5227
Sample9 23448   5256
Sample10        23434   5270
Sample11        23333   5371
Sample12        23477   5227
Sample13        23453   5251
Sample14        0       0

-----
 Beautify results:
SAMPLE   TOTAL DERIVED TOTAL ANCESTRAL
Sample1          28704               0
Sample2          28704               0
Sample3          28704               0
Sample4          23554            5150
Sample5          23535            5169
Sample6          23547            5157
Sample7          23469            5235
Sample8          23477            5227
Sample9          23448            5256
Sample10         23434            5270
Sample11         23333            5371
Sample12         23477            5227
Sample13         23453            5251
Sample14             0               0

Here are some details about the utilities used:
Code:
datamash        command-line calculations (man)
Path    : /usr/local/bin/datamash
Version : 1.2
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help    : probably available with -h,--help
Repo    : Debian 8.9 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)
Home    : http://www.gnu.org/software/datamash (doc)

msort   sort records in complex ways (man)
Path    : /usr/bin/msort
Version : 8.53
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Repo    : Debian 8.9 (jessie) 
Home    : http://www.billposer.org/Software/msort.html (pm)
Home    : http://billposer.org/Software/msort.html (doc)

align   Align columns of text. (what)
Path    : ~/p/stm/common/scripts/align
Version : 1.7.0
Length  : 270 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Home    : http://kinzler.com/me/align/ (doc)
Modules : (for perl codes)
 Getopt::Std    1.10

Best wishes ... cheers, drl
# 9  
Old 03-29-2018
Awesome, drl! Great work, I appreciate all the effort.