Sum based on certain column

07-19-2013

Mr Don,

First step:
1. Need to sort based on column 1 and column 6
2. For every line in the input file where the 1st field and the 6th field are the same, print the 1st field, the 6th field, and the sum of the values in the 3rd field

I think that if we first sort on the date (column 1) and column 6, the output should come out in order.

thanks

radius

07-19-2013

Quote:

Originally Posted by radius

Your sample data happens to sort correctly on field 1 only because the month and year are the same on every line and the days in your sample all fall within the first nine days of the month. Also, your sample input is not sorted on fields 1 and 6; it is sorted only on field 1 (and possibly on one or more of fields 2 through 5 as secondary keys).

Do you want the output sorted by increasing year, month, day of month, and input column 6 value? Or do you want the output sorted by increasing alphanumeric value of input columns 1 and 6? To be clear: if the following dates are included in your output, should the output order be:

Code:

1/1/2013
10/30/2012
12/6/2012
2/10/2012
7/16/2013

or:

Code:

2/10/2012
10/30/2012
12/6/2012
1/1/2013
7/16/2013

Don Cragun

07-20-2013

The last one, Mr Don: sort by date.

radius

07-20-2013

You could try something like the following:

Code:

awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'

As always, if you are going to run this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk or /bin/awk.

Because awk uses sequences of spaces, slashes, and tabs as field separators, the date field is split into month, day, and year fields as each input line is read. The v[] array holds the sum of the values in column 3 (field 5 after the date is split). Its subscript is the month, a space, the day of the month, a space, the year, four spaces (the value of OFS), and the contents of the 6th column (field 8 after the date is split). The END clause prints each subscript followed by the sum accumulated for it.

Because the date's slashes have been replaced by spaces in the awk output, the sort command can sort on the individual numeric components of the date and then on the alphanumeric contents of the input file's 6th column. After sorting, the sed command converts the first two spaces on each output line back to slashes, restoring the date field to its original format.

The above script produces the output you said you wanted in the first message in this thread, except that some of the sums (marked in red in the original post) are rounded differently than in your example:

Code:

1/1/2013    X1    1012.909698
1/1/2013    X2    600.8333588
1/2/2013    X1    844.2973022
1/2/2013    X2    833.9300537
1/3/2013    X1    563.6917419
1/3/2013    X2    632.0749969
1/4/2013    X1    48.33055687

Note that the log() calculations in the awk printf statement are there to calculate the varying number of decimal places you showed in your desired output. That printf statement could be simplified if you were willing to accept a constant number of digits after the decimal point in the printed sums.
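A minimal sketch of that simplification, assuming a fixed four digits after the decimal point is acceptable. The sample lines are made up for illustration (column 1 is the date, column 3 the value to sum, column 6 the label) and are not the data from this thread:

```shell
# Hypothetical sample input: column 1 = M/D/YYYY date, column 3 = value to sum,
# column 6 = group label; the remaining columns are filler.
sample='1/1/2013 a 1.5 b c X1
1/1/2013 a 2.25 b c X1
1/2/2013 a 4 b c X2'

fixed_sums=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = "[ /\t]+"; OFS = "    "; s = " " }
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for (i in v)
                printf("%s%s%.4f\n", i, OFS, v[i])   # constant 4 decimals, no log()
}' | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#')

printf '%s\n' "$fixed_sums"
```

With this sample the two X1 lines for 1/1/2013 are combined into one output line with the sum 3.7500, and the X2 line prints 4.0000.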

Alternatively, you could split the date field, sort the input into the desired output order, re-form the date field in the sorted input, and use the procedures outlined in the thread bakunin referenced. I haven't made any attempt to compare the efficiency of these alternative approaches.

Hope this helps,
Don

Don Cragun

07-20-2013

perfect...

Could you explain the code above, Mr Don?
And if we want to sum the values in column 4 (instead of column 3), how would the code change?

Another case: we want to sum the values in column 3, but based on column 1 and column 7 (instead of column 6).

By the way, many thanks, Master Don.

---------- Post updated at 01:56 AM ---------- Previous update was at 01:32 AM ----------

Right now, I just use a simple awk command to move column 7 to column 6 and then run your code, and it works. But I would like to learn the master code of yours, Mr Don. Eager to learn.

radius

07-20-2013

Quote:

Originally Posted by radius

But I would like to learn the master code of yours, Mr Don. Eager to learn.

That is a laudable attitude.

Quote:

Originally Posted by Don Cragun

Code:

awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'

First, to appreciate what each part of the above command (actually a pipeline of three different commands) does, you might want to redirect the output of each step into a file, examine it, and then run that file through the next step to see what that step does. I suggest you use a small input file so that it is easy to take in the whole output and notice any changes. You can even use several slightly altered versions of an input file to see how the changes affect the outcome.

In short: these are only files, which you can copy as often as you like, so play around.

Code:

awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 > tempfile1

sort -k3n,3 -k1n,1 -k2n,2 -k4,4 tempfile1 > tempfile2

sed 's# #/#;s# #/#' tempfile2 > tempfile3

Let us start with the last part. "sed" is a non-interactive text editor (a stream editor). It is given a script of changes to make to its input text and applies them to every line. Here the script contains two change rules:

Code:

s# #/#
s# #/#

These are substitution rules: each searches for the pattern given in its first part and replaces the match with what is given in its last part:

Code:

s<delimiter><pattern-to-search-for><delimiter><replacement><delimiter>

Usually "/" is used as the delimiter, but since Don wanted "/" as the replacement text, using it as the delimiter too would have required escaping it, so he chose "#" instead. Each rule replaces a space character with a "/". The rule appears twice because, by default, a rule substitutes only the first occurrence on each line, and he wanted to change the first two.
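To see the two rules at work on a single line, here is a small sketch. The input line is written in the shape the awk step produces (date parts separated by single spaces), using a value from Don's output; the second command is the same substitution written with "/" as the delimiter and the replacement escaped, only for comparison:

```shell
# One line as it looks after the awk and sort steps.
line='1 2 2013    X1    844.2973022'

# Don's version: "#" as delimiter, each rule replaces the first remaining space.
restored=$(printf '%s\n' "$line" | sed 's# #/#;s# #/#')

# Equivalent version with "/" as delimiter; the replacement "/" must be escaped.
restored_escaped=$(printf '%s\n' "$line" | sed 's/ /\//;s/ /\//')

printf '%s\n' "$restored"
```

Both commands print 1/2/2013    X1    844.2973022; the spaces between the later columns stay untouched because each rule stops after its first match.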

Code:

sort -k3n,3 -k1n,1 -k2n,2 -k4,4 tempfile1 > tempfile2

This sorts the output. I suggest you read the man pages of all the commands used, but in short: Don constructs a sort key for the date. As the date format is "M/D/Y", he sorts first on the year (field 3), then on the month (field 1), then on the day (field 2). Only then does he sort on field 4, which holds the contents of the original column 6. All keys except the last are compared numerically.
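As a quick check of those sort keys, the sketch below feeds sort the five dates Don asked about earlier in the thread, written with spaces instead of slashes as they appear at this point in the pipeline. The X1 label on each line is only a placeholder for the fourth field:

```shell
# Dates from Don's question, in "month day year" form, plus a placeholder label.
sorted=$(printf '%s\n' \
    '1 1 2013    X1' \
    '10 30 2012    X1' \
    '12 6 2012    X1' \
    '2 10 2012    X1' \
    '7 16 2013    X1' |
    sort -k3n,3 -k1n,1 -k2n,2 -k4,4)   # year, then month, then day, then label

printf '%s\n' "$sorted"
```

The result is in calendar order (2/10/2012, 10/30/2012, 12/6/2012, 1/1/2013, 7/16/2013), which is the second ordering in Don's question.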

Finally, the core piece: a really elegant awk script, which consists of three parts.

Code:

BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}

awk processes input files line by line. The middle part:

Code:

{       v[$1 s $2 s $3 OFS $8] += $5 }

is executed for every line of the input file. It adds the content of the field to be summed to an element of an associative array, using the key value(s) as the array index. This way, lines with identical key values are summed automatically.
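The same line can be adapted to sum or group on other columns. Because the date is split into three fields, every original column after the first is shifted by two: column 3 is $5, column 4 is $6, column 6 is $8, column 7 is $9. The following is only a sketch of that idea, not Don's code: the variable names sumcol and keycol and the sample data are made up for illustration, and the sums are printed with a plain %g instead of the log() based precision.

```shell
# Hypothetical 7-column sample; column 1 is the M/D/YYYY date.
sample='1/1/2013 a 1.5 10 c X1 Y1
1/1/2013 a 2.5 20 c X1 Y2
1/1/2013 a 4 30 c X2 Y1'

# sumcol and keycol are illustrative names holding ORIGINAL column numbers.
# Splitting the date adds 2 to every column after the first, hence "+ 2".
col_sums=$(printf '%s\n' "$sample" | awk -v sumcol=4 -v keycol=7 '
BEGIN { FS = "[ /\t]+"; OFS = "    "; s = " " }
{       v[$1 s $2 s $3 OFS $(keycol + 2)] += $(sumcol + 2) }
END {   for (i in v) printf("%s%s%g\n", i, OFS, v[i]) }
' | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#')

printf '%s\n' "$col_sums"
```

Here column 4 is summed per date and column 7, so the sample yields 40 for Y1 and 20 for Y2. Setting sumcol=3 and keycol=6 reproduces the grouping of the original script.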

The first part:

Code:

BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}

is executed once, before the first line of the input file is read. It sets the "Field Separator" (FS), the "Output Field Separator" (OFS), and a variable "s" that holds a single space. When you use "$1" (field 1) or "$2" (field 2) in an awk script, awk has to know how to separate field 1 from field 2. It does so by splitting the input line at each "field separator". By default, fields are separated by runs of whitespace, but Don redefines FS here as a regular expression that matches any run of spaces, slashes, and tabs, so the date is split into separate month, day, and year fields.
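A small sketch to make the field numbering visible: it prints every field awk sees for one line when FS is set as above. The input line is a made-up example in the shape described in this thread (date in column 1, value in column 3, label in column 6):

```shell
# Hypothetical input line with six space-separated columns.
line='1/2/2013 a 844.2973022 b c X1'

fields=$(printf '%s\n' "$line" | awk '
BEGIN { FS = "[ /\t]+" }
{       for (n = 1; n <= NF; n++)
                printf("$%d=%s\n", n, $n)   # show each field number and value
}')

printf '%s\n' "$fields"
```

The six columns become eight fields: $1 to $3 hold the month, day, and year, the value from column 3 lands in $5, and the label from column 6 lands in $8.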

The last part

Code:

END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}

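is executed once, after the last line of the input has been processed. It is a simple for loop that prints, in a formatted way, each element of the associative array built up in the middle part. The order in which "for (i in v)" visits the elements is unspecified, which is why the output is passed to sort afterwards.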
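The third printf argument, 9 - int(log(v[i]) / log(10)), is the number of decimals passed to the "*" in "%.*f". The sketch below evaluates just that expression for three of the sums from Don's output. It assumes positive values (log() of zero or a negative number is not usable here), and a value that is an exact power of ten may come out one decimal off because of floating-point rounding:

```shell
# Three sums taken from the output shown earlier in this thread.
decimals=$(printf '%s\n' 1012.909698 600.8333588 48.33055687 | awk '
{       # int(log(x)/log(10)) is the number of digits before the point, minus 1
        printf("%s -> %d decimals\n", $1, 9 - int(log($1) / log(10)))
}')

printf '%s\n' "$decimals"
```

Each value gets ten significant digits in total: 6 decimals for 1012.909698, 7 for 600.8333588, and 8 for 48.33055687.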

I hope this helps.

bakunin

bakunin
