Sum based on certain column


 
# 8  
Old 07-19-2013
Mr Don,

First step:
1. Need to sort based on column 1 and column 6
2. For every line in the input file where the 1st field and the 6th field are the same, print the 1st field, the 6th field, and the sum of the values in the 3rd field

i think if we already sort the date (column1) and column6 then the output should be in order


thanks
# 9  
Old 07-19-2013
Quote:
Originally Posted by radius
Mr Don,

First step:
1. Need to sort based on column 1 and column 6
2. For every line in the input file where the 1st field and the 6th field are the same, print the 1st field, the 6th field, and the sum of the values in the 3rd field

i think if we already sort the date (column1) and column6 then the output should be in order


thanks
Your sample data happens to sort correctly on field 1 only because the month and year for all lines are the same and the days in your input sample are all from the 1st nine days of the month. And your sample input data is not sorted on fields 1 and 6; it is only sorted on field 1 (and possibly on one or more of fields 2 through 5 as secondary keys).

Do you want the output sorted by increasing year, month, day of month, and input column 6 value? Or, do you want the output sorted by increasing alphanumeric value of input columns 1 and 6? To be clear; if the following dates are included in your output, should the output order be:
Code:
1/1/2013
10/30/2012
12/6/2012
2/10/2012
7/16/2013

or:
Code:
2/10/2012
10/30/2012
12/6/2012
1/1/2013
7/16/2013

# 10  
Old 07-20-2013
the last one Mr Don...sort the date
# 11  
Old 07-20-2013
You could try something like the following:
Code:
awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'

As always, if you are going to run this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk or /bin/awk.

By having awk use sequences of spaces, slashes, and tabs as field separators, the date field is split into month, day, and year fields as input lines are read. The subscript used for the v[] array (which contains the sum of the values in column 3 [field 5 after splitting the date field]) is the month followed by a space followed by the day of the month followed by a space followed by the year followed by four spaces followed by the contents of the 6th column (8th field after splitting the date). The END clause prints the subscript for each value found along with the sum of the values accumulated for each subscript.

Translating the slashes in the date field to spaces allows the sort command to sort the output produced by awk on the various numeric components of the date and the original contents of the alphanumeric input file's 6th column. After sorting the output, the sed command converts the 1st two spaces on the output line back to slashes thereby restoring the date field to its original format.

The above script produces the output you said you wanted in the 1st message in this thread, except that some of the sums below (shown in red in the original post) were rounded differently than in your example:
Code:
1/1/2013    X1    1012.909698
1/1/2013    X2    600.8333588
1/2/2013    X1    844.2973022
1/2/2013    X2    833.9300537
1/3/2013    X1    563.6917419
1/3/2013    X2    632.0749969
1/4/2013    X1    48.33055687

Note that the log() calculations in the awk printf statement are there to calculate the varying number of decimal places you showed in your desired output. That printf statement could be simplified if you were willing to accept a constant number of digits after the decimal point in the printed sums.
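As a rough sketch of that simplification (not something Don posted), the printf could use a fixed precision. The function name sum_fixed and the choice of six decimal places are only assumptions for illustration; the rest of the pipeline is unchanged:

```shell
# Sketch: same grouping as above, but every sum is printed with a fixed
# six digits after the decimal point (6 is an arbitrary choice).
# sum_fixed is a hypothetical wrapper; it reads the named files, or
# standard input when no file is given.
sum_fixed() {
    awk '
    BEGIN { FS = "[ /\t]+"; OFS = "    "; s = " " }
    {       v[$1 s $2 s $3 OFS $8] += $5 }
    END {   for (i in v)
                    printf("%s%s%.6f\n", i, OFS, v[i])
    }' "$@" | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'
}
```

Calling it as sum_fixed file1 would then print every sum with the same number of decimals, at the cost of no longer matching the varying precision in the desired output.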

Alternatively, you could split the date field, sort the input into the desired output order, reform the date field in the sorted input and use the procedures outlined in the thread bakunin referenced. I haven't made any attempt to compare the efficiency of these alternative approaches.

Hope this helps,
Don
# 12  
Old 07-20-2013
perfect...

Could explain the code above Mr Don?
And in case we want to sum the value of column 4 (previously we sum column 3) ==> how about the code?

another case : we want to sum the value of column 3 but based on column 1 and column 7 (previously column 7 is column 6)

btw, so many thanks Master Don..

---------- Post updated at 01:56 AM ---------- Previous update was at 01:32 AM ----------

right now, i just do simple awk to move the column 7 to 6 and then i run your code, it works..but i would like to learn the master code of your Mr Don..eager to learn
# 13  
Old 07-20-2013
Quote:
Originally Posted by radius
but i would like to learn the master code of your Mr Don..eager to learn
That is a laudable attitude.

Quote:
Originally Posted by Don Cragun
Code:
awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'

First, to appreciate what each part of the above command (actually a pipeline of three different commands) does, you might want to redirect the output of each step into a file, examine it, and then run that file through the next step to see what that step does. I suggest you use a small input file so that the output is easy to take in and any changes are easy to notice. You can even use several slightly altered versions of an input file to see how they affect the outcome.

In short: these are only files, which you can copy as often as you like, so play around.

Code:
awk '
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}' file1 > tempfile1

sort -k3n,3 -k1n,1 -k2n,2 -k4,4 tempfile1 > tempfile2

sed 's# #/#;s# #/#' tempfile2 > tempfile3

Let us start with the last part. "sed" is a non-interactive text-editor. It gets a script containing changes it should make in a text file and then does these changes. Here, two change rules are in the script:

Code:
s# #/#
s# #/#

These are "substitution"-rules: they search for a pattern in the first part, then substitute it with what is in the last part:

Code:
s<delimiter><pattern-to-search-for><delimiter><replacement><delimiter>

Usually "/" is used as the delimiter, but since Don wanted "/" as the replacement text he would have had to escape it, so he went for "#" instead. Each rule replaces a space character with a "/". The rule appears twice because by default each rule substitutes only the first occurrence on a line, and he wanted to change the first two.
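To see this on a single line, here is a small sketch that feeds one line in the shape awk produces (the values are taken from Don's output above) through the same sed script. The second command, with the g flag, is only there for contrast and is not part of Don's pipeline:

```shell
# One line as it looks after awk and sort, before sed runs.
line='1 2 2013    X1    844.2973022'

# Each s command replaces only the first space it finds, so two of them
# turn the first two spaces into slashes and leave the rest alone.
restored=$(printf '%s\n' "$line" | sed 's# #/#;s# #/#')
printf '%s\n' "$restored"

# For contrast: the g flag replaces every space, which would also
# destroy the four-space column separators.
all=$(printf '%s\n' "$line" | sed 's# #/#g')
printf '%s\n' "$all"
```

The first result has the date back in M/D/Y form with the column spacing intact; the second shows why a global replacement is not wanted here.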

Code:
sort -k3n,3 -k1n,1 -k2n,2 -k4,4 tempfile1 > tempfile2

This sorts the output. I suggest you read the man pages of all the commands used, but the one for sort will explain the most here: Don constructs a sorting key for the date. As the date format is "M/D/Y", he first sorts on the year (field 3), then on the month (field 1), then on the day (field 2). Only then does he sort on field 4. All key parts except the last are sorted numerically.
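A quick way to check the date keys is to run them over the five dates Don asked about earlier in the thread, written with spaces instead of slashes the way the awk step emits them. This is only a sketch; the variable name sorted is arbitrary and the -k4,4 key is left out because these lines have no fourth field:

```shell
# Year first (field 3), then month (field 1), then day (field 2),
# all compared numerically.
sorted=$(printf '%s\n' '1 1 2013' '10 30 2012' '12 6 2012' '2 10 2012' '7 16 2013' |
    sort -k3n,3 -k1n,1 -k2n,2)
printf '%s\n' "$sorted"
```

The result is in true date order (2/10/2012 first, 7/16/2013 last), which is the second ordering Don offered and the one radius chose.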

Finally, the core piece: a really elegant awk script, which consists of three parts.

Code:
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}
{       v[$1 s $2 s $3 OFS $8] += $5 }
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}

awk processes input files line by line. The middle part:

Code:
{       v[$1 s $2 s $3 OFS $8] += $5 }

is what is executed for every line of the input file. It adds the value of the field to be summed to an element of an associative array, using the key value(s) as the array index. This way, lines with identical key values are summed automatically.
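This is also the line to change for the two variations radius asked about. Because splitting the date on "/" adds two fields, column N of the original file is field N+2 inside awk (column 3 is $5, column 6 is $8). The following is only a sketch under that assumption: sum_by, key and val are made-up names, and print is used instead of Don's printf, so the sums get awk's default number formatting.

```shell
# sum_by KEYCOL VALCOL [file...]
# KEYCOL and VALCOL are column numbers in the ORIGINAL file; 2 is added
# because splitting the date on "/" creates two extra fields.
sum_by() {
    keycol=$1
    valcol=$2
    shift 2
    awk -v key="$((keycol + 2))" -v val="$((valcol + 2))" '
    BEGIN { FS = "[ /\t]+"; OFS = "    "; s = " " }
    {       v[$1 s $2 s $3 OFS $key] += $val }
    # print uses the default number formatting, not the log()-based precision
    END {   for (i in v) print i, v[i] }
    ' "$@" | sort -k3n,3 -k1n,1 -k2n,2 -k4,4 | sed 's# #/#;s# #/#'
}
```

With this, sum_by 6 3 file1 corresponds to the original request, sum_by 6 4 file1 sums column 4 instead, and sum_by 7 3 file1 groups on column 7. In Don's script the same effect comes from changing $5 to $6, or $8 to $9.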

The first part:

Code:
BEGIN { FS = "[ /\t]+"
        OFS = "    "
        s = " "
}

is executed once before the first line of the input file is read. It sets the "Field Separator" (FS), the "Output Field Separator" (OFS), and a variable "s" that holds a single space. When you use "$1" (field 1) or "$2" (field 2) in an awk script, awk has to know where field 1 ends and field 2 begins. It does so by splitting the input line at field separators. By default, fields are separated by runs of spaces and tabs, but Don redefines FS here so that a slash also counts as a separator, which splits the date into month, day, and year fields.
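If you want to see the effect of that FS setting for yourself, a sketch like the following prints every field with its number. The sample line is made up; only its layout (a date followed by space- and tab-separated columns) matters:

```shell
# Print each field on its own line, prefixed with its field number,
# using the same FS as Don's script.
fields=$(printf '1/2/2013 a\t3.5\n' |
    awk 'BEGIN { FS = "[ /\t]+" } { for (n = 1; n <= NF; n++) print n ": " $n }')
printf '%s\n' "$fields"
```

The date comes out as fields 1, 2 and 3, so everything after it is shifted by two compared with the column numbers in the original file.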

The last part

Code:
END {   for(i in v)
                printf("%s%s%.*f\n",
                        i, OFS, 9 - int(log(v[i]) / log(10)), v[i])
}

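is executed once after the last line of the input has been processed. It is a simple for loop that prints, in a formatted way, every element of the associative array collected in the middle part. The order in which "for (i in v)" visits the elements is unspecified, which is why the output is piped through sort.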
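The only non-obvious bit in that printf is the precision argument, 9 - int(log(v[i]) / log(10)). As a sketch, the expression can be evaluated on its own for three of the sums from Don's output. This assumes the sums are positive, since log() of zero or a negative number is not usable here:

```shell
# For each value, compute the number of decimals the END block would
# ask printf for. int(log(x) / log(10)) is one less than the number of
# digits before the decimal point (for x >= 1), so larger sums get
# fewer decimals.
precisions=$(printf '%s\n' 1012.909698 600.8333588 48.33055687 |
    awk '{ printf("%s %d\n", $1, 9 - int(log($1) / log(10))) }')
printf '%s\n' "$precisions"
```

For these three values the computed counts are 6, 7 and 8 decimals, which matches the number of digits after the decimal point shown for them in the output above.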

I hope this helps.

bakunin
 