Sum Column Value


 
# 1  
Old 07-17-2014
Sum Column Value

I have a requirement to sum all records in one column of a file and get the total. There are 15 million records in the file, and the data has 2 decimal places.

I use the awk command below:
Code:
cut -d "," -f4 data.txt | awk '{ sum += $1 } END { printf("%.2f\n", sum) }'

I get the below answer
Code:
375173879877.18

But the expected answer is
Code:
375173879877.15

Please advise why there is a difference of 3 cents.

Do I need to use a format other than float?

Last edited by Franklin52; 07-18-2014 at 03:03 PM.. Reason: Please use code tags
# 2  
Old 07-17-2014
Check your math!

BTW, $0 is faster than $1, but why not skip the cut and let awk work on the 4th field directly?
Code:
awk -F, '{ sum += $4 } END { printf("%.2f\n", sum) }' < data.txt

Mike
# 3  
Old 07-17-2014
Thanks Mike.
But this does not solve my problem; I still get the 3 cent difference.

BTW, if I use bc by reading the entire file row by row, I get the correct answer, but it takes 6 hours to run, as there are 15 million records:


Code:
total=0
while IFS=, read -r f1 f2 f3 f4 rest; do
    total=`echo "scale=2;($total+$f4)/1"|bc`
done < data.txt
echo $total
# 4  
Old 07-17-2014
You may find the solution in this thread:
# 5  
Old 07-18-2014
Awk defaults to double, so it must be happening in the decimal to binary conversion. How many decimal places does your original data have?

Try multiplying by 10, 100, or 1000 before adding, and dividing at the end. Try rounding to a certain precision before adding.

Mike
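One sketch of that scaling idea, assuming every value in field 4 has exactly two decimal places (so deleting the "." is exactly a multiplication by 100). The sample data here is made up for illustration; for the real file you would run the awk command on data.txt instead of the printf pipe:

```shell
# Demo on made-up sample rows; field 4 is summed as whole cents.
printf '0,0,0,1.25\n0,0,0,0.05\n0,0,0,2.50\n' |
awk -F, '{ sub(/\./, "", $4)          # "123.45" -> "12345" (cents)
           cents += $4 }              # integer cents stay exact up to 2^53
     END { printf("%.2f\n", cents / 100) }'
```

Summing whole cents keeps every intermediate value an exact integer in the double, so the only rounding happens once, in the final division.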

---------- Post updated at 08:07 PM ---------- Previous update was at 07:33 PM ----------

I do floating-point addition, subtraction, and comparison in BASH all the time by using string manipulation (moving the decimal point) to convert the values to integers.

With a huge file like that, your key to speed is going to be avoiding the creation and closing of subshells in explicit or implicit loops that call non-built-in utilities. Stick with built-ins only and you will be fast, even with huge files. AWK does not have an integer type, so you're going to have to use BASH.

Using built-in string functions like printf, zero-pad everything out to the maximum precision, "multiply" by dropping the decimal point, then use integer addition, and at the end "divide" by adding the decimal point back in.

123.45 + 678.9 = (12345 + 67890)/100 All integer math in the loop!

Mike
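A minimal bash sketch of that integer-math loop. It assumes non-negative field-4 values with exactly two decimal places (as in the OP's data); the function name and sample rows are made up:

```shell
#!/usr/bin/env bash
# sum_field4: read CSV on stdin, sum the 4th field as whole cents,
# print the total with two decimal places -- integer math only, no subshells
# inside the loop.
sum_field4() {
    local total=0 a b c v rest int frac
    while IFS=, read -r a b c v rest; do
        int=${v%.*}                    # digits before the decimal point
        frac=${v#*.}                   # the two digits after it
        [ "$v" = "$int" ] && frac=00   # value had no decimal point
        total=$(( total + 10#${int:-0} * 100 + 10#$frac ))
    done
    printf '%d.%02d\n' $(( total / 100 )) $(( total % 100 ))
}

# usage: sum_field4 < data.txt
printf '0,0,0,1.25\n0,0,0,2.50\n0,0,0,0.05\n' | sum_field4   # prints 3.80
```

Note that even with no forks, a bash read loop over 15 million lines will still be far slower than a single awk pass; its advantage over the per-line bc approach is avoiding millions of subshells.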
# 6  
Old 07-18-2014
The standards say that awk uses double precision floating point arithmetic when performing arithmetic calculations. That type gives you 15 to 17 significant digits for any particular value, but since some decimal values aren't exactly representable in binary, adding sequences of floating point values can result in values that are not accurate to a full 15 significant digits.
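A quick way to see that effect (a made-up demo, not from this thread): 0.1 has no exact binary64 representation, so repeatedly adding it drifts away from the decimal answer, even though the rounded result still looks right.

```shell
# Ten additions of the double nearest to 0.1 do not give exactly 1;
# printing enough digits makes the drift visible.
awk 'BEGIN { for (i = 1; i <= 10; i++) s += 0.1
             printf("%.2f\n%.17f\n", s, s) }'
```

Rounded to two places the sum still prints as 1.00; the damage only shows at full precision, which is why a 15-million-row sum can end up off by a few cents.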

You didn't mention anything about the ranges of values in field 4 in your input file. If you are adding numbers like 375173879877.15 (appearing once) and .00001 (appearing 30,000 times) you need more than 17 significant digits when you add the large number and any of the small numbers.

If you need arbitrary precision decimal arithmetic, you need to use something like bc instead of awk to perform your calculations. Here is one way to do it using awk and bc:
Code:
awk '
BEGIN {	print "sum=0"}
{	print "sum += " $4
	asum += $4
}
END {	printf "sum\nscale=2\nsum=sum/1\nsum\n"
	print asum > "awk.out"
	printf("%.2f\n", asum) > "awk.out"
}' data.txt | bc

The code that computes and prints asum is not needed to use bc to perform the calculations; it is just here to demonstrate that awk has limited precision for arithmetic operations.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
If data.txt contains:
Code:
0 1 2 5
0 1 2 50
0 1 2 500
0 1 2 5000
0 1 2 50000
0 1 2 500000
0 1 2 5000000
0 1 2 50000000
0 1 2 500000000
0 1 2 5000000000
0 1 2 50000000000
0 1 2 500000000000
0 1 2 5000000000000
0 1 2 50000000000000
0 1 2 500000000000000
0 1 2 5000000000000000
0 1 2 50000000000000000
0 1 2 500000000000000000
0 1 2 5000000000000000000
0 1 2 50000000000000000000
0 1 2 500000000000000000000
0 1 2 .5
0 1 2 .05
0 1 2 .005
0 1 2 .0005
0 1 2 .00005
0 1 2 .000005
0 1 2 .0000005
0 1 2 .00000005
0 1 2 .000000005
0 1 2 .0000000005
0 1 2 .00000000005
0 1 2 .000000000005

the output produced (showing the results of using bc to perform the calculations, both with unlimited precision and truncated to two digits after the decimal point) is:
Code:
555555555555555555555.555555555555
555555555555555555555.55

and the results written to awk.out (showing the results of using awk to perform the calculations) are:
Code:
555555555555555540992
555555555555555540992.00

(which shows that with these input values, awk gets 16 digits at the start of the output correct).
These 2 Users Gave Thanks to Don Cragun For This Post:
# 7  
Old 07-18-2014
Thanks Mike, multiplying by 100 and then dividing by 100 works.


Quote:
Originally Posted by Michael Stora
Try multiplying by 10 or 100 or 1000 before the adding and dividing at the end.