Match sum of values in each column with the corresponding column value present in trailer record


 
# 1  
Old 04-24-2015

Hi All,

I have a requirement to sum the values in columns D through O of a CSV file and check that each column's sum matches the value for that column in the trailer record.

For example, for column D, sum all the data records (excluding the header and trailer) and check whether that sum equals the column D value in the trailer record. The same check needs to be done for every column from D through O.

For this I have written a shell script (I know you experts could do it better, without so many temp files, but I am very new to shell scripting and have just applied my own approach).

The script behaves differently for each file. For pf_20150127.csv it works perfectly, because the temp files I am comparing contain the same values; please find attached a snapshot of the matching values in the temp files (Sum_Match.jpg).

If I run the same script for pf_20150325.csv, the values do not match. The trailer record value from the original file is displayed with 2 decimal places, while my sum output has no decimals. I don't understand whether it's a file problem or whether Unix has some internal mechanism that reads files and displays values differently. Please find attached the temp file outputs for this file (Sum_mismatch.jpg).

I believe it's not a file problem, so where is the problem in my script? How can I compare my sum with the value in the trailer record, regardless of whether the original trailer record has decimals or whole numbers?

I have attached the actual test CSV files mentioned above and the temp file outputs for both files. Please help me out, as I really need help and could not think of any other way of doing it. Please let me know if I have to change my design entirely to meet my requirement and, if so, please provide a solution.

Thanks in advance!

Code:
#!/usr/bin/sh
#
cd /var/datastage/FRPDEVL/work/source/landing/dspf
for fname in pf_*.csv;do
#Check for files existence in the corresponding directory and perform validation
if [ -f "$fname" ]
then
echo "Expected file(s) found, Performing Validations for file: "$fname
filename=`basename $fname`
fdate=`echo $filename|tr -dc '[:digit:]'`
echo $filename","$fdate

  #Validation 1: Sum of all the columns from D to O (numeric data type) respectively should be equal
  #to the value present in trailer row against the respective column.
  if [ $filename = 'pf_'$fdate'.csv' ]
  then
  echo "------------------------------------------------------------------------------------"
  echo "Checking Specific Validations 2 for File: $filename"
  echo "------------------------------------------------------------------------------------"
  #Trim Header and Trailer record and create temporary file temp1_$fdate.tmp
  sed '1d;$d' $filename >temp1_$fdate.tmp
  
  #Trim the trailer record only from original file and create another temporary file temp_original_$fdate.tmp
  #which will be used for comparison after finding sum from D to O column
  tail -1 $filename|cut -d "," -f 4->temp_original_$fdate.tmp
  
  #Perform sum from column D to O on temporary file temp1_$fdate.tmp and create another temporary file temp_sum_$fdate.tmp
  awk -F, -v OFS="," -v OFMT="%.2E" '{s1+=$4;s2+=$5;s3+=$6;s4+=$7;s5+=$8;s6+=$9;s7+=$10;s8+=$11;s9+=$12;s10+=$13;s11+=$14;s12+=$15}END{print s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12}' temp1_$fdate.tmp>temp_sum_$fdate.tmp
  #awk -F, -v OFS="," '{s1+=$4;s2+=$5;s3+=$6;s4+=$7;s5+=$8;s6+=$9;s7+=$10;s8+=$11;s9+=$12;s10+=$13;s11+=$14;s12+=$15}END{print s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12}' temp1_$fdate.tmp>temp_sum_$fdate.tmp
  
  #Now compare the sum that is present in trailer record of original file with that of the sum taken from Column D to O,
  #If both the values match in the two files, then the matching record will be printed and a count will be taken which will be 
  #always one. If both the data does not match then the count will be 0
  val=`awk 'NR==FNR{a[$0];next}$0 in a{print $0}' temp_original_$fdate.tmp temp_sum_$fdate.tmp|wc -l`
  
  #If $val is =0, which means the sum is not matching with Trailer record sum, hence kill the job
  if [ "$val" -eq "0" ]
  then
  echo "The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job"
  
  #If the validation fails remove all the temporary files before exiting from further processing
  #rm -f temp1_$fdate.tmp
  #rm -f temp_original_$fdate.tmp
  #rm -f temp_sum_$fdate.tmp
  
  #Exit with code 16, If the sums are not matching
  exit 16  
  else
  echo "Sums are matching"  
  fi  
  echo "------------------------------------------------------------------------------------"
  echo "Specific Validations check for File: $filename completed"
  echo "------------------------------------------------------------------------------------"  
  fi
  
  #Remove all temp files if all the validations pass
  #rm -f temp_$fdate.tmp
  #rm -f temp1_$fdate.tmp
  #rm -f temp_original_$fdate.tmp
  #rm -f temp_sum_$fdate.tmp

#If files are not there in landing directory, will not perform validations and exit with normal status  
else
	echo "Expected files not found, Hence not performing any validations"
	exit 0
fi

#End of main For loop
done

With Regards,
TPK
# 2  
Old 04-24-2015
I'm not sure I understand your problem correctly, but calculating the column sums in a spreadsheet (gnumeric) shows that in file pf_20150127.csv there's a discrepancy in column D, which this proposal
Code:
awk 'NR>1 {for (i=4; i<=NF; i++) SUM[i]+=$i} END {for (i=4; i<=NF; i++) {SUM[i]-=2*$i; if (SUM[i]) printf "Column %c: %.0f\n", i+64, SUM[i]}}' FS="," /tmp/pf_20150127.csv 
Column D: -71415152

shows as well. The other data file sums correctly, so it has no errors and produces no output.
# 3  
Old 04-25-2015
Hi Rudic,

I don't see any issue with pf_20150127.csv. The sums match the trailer record.

When I run the same script for pf_20150325.csv, it gives the wrong result. The sum I calculate for columns D through O has no decimal values, whereas the trailer record that I cut from the original file into a temp file shows decimal values, so when I compare my sum temp file with that temp file they do not match. I don't understand why Unix treats pf_20150127.csv one way and pf_20150325.csv another way by displaying decimal values.

I could not find the value you quoted in your comment. Which file is it from, and how did you arrive at it?

Please let me know which part of my problem is unclear and I would be happy to explain again. Please help me out.

Note: I hope you are calculating the sum of column D over D2:D592 in pf_20150127.csv and D2:D602 in pf_20150325.csv. Likewise, the sums should be calculated for E2:E592, F2:F592, G2:G592, H2:H592, ..., O2:O592 in pf_20150127.csv, and in the same manner for pf_20150325.csv.

With Regards,
TPK

---------- Post updated at 10:50 PM ---------- Previous update was at 11:44 AM ----------

Hi All,

Any Updates? Please help me out.

With Regards,
TPK

Last edited by tpk; 04-24-2015 at 01:55 PM.. Reason: Correction
# 4  
Old 04-27-2015
Hi All/Experts,

Let me state my requirement briefly and precisely. Please find below the requirement and the issue where I am stuck:


1. In the CSV files in the attached zip file, I want to exclude the header and trailer records.
2. After excluding the header and trailer, I want to calculate the sum of all rows in column D and check whether it matches the column D value in the trailer (last) record.
3. If the sum matches the trailer record, simply echo "All Correct"; if it does not match the trailer value for that column (column D in our case), echo "Sum does not match" and exit with return code 16.
4. The same process defined in points 1, 2 and 3 needs to be followed for columns E, F, G, H, I, J, K, L, M, N and O as well.
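
The four steps above can be sketched in a single awk pass with no temp files. This is only a sketch under stated assumptions, not the script from this thread: check_trailer is a hypothetical helper name, columns D through O are assumed to be fields 4 to 15, the first line is assumed to be the header and the last line the trailer, and the tolerance of 0.005 is an assumption so that 123 and 123.00 count as equal. The comparison is numeric, so the way the trailer is written (with or without decimals) should not matter.

```shell
# Hypothetical helper, not part of the original script.
# Usage: check_trailer FILE   (exit status 0 = all sums match, 16 = mismatch)
check_trailer() {
    awk -F, '
        NR == 1 { next }                        # skip the header record
        NR > 2  {                               # the previous line was a data row
            for (i = 4; i <= 15; i++) sum[i] += prev[i]
        }
        {                                       # remember this line; the last one is the trailer
            for (i = 4; i <= 15; i++) prev[i] = $i + 0
        }
        END {
            bad = 0
            for (i = 4; i <= 15; i++) {
                diff = sum[i] - prev[i]
                if (diff < 0) diff = -diff
                if (diff > 0.005) {             # assumed tolerance
                    printf "Column %c: sum %.2f != trailer %.2f\n", i + 64, sum[i], prev[i]
                    bad = 1
                }
            }
            if (bad) exit 16
        }' "$1"
}
```

Inside the for loop of the script it could be called as: check_trailer "$fname" || exit 16. Each mismatching column is reported on its own line before the non-zero exit status is returned.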

I hope the above is now clear. I have written some code for this, but I don't know why Unix behaves differently for each CSV file. When I run my code for pf_20150127.csv, the sum matches the trailer record in the file; when I run the same code for pf_20150325.csv, the sum does not match the trailer record.

For pf_20150325.csv, Unix reads the trailer record value with decimals, like XXXXXXXX.00, while the sum I calculate has no decimals.

For pf_20150127.csv, Unix reads the trailer record without decimal values, which I cannot understand.

Finally, I am confused and wanted to ask whether the kind of validation described in points 1 to 4 is possible in Unix shell scripting, or whether there is another way of doing it. Please help me with a solution in shell script only.

With Regards,
TPK

---------- Post updated 04-27-15 at 04:49 AM ---------- Previous update was 04-26-15 at 10:48 PM ----------

Hi All,

Any Updates? Please help me out.

With Regards,
TPK

Last edited by tpk; 04-27-2015 at 12:56 AM.. Reason: Correction
# 5  
Old 04-27-2015
Your comparison compares two whole lines as strings.
In order to cope with numbers in E notation, you need to compare them field by field, which normally means a for loop in awk.
But since each temp file contains only one line, setting RS="," turns each field into a record (line), so awk's automatic processing loop can be used instead.
Here is the section to be changed:
Code:
...
  val=`
awk '(NR==FNR){a[FNR]=$1+0;next} ($1+0!=a[FNR]){printf "col %.f: %.f != %.f\n",FNR,$1+0,a[FNR]}' RS=, temp_original_$fdate.tmp temp_sum_$fdate.tmp
`
  #If $val is not empty, the sum is not matching with Trailer record sum, hence kill the job
  if [ -n "$val" ]
  then
  echo "The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job"
  echo "$val"
...

This is more verbose: each mismatch is printed, and the output is empty only if all check-sums match.
+0 was added to make sure awk treats the fields as numbers, not strings.
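
To illustrate the point about numbers versus strings, here is a small sketch. The helper name same_number is hypothetical, and the sample values are the ones quoted in this thread. As text, 172928624441.00 and 172928624441 differ, but after adding 0 awk compares them as numbers and finds them equal.

```shell
# Hypothetical helper: succeeds (exit status 0) when both arguments
# are numerically equal, however they are written.
same_number() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !((a + 0) == (b + 0)) }'
}

# String comparison: the two spellings differ.
[ "172928624441.00" = "172928624441" ] || echo "different as strings"

# Numeric comparison: the same value.
same_number 172928624441.00 172928624441 && echo "equal as numbers"
```

A value that was already rounded when it was written, such as 1.73E+11, still compares unequal to 172928624441, so the sums have to be printed with enough digits in the first place.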
# 6  
Old 04-27-2015

Hi MadeInGermany,

Thank you for the solution!

The solution works for pf_20150127.csv, but when I ran the same script for pf_20150325.csv it failed with the error below:

Code:
Expected file(s) found, Performing Validations for file: pf_20150325.csv
pf_20150325.csv,20150325
------------------------------------------------------------------------------------
Checking Specific Validations 2 for File: pf_20150325.csv
------------------------------------------------------------------------------------
The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job
Errors: col 1: 173000000000 != 172928624441

So I checked the temp files and found that temp_original_20150325.tmp, into which I initially cut the trailer record from the original CSV file, reads as below:

Code:
cat temp_original_20150325.tmp

172928624441.00,334431290.00,346417133.00,354231936.00,443777494.00,526288959.00,769941370.00,918420217.00,1274200675.00,1067695005.00,1122762029.00,1181290201.00

And when I ran cat on the sum temp file temp_sum_20150325.tmp, it showed the calculated sums as below:

Code:
cat temp_sum_20150325.tmp

1.73E+11,334431290,346417133,354231936,443777494,526288959,769941370,918420217,1274200675,1067695005,1122762029,1181290201

I opened the CSV file in Excel, and the trailer record value for column D is 1.72929E+11; when I summed the rows under column D in Excel, excluding the header and trailer, the result was the same as the trailer record, 1.72929E+11. I don't understand why Unix reads the trailer record differently from the original file.

So the check fails because the temp_original file and the temp_sum file differ. I don't understand why the original temp file stores the values as XXXXXXXXXX.00. How can we make the code generic so that, whatever value is present in the trailer record and regardless of e or E notation, my sum is calculated accordingly? Please help me out.

With Regards,
TPK

Last edited by tpk; 04-27-2015 at 11:13 AM.. Reason: Correction
# 7  
Old 04-27-2015
The 1.73E+11 = 173,000,000,000 in the last line of the file appears to be a rounding error.
The 1.72929E+11 = 172,929,000,000 is better but still not the exact 172,928,584,848.
So I guess that Unix awk, which uses double precision, is more precise than the other tool (Excel?).
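
One plausible source of the three-digit value is the OFMT="%.2E" setting in the original awk command, which asks for two digits after the decimal point in E notation; the 1.72929E+11 seen in Excel looks like the same number displayed with five. The sketch below only demonstrates the formats; show_formats is a hypothetical helper, and the input is the column D trailer value quoted in post #6.

```shell
# Hypothetical helper: print one value in three different formats.
show_formats() {
    awk -v v="$1" 'BEGIN {
        printf "%.2E\n", v    # two decimals in E notation, the format OFMT="%.2E" requests
        printf "%.5E\n", v    # five decimals in E notation, like the Excel display
        printf "%.2f\n", v    # fixed notation with two decimals, like the trailer
    }'
}
```

Running show_formats 172928624441 prints 1.73E+11, 1.72929E+11 and 172928624441.00, one per line. The stored number is the same in all three cases; only the printed text differs, which is why writing the sums with an explicit printf "%.2f" (or comparing numerically) avoids the mismatch.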
 