Match the sum of values in each column with the corresponding column value in the trailer record
Hi All,
I have a requirement where I need to find the sum of the values in columns D through O of a CSV file and check whether the sum of each individual column matches the value for that corresponding column in the trailer record.
For example, take column D: excluding the Header and Trailer in the csv, find the sum of all data records in column D and check whether that sum equals the value in column D of the trailer record. The same process needs to be done for all columns from D through O.
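In outline, that check fits in a single awk pass. A minimal sketch only, not the poster's script: the four-line sample file and the name pf_sample.csv are made up, and field 4 stands in for column D:

```shell
# Create a tiny made-up sample: header, two data rows, trailer.
cat > pf_sample.csv <<'EOF'
HDR,h,h,hd
r1,x,y,10
r2,x,y,20
TRL,t,t,30
EOF

# Sum field 4 over the data rows only, then compare with the trailer value.
result=$(awk -F, '
NR == 1 { next }          # skip the header record
NR > 2  { sum += prev }   # the buffered value belongs to a data row
{ prev = $4 }             # buffer the current row; the last one buffered is the trailer
END {
    if (sum == prev + 0) print "All Correct"
    else printf "Sum does not match: %.0f != %.0f\n", sum, prev + 0
}' pf_sample.csv)
echo "$result"
```

Buffering one row lets the trailer stay out of the running total without knowing the line count in advance; the same pass generalizes to fields 4 through 15 (columns D through O) with an array instead of a single sum.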
For this I have developed a shell script which does the same (I know you experts can do it in a better way instead of creating so many temp files, but as I am very new to shell scripting I have just applied my thought in my own way).
This shell script is behaving differently for each file. For file pf_20150127.csv it works perfectly because the temp files which I am comparing give the same results; PFA the snapshot of the matching values (Sum_Match.jpg) in the temp files.
If I execute the same script for file pf_20150325.csv, the counts do not match. The trailer record value in the original file is now displayed with 2 decimal places, while my sum output has no decimal values. I don't understand whether it's a file problem or whether unix has some internal mechanism which reads files and displays values in a different manner. PFA the temp file outputs of this file (Sum_mismatch.jpg).
I believe it's not a file problem, so where is the problem in my script? How can I read and compare the sum with the value in the trailer record irrespective of whether the original trailer record has decimals or whole numbers?
I have attached the actual test csv files which I have mentioned and the temp file outputs of both files. Please help me out as I am in real need of help and I could not think of any other way of doing it. Please suggest if I have to change my design entirely to achieve my requirement; if yes, please provide me the solution.
I'm not sure I understand your problem correctly, but calculating column sums in a spreadsheet (gnumeric) shows that in file pf_20150127.csv there's a discrepancy in column D.
I don't get any issue with pf_20150127.csv. The sums match those in the trailer record.
When I execute the same shell script for pf_20150325.csv it gives wrong results. The sums I compute for columns D through O have no decimal values, whereas the trailer record, which I cut from the original file into a temp file, displays decimal values, so when I compare my sum temp file with that temp file they do not match. I don't understand why unix treats the earlier pf_20150127.csv file one way and the pf_20150325.csv file another way by displaying decimal values.
I could not find the value you have quoted in your comment; which file is it from, and how have you arrived at that value?
Please let me know which part of my problem is unclear; I would be happy to explain again. Please help me out.
Note: I hope you are calculating the sum of column D from D2:D592 in pf_20150127.csv and D2:D602 in pf_20150325.csv. Likewise we should calculate E2:E592, F2:F592, G2:G592, H2:H592 ... O2:O592 for the pf_20150127.csv file, and in the same manner for the pf_20150325.csv file.
With Regards,
TPK
---------- Post updated at 10:50 PM ---------- Previous update was at 11:44 AM ----------
Hi All,
Any Updates? Please help me out.
With Regards,
TPK
Last edited by tpk; 04-24-2015 at 01:55 PM..
Reason: Correction
Let me put my requirement short and precise. PFB my requirement and the issue where I am stuck:
1. In the attached zip file, I want to read the csv files and exclude the Header and Trailer records.
2. After excluding the Header and Trailer, I want to calculate the sum of all rows under column D and check whether the sum matches the value in the Trailer record (the last record) under the same column D.
3. If the sum matches the trailer record value, simply echo "All Correct"; else, if the sum does not match the trailer record value under the same column (in our case column D), echo "Sum does not match" and exit with return code 16.
4. The same process defined in points 1, 2, 3 needs to be followed for columns E, F, G, H, I, J, K, L, M, N, O as well.
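The four steps above can be sketched in one awk pass over all columns at once. This is only an illustration under assumptions, not the poster's script: the sample file is invented, and fields 4 and 5 stand in for columns D through O (a real run would use first=4 and last=15):

```shell
# Invented sample: header, two data rows, trailer; fields 4 and 5 carry the sums.
cat > pf_test.csv <<'EOF'
HDR,h,h,hd,he
r1,x,y,10,1.5
r2,x,y,20,2.5
TRL,t,t,30,4.00
EOF

# Accumulate per-column sums over the data rows, then compare each with the
# trailer record numerically (the +0 makes "4.00" and 4 compare equal).
mismatch=$(awk -F, -v first=4 -v last=5 '
NR == 1 { next }
NR > 2  { for (i = first; i <= last; i++) sum[i] += prev[i] }
{ for (i = first; i <= last; i++) prev[i] = $i }
END {
    for (i = first; i <= last; i++)
        if (sum[i] != prev[i] + 0)
            printf "col %d: %.2f != %.2f\n", i, sum[i], prev[i] + 0
}' pf_test.csv)

if [ -n "$mismatch" ]; then
    verdict="Sum does not match"
else
    verdict="All Correct"
fi
echo "$verdict"
# In the real job: [ "$verdict" = "All Correct" ] || exit 16
```

Collecting all mismatches into one variable means a single test decides between "All Correct" and the return-code-16 exit, and the per-column detail is still available for the error message.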
I hope the above is now clear. For this I have written some code, but I don't know why unix behaves differently for each csv file. When I execute my code for the pf_20150127.csv file the sum matches the trailer record in the file correctly; when I execute the same code for the pf_20150325.csv file the sum does not match the trailer record.
For the pf_20150325.csv file, unix reads the trailer record value with decimals like XXXXXXXX.00, and the sum I am calculating doesn't have any decimal values.
For the pf_20150127.csv file, unix reads the trailer record without decimal values, which I quite cannot understand.
Finally, I am confused and want to ask whether the kind of validation described in points 1, 2, 3, 4 above is possible in unix shell scripting, or whether there is another way of doing it. Please help me with a shell-script solution only.
With Regards,
TPK
---------- Post updated 04-27-15 at 04:49 AM ---------- Previous update was 04-26-15 at 10:48 PM ----------
Hi All,
Any Updates? Please help me out.
With Regards,
TPK
Last edited by tpk; 04-27-2015 at 12:56 AM..
Reason: Correction
Your comparison compares two whole lines as strings.
In order to cope with numbers in E notation, you need to compare them field by field - you need a for loop in awk.
Knowing there is only one line, and by setting RS="," each field becomes a record (line), so the automatic processing loop can be used.
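The RS trick can be seen in isolation (a throwaway example, not from the thread's files; here RS is set with -v, which has the same effect as the trailing RS=, assignment used below):

```shell
# With the record separator set to a comma, each field of the single line
# becomes its own record, so NR ends up counting fields.
count=$(printf '10,20,30' | awk -v RS=, 'END { print NR }')
echo "$count"   # 3
```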
Here is the section to be changed:
Code:
...
val=`
awk '
  (NR==FNR) {a[FNR]=$1+0; next}    # first file: remember each trailer value
  ($1+0 != a[FNR]) {printf "col %.f: %.f != %.f\n", FNR, $1+0, a[FNR]}
' RS=, temp_original_20150127.tmp temp_sum_20150127.tmp
`
# If $val is not empty, the sum is not matching with the Trailer record sum, hence kill the job
if [ -n "$val" ]
then
    echo "The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job"
    echo "$val"
...
Verbosity is increased, and the output is checked for being empty (which it is if all check-sums match).
+0 was added to make sure awk treats the fields as numbers, not strings.
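The effect of the +0 can be seen in isolation (made-up pair of values: the same digits with and without the trailing .00, mirroring the decimals problem above):

```shell
a="172928624441"; b="172928624441.00"
# As strings the two differ; forcing numeric context with +0 makes them equal.
streq=$([ "$a" = "$b" ] && echo same || echo different)
numeq=$(awk -v x="$a" -v y="$b" 'BEGIN { print ((x + 0 == y + 0) ? "equal" : "not equal") }')
echo "$streq / $numeq"   # different / equal
```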
The solution is working for the pf_20150127.csv file; I ran the same script for the pf_20150325.csv file and it failed with the below error:
Code:
Expected file(s) found, Performing Validations for file: pf_20150325.csv
pf_20150325.csv,20150325
------------------------------------------------------------------------------------
Checking Specific Validations 2 for File: pf_20150325.csv
------------------------------------------------------------------------------------
The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job
Errors: col 1: 173000000000 != 172928624441
So, I checked the temp files and found that the temp file temp_original_20150325.tmp, into which I initially cut the trailer record from the original csv file, is being read as below
I checked the csv file by opening it in excel, and the value in the trailer record for column D is 1.72929E+11; when I summed the rows under column D, excluding Header and Trailer, in excel, it also turned out to be the same as the trailer record, 1.72929E+11. I don't understand why unix reads the trailer record differently from the original file.
So, as there is a difference between the temp_original file and the temp_sum file, the check fails. I don't understand why the original temp file stores the values as XXXXXXXXXX.00. How can we make the code generic, so that whatever value is present in the trailer record, irrespective of e or E notation, my sum is compared accordingly? Please help me out.
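On the generic question: awk already parses E notation as an ordinary number, so forcing numeric context with +0 and printing with printf "%.0f" yields the plain integer form regardless of how the trailer was written. A small demonstration using the value quoted above:

```shell
# "1.72929E+11" + 0 converts the E-notation string to a double;
# %.0f prints it without an exponent or decimals.
full=$(awk 'BEGIN { printf "%.0f", "1.72929E+11" + 0 }')
echo "$full"   # 172929000000
```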
With Regards,
TPK
Last edited by tpk; 04-27-2015 at 11:13 AM..
Reason: Correction
The 1.73E+11 = 173,000,000,000 in the last line of the file appears to be a rounding error.
The 1.72929E+11 = 172,929,000,000 is better but still not the exact 172,928,584,848.
So I guess that unix awk, which uses double precision, is more precise than the other tool (Excel?).
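The gap can be checked directly, using the two numbers quoted above:

```shell
# Difference between the rounded E-notation value and the exact total:
gap=$(awk 'BEGIN { printf "%.0f", 1.72929E+11 - 172928584848 }')
echo "$gap"   # 415152
```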