Match the sum of values in each column with the corresponding column value in the trailer record
Hi All,
I have a requirement where I need to find the sum of the values in columns D through O of a CSV file and check whether the sum of each individual column matches the value for that corresponding column in the trailer record.
For example, take column D: excluding the Header and Trailer in the csv, find the sum of all data records in column D and check whether that sum equals the value in column D of the trailer record. The same process needs to be done for all columns from D through O.
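In outline, that check fits in a single awk pass. A minimal sketch only, not the poster's script: the four-line sample file and the name pf_sample.csv are made up, and field 4 stands in for column D:

```shell
# Create a tiny made-up sample: header, two data rows, trailer.
cat > pf_sample.csv <<'EOF'
HDR,h,h,hd
r1,x,y,10
r2,x,y,20
TRL,t,t,30
EOF

# Sum field 4 over the data rows only, then compare with the trailer value.
result=$(awk -F, '
NR == 1 { next }          # skip the header record
NR > 2  { sum += prev }   # the buffered value belongs to a data row
{ prev = $4 }             # buffer the current row; the last one buffered is the trailer
END {
    if (sum == prev + 0) print "All Correct"
    else printf "Sum does not match: %.0f != %.0f\n", sum, prev + 0
}' pf_sample.csv)
echo "$result"
```

Buffering one row lets the trailer stay out of the running total without knowing the line count in advance; the same pass generalizes to fields 4 through 15 (columns D through O) with an array instead of a single sum.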
For this I have developed a shell script which does the same (I know you experts can do it in a better way instead of creating so many temp files, but as I am very new to shell scripting I have just applied my thought in my own way).
This shell script is behaving differently for each file. For file pf_20150127.csv it works perfectly because the temp files which I am comparing give the same results; PFA the snapshot of the matching values (Sum_Match.jpg) in the temp files.
If I execute the same script for file pf_20150325.csv, the counts do not match. The trailer record value in the original file is now displayed with 2 decimal places, while my sum output has no decimal values. I don't understand whether it's a file problem or whether unix has some internal mechanism which reads files and displays values in a different manner. PFA the temp file outputs of this file (Sum_mismatch.jpg).
I believe it's not a file problem, so where is the problem in my script? How can I read and compare the sum with the value in the trailer record irrespective of whether the original trailer record has decimals or whole numbers?
I have attached the actual test csv files which I have mentioned and the temp file outputs of both files. Please help me out as I am in real need of help and I could not think of any other way of doing it. Please suggest if I have to change my design entirely to achieve my requirement; if yes, please provide me the solution.
I'm not sure I understand your problem correctly, but calculating column sums in a spreadsheet (gnumeric) shows that in file pf_20150127.csv there's a discrepancy in column D.
I don't get any issue with pf_20150127.csv. The sums match those in the trailer record.
When I execute the same shell script for pf_20150325.csv it gives wrong results. The sums I compute for columns D through O have no decimal values, whereas the trailer record, which I cut from the original file into a temp file, displays decimal values, so when I compare my sum temp file with that temp file they do not match. I don't understand why unix treats the earlier pf_20150127.csv file one way and the pf_20150325.csv file another way by displaying decimal values.
I could not find the value you have quoted in your comment; which file is it from, and how have you arrived at that value?
Please let me know which part of my problem is unclear; I would be happy to explain again. Please help me out.
Note: I hope you are calculating the sum of column D from D2:D592 in pf_20150127.csv and D2:D602 in pf_20150325.csv. Likewise we should calculate E2:E592, F2:F592, G2:G592, H2:H592 ... O2:O592 for the pf_20150127.csv file, and in the same manner for the pf_20150325.csv file.
With Regards,
TPK
---------- Post updated at 10:50 PM ---------- Previous update was at 11:44 AM ----------
Hi All,
Any Updates? Please help me out.
With Regards,
TPK
Last edited by tpk; 04-24-2015 at 01:55 PM..
Reason: Correction
Let me put my requirement short and precise. PFB my requirement and the issue where I am stuck:
1. In the attached zip file, I want to read the csv files and exclude the Header and Trailer records.
2. After excluding the Header and Trailer, I want to calculate the sum of all rows under column D and check whether the sum matches the value in the Trailer record (the last record) under the same column D.
3. If the sum matches the trailer record value, simply echo "All Correct"; else, if the sum does not match the trailer record value under the same column (in our case column D), echo "Sum does not match" and exit with return code 16.
4. The same process defined in points 1, 2, 3 needs to be followed for columns E, F, G, H, I, J, K, L, M, N, O as well.
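The four steps above can be sketched in one awk pass over all columns at once. This is only an illustration under assumptions, not the poster's script: the sample file is invented, and fields 4 and 5 stand in for columns D through O (a real run would use first=4 and last=15):

```shell
# Invented sample: header, two data rows, trailer; fields 4 and 5 carry the sums.
cat > pf_test.csv <<'EOF'
HDR,h,h,hd,he
r1,x,y,10,1.5
r2,x,y,20,2.5
TRL,t,t,30,4.00
EOF

# Accumulate per-column sums over the data rows, then compare each with the
# trailer record numerically (the +0 makes "4.00" and 4 compare equal).
mismatch=$(awk -F, -v first=4 -v last=5 '
NR == 1 { next }
NR > 2  { for (i = first; i <= last; i++) sum[i] += prev[i] }
{ for (i = first; i <= last; i++) prev[i] = $i }
END {
    for (i = first; i <= last; i++)
        if (sum[i] != prev[i] + 0)
            printf "col %d: %.2f != %.2f\n", i, sum[i], prev[i] + 0
}' pf_test.csv)

if [ -n "$mismatch" ]; then
    verdict="Sum does not match"
else
    verdict="All Correct"
fi
echo "$verdict"
# In the real job: [ "$verdict" = "All Correct" ] || exit 16
```

Collecting all mismatches into one variable means a single test decides between "All Correct" and the return-code-16 exit, and the per-column detail is still available for the error message.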
I hope the above is now clear. For this I have written some code, but I don't know why unix behaves differently for each csv file. When I execute my code for the pf_20150127.csv file the sum matches the trailer record in the file correctly; when I execute the same code for the pf_20150325.csv file the sum does not match the trailer record.
For the pf_20150325.csv file, unix reads the trailer record value with decimals like XXXXXXXX.00, and the sum I am calculating doesn't have any decimal values.
For the pf_20150127.csv file, unix reads the trailer record without decimal values, which I quite cannot understand.
Finally, I am confused and want to ask whether the kind of validation described in points 1, 2, 3, 4 above is possible in unix shell scripting, or whether there is another way of doing it. Please help me with a shell-script solution only.
With Regards,
TPK
---------- Post updated 04-27-15 at 04:49 AM ---------- Previous update was 04-26-15 at 10:48 PM ----------
Hi All,
Any Updates? Please help me out.
With Regards,
TPK
Last edited by tpk; 04-27-2015 at 12:56 AM..
Reason: Correction
Your comparison compares two whole lines as strings.
In order to cope with numbers in E notation, you need to compare them field by field - you need a for loop in awk.
Knowing there is only one line, and by setting RS="," each field becomes a record (line), so the automatic processing loop can be used.
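The RS trick can be seen in isolation (a throwaway example, not from the thread's files; here RS is set with -v, which has the same effect as the trailing RS=, assignment used below):

```shell
# With the record separator set to a comma, each field of the single line
# becomes its own record, so NR ends up counting fields.
count=$(printf '10,20,30' | awk -v RS=, 'END { print NR }')
echo "$count"   # 3
```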
Here is the section to be changed:
Code:
...
val=`
awk '
  (NR==FNR) {a[FNR]=$1+0; next}    # first file: remember each trailer value
  ($1+0 != a[FNR]) {printf "col %.f: %.f != %.f\n", FNR, $1+0, a[FNR]}
' RS=, temp_original_20150127.tmp temp_sum_20150127.tmp
`
# If $val is not empty, the sum is not matching with the Trailer record sum, hence kill the job
if [ -n "$val" ]
then
    echo "The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job"
    echo "$val"
...
Verbosity is increased, and the output is checked for being empty (which it is if all check-sums match).
+0 was added to make sure awk treats the fields as numbers, not strings.
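The effect of the +0 can be seen in isolation (made-up pair of values: the same digits with and without the trailing .00, mirroring the decimals problem above):

```shell
a="172928624441"; b="172928624441.00"
# As strings the two differ; forcing numeric context with +0 makes them equal.
streq=$([ "$a" = "$b" ] && echo same || echo different)
numeq=$(awk -v x="$a" -v y="$b" 'BEGIN { print ((x + 0 == y + 0) ? "equal" : "not equal") }')
echo "$streq / $numeq"   # different / equal
```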
The solution is working for the pf_20150127.csv file; I ran the same script for the pf_20150325.csv file and it failed with the below error:
Code:
Expected file(s) found, Performing Validations for file: pf_20150325.csv
pf_20150325.csv,20150325
------------------------------------------------------------------------------------
Checking Specific Validations 2 for File: pf_20150325.csv
------------------------------------------------------------------------------------
The sum of either or all columns is not matching with last row sum value of corresponding column. Hence exiting the Job
Errors: col 1: 173000000000 != 172928624441
So, I checked the temp files and found that the temp file temp_original_20150325.tmp, into which I initially cut the trailer record from the original csv file, is being read as below
I checked the csv file by opening it in excel, and the value in the trailer record for column D is 1.72929E+11; when I summed the rows under column D, excluding Header and Trailer, in excel, it also turned out to be the same as the trailer record, 1.72929E+11. I don't understand why unix reads the trailer record differently from the original file.
So, as there is a difference between the temp_original file and the temp_sum file, the check fails. I don't understand why the original temp file stores the values as XXXXXXXXXX.00. How can we make the code generic, so that whatever value is present in the trailer record, irrespective of e or E notation, my sum is compared accordingly? Please help me out.
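On the generic question: awk already parses E notation as an ordinary number, so forcing numeric context with +0 and printing with printf "%.0f" yields the plain integer form regardless of how the trailer was written. A small demonstration using the value quoted above:

```shell
# "1.72929E+11" + 0 converts the E-notation string to a double;
# %.0f prints it without an exponent or decimals.
full=$(awk 'BEGIN { printf "%.0f", "1.72929E+11" + 0 }')
echo "$full"   # 172929000000
```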
With Regards,
TPK
Last edited by tpk; 04-27-2015 at 11:13 AM..
Reason: Correction
The 1.73E+11 = 173,000,000,000 in the last line of the file appears to be a rounding error.
The 1.72929E+11 = 172,929,000,000 is better but still not the exact 172,928,584,848.
So I guess that unix awk, which uses double precision, is more precise than the other tool (Excel?).
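The gap can be checked directly, using the two numbers quoted above:

```shell
# Difference between the rounded E-notation value and the exact total:
gap=$(awk 'BEGIN { printf "%.0f", 1.72929E+11 - 172928584848 }')
echo "$gap"   # 415152
```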