Need help regarding comparison between two files through UNIX script


 
# 1  
Old 06-28-2015

Hi All,
I know the UNIX commands, but I am not comfortable putting them together in a script. I need to compare two .txt files field by field and produce a mismatch report. As I am new to UNIX scripting, any help with the scenario below would be really appreciated.

We have one source mainframe .txt file (readable, pipe-delimited format) and one target HDFS .txt file (pipe-delimited format). I need to compare the two files field by field, not whole line against whole line, like this:
f1.first field with f2.first field
f1.second field with f2.second field
and so on. Please find the sample source and target files below.

f1 :
Code:
000001|     1|AQWWW|234,456.00  |     | 123456|     |41|abC| 0|xyZ|     |99900|0.00999|      |1|c|6|S|
000002|     2|11  4|1,234,456.99|     |      0|     |23|   |99|!  |     |00000|0.00000|      |2| |4|#|
000003|     3|!!@#$|0,000,001.10|     |      9|     | 0|XSW|12|  7|     |00100|0.00001|      |0|S|0| |
000004|     4|     |            |     |   3400|     | 2|r7!|72|xY1|     |01200|0.00045|      |9|W|3|2|

f2 :
Code:
000001|000001|AQWWW|234,456.00  |     |123456|     |41|abC|00|xyZ|     |99900|0.00999|      |1|c|6|S
000002|000002|11  4|1,234,456.99|     |0|     |23|   |99|!  |     |0|0|      |2| |4|#
000003|000003|!!@#$|0,000,001.10|     |9|     |00|XSW|12|  7|     |100|0.00001|      |0|S|0| 
000004|000004|     |            |     |3400|     |02|r7!|72|xY1|     |1200|0.00045|      |9|W|3|2

Once the field-by-field comparison of the two files is complete, we need a mismatch report containing the source/target count validation, the number of field-level source/target mismatches, and the corresponding mismatch details.

It would be helpful if the mismatch report looked like this:
Code:
<<Mismatch Report >>

Src & tgt Count Validation
Source Count :4
Target Count :4

Field lebel Src & Tgt Mismatches
No of Mismatches :0

Mismatch Details :
Key Column Value	Column Name / Index	Source Data	Target Data

Source data might contain leading spaces/zeros and extra precision that the target data does not have. These cases can be ignored in the mismatch report.
If anyone can help me with the above scenario, it would be really beneficial for me. Thanks!
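
A minimal sketch of how such differences could be ignored, assuming the intent is to trim surrounding blanks and to drop leading zeros and trailing decimal zeros from purely numeric fields. The `normalize_fields` helper and its `norm` function are hypothetical names used only for illustration:

```shell
# normalize_fields (hypothetical helper): reads pipe-delimited lines on stdin
# and prints them with every field normalized.
normalize_fields() {
    awk -F'|' -v OFS='|' '
    function norm(s) {
        gsub(/^[ \t]+|[ \t]+$/, "", s)            # trim leading/trailing blanks
        if (s ~ /^[0-9]+(\.[0-9]+)?$/) {          # plain unsigned number only
            sub(/^0+/, "", s)                     # drop leading zeros
            if (s ~ /\./) {                       # drop trailing decimal zeros
                sub(/0+$/, "", s)
                sub(/\.$/, "", s)
            }
            if (s == "" || s ~ /^\./) s = "0" s   # restore a single leading 0
        }
        return s
    }
    { for (i = 1; i <= NF; i++) $i = norm($i); print }
    '
}
```

Both files could be passed through this helper before comparing, so that `     2` and `000002`, or `0.00000` and `0`, end up identical. Fields containing commas or embedded blanks (such as `234,456.00` or `11  4`) are only trimmed.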
# 2  
Old 06-28-2015
This will find the mismatches:
Code:
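awk -F\| '
                # main input is file2: save its fields, then read the same-numbered line of FN (file1)
                {for (i=1; i<=NF; i++) TMP[i]=$i
                 getline <FN
                 # after getline, $0 and NF belong to the file1 line; TMP[] still holds the file2 fields
                 for (i=1; i<=NF; i++) {if (TMP[i] != $i) print "Mismatch in line", NR, ", field", i, ":", TMP[i], $i} }
' FN="file1" file2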
Mismatch in line 3 , field 3 : !!&#$ !!@#$
Mismatch in line 4 , field 1 : 05 000004

I added some mismatches for testing, as your two samples obviously don't differ except for the number of leading zeroes in some numbers. The mismatch report that you want needs some further info:
- Is the count lines (records) or fields?
- What is "field lebel Src & Tgt Mismatches"?
- What do you want to show up in the details section?

Last edited by RudiC; 06-28-2015 at 12:54 PM..
# 3  
Old 06-30-2015
Hi RudiC/All,
Please find my comments below:
count is lines (records) or fields? - I mean the total record count in each file.
What is "field lebel Src & Tgt Mismatches"? - The total mismatch count between the source and target files.
what do you want to show up in the details section? - In the details section I need the mismatch details: row number/key column value, column name/index, and the corresponding source and target values.

Please find two sample source and target files below:

Source file :
Code:
000101|     1|AQWWW|234,456.00  |     | 123456|     |41|abC| 0|xyZ|     |99900|0.00999|      |1|c|6|S|
000102|     2|11  4|1,234,456.99|     |      0|     |23|   |99|!  |     |00000|0.00000|      |2| |4|#|
000103|     3|!!@#$|0,000,001.10|     |      9|     | 0|XSW|12|  7|     |00100|0.00001|      |0|S|0| |
000104|     4|     |            |     |   3400|     | 2|r7!|72|xY1|     |01200|0.00045|      |9|W|3|2|

Target file :
Code:
000101|000001|AQWWW|234,456.00  |     |123455|     |41|abC|00|xyZ|     |99900|0.00999|      |1|c|6|S
000102|000002|11  4|1,234,456.99|     |0|     |24|   |99|!  |     |0|0|      |2| |4|#
000103|000004|!!@#$|0,000,001.10|     |9|     |00|XSW|12|  7|     |100|0.00001|      |0|S|0| 
000104|000004|     |            |     |3401|     |02|r7!|72|xY1|     |1200|0.00045|      |9|W|3|2

Once the script has compared the two files, I need the analysis report to show details like the following, based on the two sample source/target files above.

Code:
<<Analysis Report >>

Src & tgt Count Validation
Source Count :4
Target Count :4

Field lebel Src & Tgt Mismatches
No of Mismatches :4

Mismatch Details :
Key Column Value  Column Name  Source Data  Target Data
000101            6            123456       123455
000102            8            23           24
000103            2            3            000004
000104            6            3400         3401

The representation can be different, but we need the above details in the analysis report.
We could also run the script with the local directory as parameter 1, the source file as parameter 2, and the target file as parameter 3. Both files can be sorted on the 1st column, and the 1st column value should be unique.

Code:
#sh  test.sh  <Processing_directory> <sourcefile> <Targetfile>

If anyone can help me with the above scenario, that would be really beneficial for me. Thanks!
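
A rough sketch of what a key-based comparison with those three parameters might look like, assuming the first field is a unique key in both files. `compare_by_key` is a hypothetical function name; it matches records by key instead of by line number, so the files would not need to be sorted first:

```shell
# compare_by_key DIR SRC TGT  (hypothetical helper)
# Prints one tab-separated line per differing field: key, field index,
# source value, target value. Keys missing from the source are skipped.
compare_by_key() {
    (
        cd "$1" || exit 1
        awk -F'|' -v OFS='\t' '
        FILENAME == ARGV[1] { src[$1] = $0; next }   # remember source lines by key
        ($1 in src) {
            n = split(src[$1], s, "|")               # fields of the source line
            max = (n > NF) ? n : NF                  # cover the longer record
            for (i = 2; i <= max; i++)
                if (s[i] != $i) print $1, i, s[i], $i
        }
        ' "$2" "$3"
    )
}

# Possible use inside test.sh:
#   compare_by_key "$1" "$2" "$3"
```

The subshell keeps the `cd` from affecting the caller. Source values are printed before target values, matching the column order of the requested report.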

---------- Post updated 06-30-15 at 01:06 AM ---------- Previous update was 06-29-15 at 07:24 AM ----------

Hi All ,

I am looking forward to a solution to the above scenario. If anyone can help me in this regard, that would be really helpful. Thanks!

Last edited by STCET22; 06-30-2015 at 03:08 AM.. Reason: Add missing CODE tags.
# 4  
Old 06-30-2015
You could use:

Code:
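#!/bin/bash
# Compares file1 and file2 (pipe-delimited) field by field and prints a report.

echo "<<Analysis Report >>"
echo ""

echo "Src & tgt Count Validation"
echo "Source Count : $(wc -l < file1)"
echo "Target Count : $(wc -l < file2)"

# file2 is the main input; the same-numbered line of file1 is read via getline.
# For every differing field, print the key, the field index and both values.
awk -F\| 'BEGIN{ print "Mismatch in line", "Column Name", "Source Data", "Target Data"}
                {for (i=1; i<=NF; i++) arr[i]=$i
                 getline <FN
                 for (i=1; i<=NF; i++)
                                 {if (arr[i] != $i)
                                 print $1,i,arr[i], $i} }
' FN="file1" file2 > output.temp

echo ""
echo "Field lebel Src & Tgt Mismatches"
# Subtract 1 for the header line written by the BEGIN block.
echo "No of Mismatches : $(( $(wc -l < output.temp) - 1 ))"
echo ""
echo "Mismatch Details:"
cat output.temp
rm output.temp
exit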

This gives you the result in the format that you would like to use. It's an extension of RudiC's concept that you can modify as per your need.

HTH
# 5  
Old 06-30-2015
Why not cast everything into awk? Try
Code:
awk -F\| '
BEGIN           {print "<<Analysis Report >>\n\nSrc & tgt Count Validation"}

                # main input is file2: save its fields, then read the same-numbered line of FN (file1)
                {for (i=1; i<=NF; i++) TMP[i]=$i
                 if ((getline < FN) == 1) {SCNT++
                                         # TMP[i] holds the file2 value, $i the file1 value
                                         for (i=1; i<=NF; i++)
                                                {if (TMP[i] != $i) MMArr[++MMCNT]= $1 "\t" i "\t" TMP[i] "\t" $i}
                                        }
                }
                # count any lines left over in FN, then print the report
END             {while ((getline < FN) == 1) SCNT++
                 printf "Source Count : %d\nTarget Count : %d\n\nField lebel Src & Tgt Mismatches\nNo of Mismatches : %d\n\n", SCNT, NR, MMCNT
                 printf "Mismatch Details :\nKey Column Value\tColumn Name\tSource Data\tTarget Data\n"
                 for (i=1; i<=MMCNT; i++) print MMArr[i]
                }
' FN="file1" file2
<<Analysis Report >>

Src & tgt Count Validation
Source Count : 8
Target Count : 4

Field lebel Src & Tgt Mismatches
No of Mismatches : 4

Mismatch Details :
Key Column Value    Column Name    Source Data    Target Data
000101    6    123455     123456
000102    8    24    23
000103    2    000004         3
000104    6    3401       3400

Be aware that I added 4 lines to the source file to verify the algorithm...
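
The line-by-line pairing reports a difference in counts but not which records are extra. A small sketch for listing them, assuming the first field is a unique key as described in post #3; `unmatched_keys` is a hypothetical helper name:

```shell
# unmatched_keys A B  (hypothetical helper)
# Prints the first-field keys that occur in file A but not in file B.
unmatched_keys() {
    awk -F'|' '
    FILENAME == ARGV[1] { seen[$1]; next }   # first operand is B: collect its keys
    !($1 in seen)       { print $1 }         # second operand is A: report unknown keys
    ' "$2" "$1"
}

# Possible use:
#   unmatched_keys file1 file2   # keys only in the source
#   unmatched_keys file2 file1   # keys only in the target
```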
# 6  
Old 06-30-2015
Hi Mannu2525/RudiC,
Thanks a ton for your replies.
@Mannu2525,
When I run your script as shown below against my sample source and target files, I see a few issues in the analysis report.

Code:
./CompareScrpt.sh FILE3_S.txt FILE3_T.txt

1. The number of mismatches should be 4 instead of 8.
2. In the mismatch details, the 2nd, 4th, 6th and 8th rows should not appear in the analysis report. Please find the analysis report attached (Report.jpg) and look at the rows highlighted in red in the mismatch details section.
3. In the mismatch details section of the analysis report, the layout is not right: the values for column name, source data and target data are all left-aligned.

Code:
<<Analysis Report >>

Src & tgt Count Validation
Source Count : 4
Target Count : 4

Field lebel Src & Tgt Mismatches
No of Mismatches : 8

Mismatch Details:
Mismatch in line Column Name Source Data Target Data
000101 6 123455  123456
000101 20
000102 8 24 23
000102 20
000103 2 000004      3
000103 20
000104 6 3401    3400
000104 20

If you could kindly help me with the three issues above, it would be really helpful.

@RudiC ,

When I run your script, I see a few issues.
1. The source count comes out as 5; it should be 4.
2. The number of mismatches should be 4 instead of 8.
3. In the mismatch details, the 2nd, 4th, 6th and 8th rows should not appear in the analysis report.
4. In the mismatch details section of the analysis report, the layout is not right: the values for column name, source data and target data are all left-aligned.

Code:
<<Analysis Report >>

Src & tgt Count Validation
Source Count : 5
Target Count : 4

Field lebel Src & Tgt Mismatches
No of Mismatches : 8

Mismatch Details :
Key Column Value        Column Name     Source Data     Target Data
000101  6       123455   123456
000101  20
000102  8       24      23
000102  20
000103  2       000004       3
000103  20
000104  6       3401       3400
000104  20

It would be great if you could kindly look into the issues above. Thanks!
# 7  
Old 06-30-2015
The script was tested with the data you provided in post#3 and worked correctly. So, please check your input data for line count and field count.
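
One way such a check could be done is sketched below. The helper names `field_profile` and `count_cr_lines` are hypothetical. Invisible characters after the last delimiter, such as a carriage return from DOS line endings, would be one possible explanation for extra rows at field 20, but that is an assumption and not something confirmed in this thread:

```shell
# field_profile FILE  (hypothetical helper)
# Prints "<records> <fields>" pairs: how many lines have each field count.
field_profile() {
    awk -F'|' '{ n[NF]++ } END { for (k in n) print n[k], k }' "$1" | sort -k2,2n
}

# count_cr_lines FILE  (hypothetical helper)
# Prints the number of lines that end in a carriage return.
count_cr_lines() {
    awk '/\r$/ { c++ } END { print c + 0 }' "$1"
}

# Possible use:
#   wc -l FILE3_S.txt FILE3_T.txt
#   field_profile FILE3_S.txt
#   count_cr_lines FILE3_S.txt
```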
If you're not happy with the output format, wouldn't it be a great learning opportunity trying to adapt it yourself?
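
As a starting point for that, here is a minimal sketch that lines the detail rows up under their headers with fixed-width `printf` formats. It assumes tab-separated rows of key, column index, source value and target value on stdin; `format_details` is a hypothetical name and the widths are arbitrary:

```shell
# format_details (hypothetical helper): reads tab-separated mismatch rows
# on stdin and prints them as left-aligned, fixed-width columns with a header.
format_details() {
    awk -F'\t' '
    BEGIN { fmt = "%-18s%-13s%-16s%s\n"
            printf fmt, "Key Column Value", "Column Name", "Source Data", "Target Data" }
          { printf fmt, $1, $2, $3, $4 }
    '
}

# Possible use:
#   printf '000101\t6\t123456\t123455\n' | format_details
```

Changing `%-16s` to `%16s` would right-align a column instead.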