Comparing 2 UNIX files


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Comparing 2 UNIX files
# 1  
Old 10-25-2015
Comparing 2 UNIX files

Hi,

I am planning to automate the comparison of data between few tables in 2 different databases ( Teradata and sql server).
Below is the approach which I think of. Please suggest any improvements/Modification :
1) The sql server file is having more records and I have to eliminate the duplicates based on the unique key and select only few sample records for comparison. The file is pipe delimited. I am thinking of using the sort -u command to eliminate the duplicates and then will select few sample records based on the unique key.
2) Fetch the unique key from the ssql file into a new input file and then using teradata BTEQ utility will export the records corresponding to the unique key in ssql file.
3) use comp command to do column wise comparison for the table. Is there any way I can highlight the columns not matching, like in excel we can compare the columns and hightlight the values having "FALSE" result.
4) This needs to be done for around 10 tables and the structure of the tables is similar in both the databases.


Thanks in Advance!
# 2  
Old 10-25-2015
I don't see how you get rid of all duplicates by selecting on a few samples???

Are you trying to compare databases? Or, are you trying to compare text files that have been extracted from those databases?

You say the sql server file is larger. How many records are in each database? How many fields are in each database?

If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?

Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)
# 3  
Old 10-25-2015
Hi Don,

Thanks for your reply.

I don't see how you get rid of all duplicates by selecting on a few samples???
- The duplicates will be eliminated first and then sample records will be selected.

Are you trying to compare databases? Or, are you trying to compare text files that have been extracted from those databases?
- text files extracted from databases(around 10 tables)

You say the sql server file is larger. How many records are in each database? How many fields are in each database?
- each of the 10 tables have different structure. On an average around 10-15 records and I would select a sample of around 1000 records from each file.

If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?
- the records in each file would vary from around 10K - 10L records.Sample would be around 1000 records and each file would have somewhere around 1000 characters

Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)
- I can't provide the print of records due to the company policy but it would be something as below :

Code:
 File 1
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 File 2
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 result:
 want the result of the mismatched records in another file.

# 4  
Old 10-25-2015
Quote:
Originally Posted by Rahul Raj
Hi Don,

Thanks for your reply.

I don't see how you get rid of all duplicates by selecting on a few samples???
- The duplicates will be eliminated first and then sample records will be selected.

Are you trying to compare databases? Or, are you trying to compare text files that have been extracted from those databases?
- text files extracted from databases(around 10 tables)

You say the sql server file is larger. How many records are in each database? How many fields are in each database?
- each of the 10 tables have different structure. On an average around 10-15 records and I would select a sample of around 1000 records from each file.
OK. Each file contains 10 to 15 records and out of those 15 records you want to select 1000??????
Quote:
Originally Posted by Rahul Raj
If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?
- the records in each file would vary from around 10K - 10L records.Sample would be around 1000 records and each file would have somewhere around 1000 characters
10K usually means 10,000 or 10,240; I have no idea what 10L means.
And, if a sample file contains ~1,000 lines and the total file size is 1,000 characters; each line consists entirely of a <newline> line terminator and there is no data to compare??????
Quote:
Originally Posted by Rahul Raj
Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)
- I can't provide the print of records due to the company policy but it would be something as below :

Code:
 File 1
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 File 2
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 result:
 want the result of the mismatched records in another file.

Please show us some data with actual values similar to what would appear in your files and the actual results that should be produced from that sample data. If we can't see data that we can feed into scripts that might do what you are trying to do so we can see if the code we're suggesting produces the results you are trying to produce, you are asking us to write code with both hands tied behind our backs. I have absolutely no idea from the above how we are supposed to determine what you consider to be a mismatched record.

With what you have told us, the only thing I can suggest is to use the diff utility to compare pairs of files.
# 5  
Old 11-01-2015
Hi Don,

Thanks for your help and suggestions and apologies for the confussion. Hope the below will explain better :



SOURCE_FILE
Code:
 -----------------------------------
 NAME|CITY|PINCODE
 -----------------------------------
 JUIL | OPUT | 8347648
 KIMU| PETU | 0494744
 OPEY| UEYU| 8912374


TARGET_FILE

Code:
 -----------------------------------
 NAME|CITY|PINCODE
 -----------------------------------
 JUL | OPUT | 8347648
 KIU| PETU | 049444
 OPEY| UYU| 891237



Please find the mismatches report i want to generate

Code:
  NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
 ------------------------------------------
 JUIL  | PINCODE   |0494744     |049444
 OPEY|  CITY           |UEYU        |UYU
 OPEY| PINCODE   |8912374     |891237


Last edited by Don Cragun; 11-01-2015 at 10:19 PM.. Reason: Change ICODE tags to CODE tags, and add CODE tags.
# 6  
Old 11-02-2015
The way you formatted post #5, it isn't clear whether the field names are on line 1 of your input files (as it is on your output file) or if they are on line 2 of your input files. And, even after removing leading and trailing whitespace characters from all of the fields in your sample input files there are absolutely no matching fields in your sample data. The names on the first two data lines to not match and the data in the other fields on the last line do not match, so from your sample input, I would assume that you want output something like:
Code:
NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
------------------------------------------
JUL|CITY||OPUT
JUL|PINCODE||8347648
KIU|CITY||PETU
KIU|PINCODE||049444
OPEY|CITY|UEYU|UYU
OPEY|PINCODE|8912374|891237
JUIL|PINCODE|8347648|
KIMU|CITY|PETU|
JUIL|CITY|OPUT|
KIMU|PINCODE|0494744|

You said the order of the fields could vary between files. But, assuming that the 1st line in each file contains the headings and that the first field is always the field to be used as the matching field, the following seems to do what you want even if the other fields are in random order:
Code:
#!/bin/ksh
ec=0
IAm=${0##*/}
if [ $# -ne 2 ]
then	printf '%s: Incorrect number of operands (expected 2, found %d).\n' \
	    "$IAm" $# >&2
	ec=1
else	if [ ! -r "$1" ]
	then	printf '%s: SOURCE_FILE (%s) not found or not readable\n' \
		    "$IAm" "$1" >&2
		ec=$((ec + 2))
	fi
	if [ ! -r "$2" ]
	then	printf '%s: TARGET_FILE (%s) not found or not readable\n' \
		    "$IAm" "$2" >&2
		ec=$((ec + 4))
	fi
fi
if [ $ec -ne 0 ]
then	printf 'Usage: %s SOURCE_FILE TARGET_FILE\n' "$IAm" >&2
	exit $ec
fi
awk '
BEGIN {	print "NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE"
	print "------------------------------------------"
	FS = "[[:space:]]*[|][[:space:]]*"
	OFS = "|"
}
{	gsub(/^[[:space:]]*|[[:space:]]*$/, "")
}
FNR == 1 {
	f++
	for(i = 2; i <= NF; i++) {
		fn[f, i] = $i
	}
	next
}
f == 1 {for(i = 2; i <= NF; i++) {
		d[$1 OFS fn[1, i]] = $i
	}
	next
}
{	for(i = 2; i <= NF; i++) {
		if(($1 OFS fn[2, i]) in d) {
			if(d[$1 OFS fn[2, i]] != $i)
				print $1, fn[2, i], d[$1 OFS fn[2, i]], $i
			delete d[$1 OFS fn[2, i]]
		} else	print $1, fn[2, i], "", $i
	}
}
END {	for(i in d)
		print i, d[i], ""
}' "$1" "$2"

If you have the files Source1:
Code:
 NAME|PINCODE|CITY
 -----------------------------------
 JUIL | 8347648| OPUT 
 KIMU| 0494744| PETU 
 OPEY| 8912374| UEYU
 matching pin  |    12345678   |	San Jose	
 matching city  |    98765432|	San Jose

and Source2:
Code:
 NAME|CITY|PINCODE
 -----------------------------------
 JUIL | OPUT | 8347648
 KIMU| PETU | 0494744
 OPEY| UEYU| 8912374
 matching pin  |	San Jose, CA	|    12345678
 matching city  |	San Jose	|    12345679

(note that the order of the last two fields is switched between these files) and you run the script with:
Code:
./scriptname Source1 Source2

you get the output:
Code:
NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
------------------------------------------
matching pin|CITY|San Jose|San Jose, CA
matching city|PINCODE|98765432|12345679

 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Comparing two files and list the difference with common first line content of both files

I have two file as given below which shows the ACL permissions of each file. I need to compare the source file with target file and list down the difference as specified below in required output. Can someone help me on this ? Source File ************* # file: /local/test_1 # owner: own #... (4 Replies)
Discussion started by: sarathy_a35
4 Replies

2. UNIX for Beginners Questions & Answers

Comparing 2 variables in UNIX

Hi, I have 2 variables as given below. How can i compare them and say its matching ? Appreciate your help VAR1=describe/read/write VAR2=read/write/describeThanks, Please use CODE tags as required by forum rules! (4 Replies)
Discussion started by: prince1987
4 Replies

3. Shell Programming and Scripting

Comparing two files in UNIX and create a new file similar to equi join

I have 2 files namely branch.txt file & RXD.txt file as below Ex:Branch.txt ========================= B1,Branchname1,city,country B2,Branchname2,city,country B3,Branchname3,city,country B4,Branchname4,city,country B5,Branchname5,city,country RXD file : will... (11 Replies)
Discussion started by: satece
11 Replies

4. Shell Programming and Scripting

Need help in comparing two files in UNIX with a mismatch

Hi Everyone, I am comparing results of two environments using unix files. I am writing two different csv file using spool and merging both the files using paste command paste -d, file1.csv file2.csv > prod_uat_result.csv and then finding the difference and attaching the same in a mail... (8 Replies)
Discussion started by: karthik adiga
8 Replies

5. Shell Programming and Scripting

Comparing Select Columns from two CSV files in UNIX and create a third file based on comparision

Hi , I want to compare first 3 columns of File A and File B and create a new file File C which will have all rows from File B and will include rows that are present in File A and not in File B based on First 3 column comparison. Thanks in advance for your help. File A A,B,C,45,46... (2 Replies)
Discussion started by: ady_koolz
2 Replies

6. Shell Programming and Scripting

Comparing files in a directory against an array of files

I hope I can explain this correctly. I am using Bash-4.2 for my shell. I have a group of file names held in an array. I want to compare the names in this array against the names of files currently present in a directory. If the file does not exist in the directory, that is not a problem.... (5 Replies)
Discussion started by: BudMan
5 Replies

7. UNIX for Advanced & Expert Users

How to find duplicates contents in a files by comparing other files?

Hi Guys , we have one directory ...in that directory all files will be set on each day.. files must have header ,contents ,footer.. i wants to compare the header,contents,footer ..if its same means display an error message as 'files contents same' (7 Replies)
Discussion started by: Venkatesh1
7 Replies

8. Shell Programming and Scripting

Comparing the matches in two files using awk when both files have their own field separators

I've two files with data like below: file1.txt: AAA,Apples,123 BBB,Bananas,124 CCC,Carrot,125 file2.txt: Store1|AAA|123|11 Store2|BBB|124|23 Store3|CCC|125|57 Store4|DDD|126|38 So,the field separator in file1.txt is a comma and in file2.txt,it is | Now,the output should be... (2 Replies)
Discussion started by: asyed
2 Replies

9. Shell Programming and Scripting

Need help comparing two files and deleting some things in those files!

So I have two files: File1 pictures.txt 1.1 1.3 dance.txt 1.2 1.4 treehouse.txt 1.3 1.5 File2 pictures.txt 1.5 ref2313 1.4 ref2345 1.3 ref5432 1.2 ref4244 dance.txt 1.6 ref2342 1.5 ref2352 1.4 ref0695 1.3 ref5738 1.2 ref4948 1.1 treehouse.txt 1.6 ref8573 1.5 ref3284 1.4 ref5838... (24 Replies)
Discussion started by: linuxkid
24 Replies

10. UNIX for Advanced & Expert Users

comparing shadow files with real files

Hi I need to compare shadow file sizes with their real file counterparts. If the shadow file size differs form the realfile size then it must send a mail. My problem is that our system has over 1600 shadowfiles in different directories, with different names. the only consistancy is the .sh file... (4 Replies)
Discussion started by: terrym
4 Replies
Login or Register to Ask a Question