Comparing 2 UNIX files

10-25-2015

Registered User

16, 0

Join Date: Sep 2014

Last Activity: 1 November 2015, 8:43 PM EST

Posts: 16

Thanks Given: 2

Thanked 0 Times in 0 Posts

Comparing 2 UNIX files

Hi,

I am planning to automate the comparison of data between few tables in 2 different databases ( Teradata and sql server).
Below is the approach which I think of. Please suggest any improvements/Modification :
1) The sql server file is having more records and I have to eliminate the duplicates based on the unique key and select only few sample records for comparison. The file is pipe delimited. I am thinking of using the sort -u command to eliminate the duplicates and then will select few sample records based on the unique key.
2) Fetch the unique key from the ssql file into a new input file and then using teradata BTEQ utility will export the records corresponding to the unique key in ssql file.
3) use comp command to do column wise comparison for the table. Is there any way I can highlight the columns not matching, like in excel we can compare the columns and hightlight the values having "FALSE" result.
4) This needs to be done for around 10 tables and the structure of the tables is similar in both the databases.

Thanks in Advance!

Rahul Raj

View Public Profile for Rahul Raj

Find all posts by Rahul Raj

10-25-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

I don't see how you get rid of all duplicates by selecting on a few samples???

Are you trying to compare databases? Or, are you trying to compare text files that have been extracted from those databases?

You say the sql server file is larger. How many records are in each database? How many fields are in each database?

If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?

Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

10-25-2015

Registered User

16, 0

Join Date: Sep 2014

Last Activity: 1 November 2015, 8:43 PM EST

Posts: 16

Thanks Given: 2

Thanked 0 Times in 0 Posts

Hi Don,

Thanks for your reply.

I don't see how you get rid of all duplicates by selecting on a few samples???
- The duplicates will be eliminated first and then sample records will be selected.

Are you trying to compare databases? Or, are you trying to compare text files that have been extracted from those databases?
- text files extracted from databases(around 10 tables)

You say the sql server file is larger. How many records are in each database? How many fields are in each database?
- each of the 10 tables have different structure. On an average around 10-15 records and I would select a sample of around 1000 records from each file.

If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?
- the records in each file would vary from around 10K - 10L records.Sample would be around 1000 records and each file would have somewhere around 1000 characters

Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)
- I can't provide the print of records due to the company policy but it would be something as below :

Code:

 File 1
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 File 2
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 result:
 want the result of the mismatched records in another file.

Rahul Raj

View Public Profile for Rahul Raj

Find all posts by Rahul Raj

10-25-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by Rahul Raj

OK. Each file contains 10 to 15 records and out of those 15 records you want to select 1000??????

Quote:

Originally Posted by Rahul Raj

If you extracted each entire database into a pipe delimited text file, how many records (lines) would appear in each file and how many characters would be in the longest line in each file?
- the records in each file would vary from around 10K - 10L records.Sample would be around 1000 records and each file would have somewhere around 1000 characters

10K usually means 10,000 or 10,240; I have no idea what 10L means.
And, if a sample file contains ~1,000 lines and the total file size is 1,000 characters; each line consists entirely of a <newline> line terminator and there is no data to compare??????

Quote:

Originally Posted by Rahul Raj

Please show us a sanitized sample of the records you want to compare and show us the output you want to produce from those input samples? (Please use CODE tags when you post the sample input and output files.)
- I can't provide the print of records due to the company policy but it would be something as below :

Code:

 File 1
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 File 2
 value1|unique_key|||value2(decimal)|value3(decimal)|||
 value1|unique_key|||value2|value3|||
  
 result:
 want the result of the mismatched records in another file.

Please show us some data with actual values similar to what would appear in your files and the actual results that should be produced from that sample data. If we can't see data that we can feed into scripts that might do what you are trying to do so we can see if the code we're suggesting produces the results you are trying to produce, you are asking us to write code with both hands tied behind our backs. I have absolutely no idea from the above how we are supposed to determine what you consider to be a mismatched record.

With what you have told us, the only thing I can suggest is to use the diff utility to compare pairs of files.

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

11-01-2015

Registered User

16, 0

Join Date: Sep 2014

Last Activity: 1 November 2015, 8:43 PM EST

Posts: 16

Thanks Given: 2

Thanked 0 Times in 0 Posts

Hi Don,

Thanks for your help and suggestions and apologies for the confussion. Hope the below will explain better :

SOURCE_FILE

Code:

 -----------------------------------
 NAME|CITY|PINCODE
 -----------------------------------
 JUIL | OPUT | 8347648
 KIMU| PETU | 0494744
 OPEY| UEYU| 8912374

TARGET_FILE

Code:

 -----------------------------------
 NAME|CITY|PINCODE
 -----------------------------------
 JUL | OPUT | 8347648
 KIU| PETU | 049444
 OPEY| UYU| 891237

Please find the mismatches report i want to generate

Code:

  NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
 ------------------------------------------
 JUIL  | PINCODE   |0494744     |049444
 OPEY|  CITY           |UEYU        |UYU
 OPEY| PINCODE   |8912374     |891237

Last edited by Don Cragun; 11-01-2015 at 10:19 PM.. Reason: Change ICODE tags to CODE tags, and add CODE tags.

Rahul Raj

View Public Profile for Rahul Raj

Find all posts by Rahul Raj

11-02-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

The way you formatted post #5, it isn't clear whether the field names are on line 1 of your input files (as it is on your output file) or if they are on line 2 of your input files. And, even after removing leading and trailing whitespace characters from all of the fields in your sample input files there are absolutely no matching fields in your sample data. The names on the first two data lines to not match and the data in the other fields on the last line do not match, so from your sample input, I would assume that you want output something like:

Code:

NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
------------------------------------------
JUL|CITY||OPUT
JUL|PINCODE||8347648
KIU|CITY||PETU
KIU|PINCODE||049444
OPEY|CITY|UEYU|UYU
OPEY|PINCODE|8912374|891237
JUIL|PINCODE|8347648|
KIMU|CITY|PETU|
JUIL|CITY|OPUT|
KIMU|PINCODE|0494744|

You said the order of the fields could vary between files. But, assuming that the 1st line in each file contains the headings and that the first field is always the field to be used as the matching field, the following seems to do what you want even if the other fields are in random order:

Code:

#!/bin/ksh
ec=0
IAm=${0##*/}
if [ $# -ne 2 ]
then	printf '%s: Incorrect number of operands (expected 2, found %d).\n' \
	    "$IAm" $# >&2
	ec=1
else	if [ ! -r "$1" ]
	then	printf '%s: SOURCE_FILE (%s) not found or not readable\n' \
		    "$IAm" "$1" >&2
		ec=$((ec + 2))
	fi
	if [ ! -r "$2" ]
	then	printf '%s: TARGET_FILE (%s) not found or not readable\n' \
		    "$IAm" "$2" >&2
		ec=$((ec + 4))
	fi
fi
if [ $ec -ne 0 ]
then	printf 'Usage: %s SOURCE_FILE TARGET_FILE\n' "$IAm" >&2
	exit $ec
fi
awk '
BEGIN {	print "NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE"
	print "------------------------------------------"
	FS = "[[:space:]]*[|][[:space:]]*"
	OFS = "|"
}
{	gsub(/^[[:space:]]*|[[:space:]]*$/, "")
}
FNR == 1 {
	f++
	for(i = 2; i <= NF; i++) {
		fn[f, i] = $i
	}
	next
}
f == 1 {for(i = 2; i <= NF; i++) {
		d[$1 OFS fn[1, i]] = $i
	}
	next
}
{	for(i = 2; i <= NF; i++) {
		if(($1 OFS fn[2, i]) in d) {
			if(d[$1 OFS fn[2, i]] != $i)
				print $1, fn[2, i], d[$1 OFS fn[2, i]], $i
			delete d[$1 OFS fn[2, i]]
		} else	print $1, fn[2, i], "", $i
	}
}
END {	for(i in d)
		print i, d[i], ""
}' "$1" "$2"

If you have the files Source1:

Code:

 NAME|PINCODE|CITY
 -----------------------------------
 JUIL | 8347648| OPUT 
 KIMU| 0494744| PETU 
 OPEY| 8912374| UEYU
 matching pin  |    12345678   |	San Jose	
 matching city  |    98765432|	San Jose

and Source2:

Code:

 NAME|CITY|PINCODE
 -----------------------------------
 JUIL | OPUT | 8347648
 KIMU| PETU | 0494744
 OPEY| UEYU| 8912374
 matching pin  |	San Jose, CA	|    12345678
 matching city  |	San Jose	|    12345679

(note that the order of the last two fields is switched between these files) and you run the script with:

Code:

./scriptname Source1 Source2

you get the output:

Code:

NAME|COLUMN_NAME|SOURCE_VALUE|TARGET_VALUE
------------------------------------------
matching pin|CITY|San Jose|San Jose, CA
matching city|PINCODE|98765432|12345679

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

UNIX for Dummies Questions & Answers

Comparing 2 UNIX files

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Comparing two files and list the difference with common first line content of both files

Discussion started by: sarathy_a35

2. UNIX for Beginners Questions & Answers

Comparing 2 variables in UNIX

Discussion started by: prince1987

3. Shell Programming and Scripting

Comparing two files in UNIX and create a new file similar to equi join

Discussion started by: satece

4. Shell Programming and Scripting

Need help in comparing two files in UNIX with a mismatch

Discussion started by: karthik adiga

5. Shell Programming and Scripting

Comparing Select Columns from two CSV files in UNIX and create a third file based on comparision

Discussion started by: ady_koolz

6. Shell Programming and Scripting

Comparing files in a directory against an array of files

Discussion started by: BudMan

7. UNIX for Advanced & Expert Users

How to find duplicates contents in a files by comparing other files?

Discussion started by: Venkatesh1

8. Shell Programming and Scripting

Comparing the matches in two files using awk when both files have their own field separators

Discussion started by: asyed

9. Shell Programming and Scripting

Need help comparing two files and deleting some things in those files!

Discussion started by: linuxkid

10. UNIX for Advanced & Expert Users

comparing shadow files with real files

Discussion started by: terrym