Q: Howto compare 2 files


 
# 1  
Old 09-03-2012

Greetings,

I made an extraction from 2 different databases. I need to compare those extractions to find out what is in database1 but not in database2, and vice versa.

The files contain only numbers. Each line is a single number, which should be present in both files. If it's not, I want to know which number is missing from which file.

Working on Linux (Red Hat), I tried compare / sdiff etc., but all those tools seem to compare line number X of file 1 to line number X of file 2 instead of checking the whole file.


Here is my output :

File1 :
Code:
123456123
234561234
345612345
456123456

File2 :
Code:
234561234
345612345
123456123

So here I'd like to know that 456123456 is not present in File2 (and get the output in a third file).

Note: I've got 15 million lines to deal with, so a simple cat | while read and grep script is too slow.


Big thanks to anyone who can help me with this.

Last edited by Scott; 09-03-2012 at 07:12 PM.. Reason: Code tags
# 2  
Old 09-03-2012
Maybe you could try grep -vf file2 file1, or some such.

What kind of "database" does this come from? A relational one? i.e. could you not do select X from Y where X not in (select X from Z)?
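
A note on the grep idea: as written, grep -vf treats every line of file2 as a regular expression and also matches substrings, so a short number could hide a longer one that contains it. A minimal sketch, assuming a grep with the POSIX -F and -x options and using the sample data from post #1; the scratch directory and the output file names are made up for illustration:

```shell
# Scratch directory and sample data from post #1 (names are illustrative).
workdir=$(mktemp -d)
printf '%s\n' 123456123 234561234 345612345 456123456 > "$workdir/file1"
printf '%s\n' 234561234 345612345 123456123 > "$workdir/file2"

# -F: patterns are fixed strings, not regular expressions
# -x: a pattern must match the whole line
# -v: print the lines that do NOT match
# -f: read the patterns from a file
# grep exits with status 1 when it prints nothing, hence the "|| true".
grep -Fxv -f "$workdir/file2" "$workdir/file1" > "$workdir/only_in_file1" || true
grep -Fxv -f "$workdir/file1" "$workdir/file2" > "$workdir/only_in_file2" || true

cat "$workdir/only_in_file1"
```

With 15 million patterns this may still need a lot of memory, so it is probably worth trying on a sample first.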
# 3  
Old 09-03-2012
Do you only want to find lines that are in file1 that are not in file2, or do you also want to find lines that are in file2 that are not in file1?

---------- Post updated at 04:42 PM ---------- Previous update was at 04:00 PM ----------

If you can't perform a select as Scott suggested, there are at least a couple of fairly straight-forward ways of handling this. The easiest may well be the best:
Code:
sort -n File1 > sFile1
sort -n File2 > sFile2
diff sFile[12]

although with 15,000,000 lines you may have to read the man page for your sort utility to find out how to specify a file system with enough space for the temporary files that will be required.
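
If only the list of missing numbers is wanted (rather than diff's change markers), comm on the sorted files is one option. A sketch, assuming GNU sort (for -T, which names the directory used for temporary files) and the sample data from post #1; the scratch directory and output file names are illustrative only. Note that comm expects plain lexical order, so this uses sort without -n in the C locale:

```shell
# Scratch directory and sample data from post #1 (names are illustrative).
workdir=$(mktemp -d)
printf '%s\n' 123456123 234561234 345612345 456123456 > "$workdir/File1"
printf '%s\n' 234561234 345612345 123456123 > "$workdir/File2"

# Sort both files the same way; -T tells sort where to put its temporary files.
LC_ALL=C sort -T "$workdir" "$workdir/File1" > "$workdir/sFile1"
LC_ALL=C sort -T "$workdir" "$workdir/File2" > "$workdir/sFile2"

# comm prints three columns: only in file 1, only in file 2, in both.
# -23 suppresses columns 2 and 3, leaving the lines found only in sFile1.
LC_ALL=C comm -23 "$workdir/sFile1" "$workdir/sFile2" > "$workdir/only_in_File1"
# -13 suppresses columns 1 and 3, leaving the lines found only in sFile2.
LC_ALL=C comm -13 "$workdir/sFile1" "$workdir/sFile2" > "$workdir/only_in_File2"

cat "$workdir/only_in_File1"
```

For the real data, point -T at a file system with enough free space instead of the scratch directory.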

Another way would be to create an associative array in awk of the entries found in File2 and then read the entries in File1 and report entries that aren't in the array you built while reading File2. (If you need the entries in File2 that aren't in File1 as well, you could mark or delete the entries you found on the first pass and make another pass to print the entries that weren't matched.)
# 4  
Old 09-04-2012
Quote:
Originally Posted by Scott
Maybe you could try grep -vf file2 file1, or some such.

What kind of "database" does this come from? A relational one? i.e. could you not do select X from Y where X not in (select X from Z)?
2 different databases on different servers. I only have access to one; the other belongs to another service. I just printed what I needed with awk so that both files are formatted the same way.

I'll try that.

As for sorting the files, well, I thought about it, but some entries will be missing in the middle of one file and the lines will be out of sync again, so diff wouldn't work. I think.

And yes Don, I also need lines that are present in file2 but not in file1.

How would you make that associative array using awk? This sounds interesting.

I'll let you know how it goes, meh.

---------- Post updated at 06:44 AM ---------- Previous update was at 01:19 AM ----------

Quote:
Originally Posted by Don Cragun
Do you only want to find lines that are in file1 that are not in file2, or do you also want to find lines that are in file2 that are not in file1?

---------- Post updated at 04:42 PM ---------- Previous update was at 04:00 PM ----------

If you can't perform a select as Scott suggested, there are at least a couple of fairly straight-forward ways of handling this. The easiest may well be the best:
Code:
sort -n File1 > sFile1
sort -n File2 > sFile2
diff sFile[12]

Sort then diff seems to do the trick, although I get some weird output in the middle column. > and < I understand, but what does a ' | ' mean?

Thx
# 5  
Old 09-04-2012
Save the following in a file named diffs.awk:
Code:
#!/bin/ksh
if [ $# -ne 2 ]
then
        printf "Usage: %s f1 f2\n    Two file operands are required.\n" "$(basename "$0")"
        exit 1
fi
awk 'FNR==1{
        if(NR==1) f1 = FILENAME
        else    f2 = FILENAME
}
f2=="" {# Make an entry for this record from the 1st file in array c1:
        c1[$0]
        next
}
 {      # We are now in the 2nd file.  Look for matching entry from 1st file.
        if($0 in c1) {
                # Matching entry found.  Delete the entry from the list of
                # unmatched entries from the 1st file.
                delete c1[$0]
                next
        }
        # No match found...
        if(c2only++ == 0) printf("The following entries are only in %s:\n", f2)
        print
}
END {   # Any entries remaining in c1 are present only in the 1st file.
        for(i in c1) {
                if(c1only++ == 0)
                        printf("The following entries are only in %s:\n", f1)
                print i
        }
        printf("%d unmatched entries found in %s\n", c1only, f1)
        printf("%d unmatched entries found in %s\n", c2only, f2)
}' "$@"

and make it executable:
Code:
chmod +x diffs.awk

then, for your example, invoke it with:
Code:
./diffs.awk File1 File2

With File1 and File2 as shown in the first message in this thread, the output produced is:
Code:
The following entries are only in File1:
456123456
1 unmatched entries found in File1
0 unmatched entries found in File2

# 6  
Old 09-04-2012
I like the awk approach, but bear in mind that it will need the files in the same sorted order too.

The diff is really a non-starter (even with the files sorted) because the output from diff includes the context of the difference.


Assuming:
Database1 = The local database on Computer1 over which you have some control.
Database2 = The remote database on Computer2 over which you have little control.

The DBA's approach would be to:
1) Extract only the field needed from Database2 to a flat file.
2) Copy the extract file from Computer2 to Computer1.
3) Load the Database2 extract file into a temporary table in Database1 with the number field as the primary key.
4) Assuming that the same field is an indexed field in Database1, use a basic SQL program to compare the two files. This would only need to "seek" the records and should be very fast. See the SQL idea in post #2 of this thread.
# 7  
Old 09-04-2012
Quote:
Originally Posted by methyl
I like the awk approach, but bear in mind that it will need the files in the same sorted order too.

... ... ...
No. The awk script I posted doesn't care about the order of entries in File1 or File2. The only assumption the script makes is that no line in either file will be duplicated in either file.
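
That no-duplicates assumption is easy to check before running the script. A small sketch with awk, using made-up sample data that contains one repeated number; the scratch directory and file names are hypothetical:

```shell
# Scratch directory and made-up sample data with one duplicated entry.
workdir=$(mktemp -d)
printf '%s\n' 123456123 234561234 123456123 345612345 > "$workdir/File1"

# seen[$0]++ yields 0 the first time a line appears and 1 the second time,
# so each duplicated line is printed exactly once, whatever the file order.
awk 'seen[$0]++ == 1' "$workdir/File1" > "$workdir/dups_in_File1"

if [ -s "$workdir/dups_in_File1" ]; then
        echo "File1 contains duplicated lines:"
        cat "$workdir/dups_in_File1"
fi
```

Run the same check on the second file; if both come back empty, the assumption holds.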
 