Difference between two huge files


 
# 8  
Old 09-14-2008
Hi,

Yes, bdiff is not behaving as I wanted. I am using Sun Solaris.

As you suggested, the best approach will be to write a script. As I am a novice at shell scripting, could you please help me work out how to approach this problem? Are there any sample scripts available?

Thanks again

Sue
# 9  
Old 09-14-2008
Assume we have a file called "big". Then we copy "big" to a file called "little", except that we delete some lines. In that case we can display the missing lines with:
Code:
#! /usr/bin/ksh
exec < little                  # stdin = the smaller file
exec 4< big                    # fd 4  = the larger file
IFS=""                         # keep leading/trailing whitespace intact
while read line1 ; do          # for each line of "little"...
        match=0
        while ((!match)) ; do  # ...read "big" until the same line turns up
                read -u4 line2
                if [[ "$line1" = "$line2" ]] ; then
                        match=1
                else
                        echo "$line2"   # in "big" but not in "little"
                fi
        done
done
while read -u4 line2 ; do      # whatever is left of "big" is also missing
        echo "$line2"
done
exit 0
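
For anyone who wants to try it, here is a minimal test run. The script reads files literally named "big" and "little" in the current directory; the script name missing.ksh is just an example:
Code:
# hypothetical test: "little" is "big" with two lines deleted
printf 'a\nb\nc\nd\n' > big
printf 'a\nc\n'       > little
chmod +x missing.ksh
./missing.ksh              # prints the deleted lines: b and d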

# 10  
Old 09-15-2008
Hi,

Your script seems to be working correctly. The problem I am facing now is that the output file keeps growing, as there is no end-of-file check in the script.

How can we exit immediately after comparing those two files?

Thanks a ton

Sue
# 11  
Old 09-15-2008
I don't understand. It does exit upon reaching end-of-file.
# 12  
Old 09-16-2008
Hi,

Try the code below... hope it will work.

Code:
awk 'FILENAME=="file1" {arr[$0]++}
FILENAME=="file2" {if ($0 in arr) {next} else {print $0}}' file1 file2
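
As a quick sanity check (the file names and contents here are only an illustration), this prints the lines of file2 that never appear in file1:
Code:
printf 'a\nb\nc\n'     > file1
printf 'b\nx\nc\ny\n'  > file2
awk 'FILENAME=="file1" {arr[$0]++}
FILENAME=="file2" {if ($0 in arr) {next} else {print $0}}' file1 file2
# expected output:
# x
# y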
# 13  
Old 09-16-2008
Quote:
Originally Posted by pyaranoid
Hi,

Your script seems to be working correctly. The problem I am facing now is that the output file keeps growing, as there is no end-of-file check in the script.

How can we exit immediately after comparing those two files?

Thanks a ton

Sue
Looks like your "little" file is not a proper subset of your "big" file. If you have lines in your "little" file which are not in your "big" file, Perderabo's script will never exit but will continue to increase the size of the diff file. The simple fix is to check the exit status after each read, i.e.:
Code:
      read -u4 line2
      if (( $? != 0 ))
      then
          # print $line1
          break
      fi

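For clarity, here is one way that check might be spliced into the inner loop of the script from post # 9 (just a sketch; as the next reply points out, the output can still be wrong when "little" has extra lines):
Code:
        while ((!match)) ; do
                read -u4 line2
                if (( $? != 0 )) ; then    # "big" ran out before a match was found
                        break              # stop instead of looping forever
                fi
                if [[ "$line1" = "$line2" ]] ; then
                        match=1
                else
                        echo "$line2"
                fi
        done
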
# 14  
Old 09-16-2008
If the little file has any line at all which is not present in the big file, that condition will send the original script into an infinite loop. fpmurphy's fix stops the infinite loop, but the output is probably still wrong. After encountering such a line, my script will output the remainder of the big file. If the little file has a copy of the big file's final line, and all extra lines in the little file follow this final line, I guess it's ok.

BTW, if the little file has any extra lines, it is not a subset at all. A proper subset would have fewer lines. An "improper" or "non-proper" subset would be an exact copy of all lines.
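
To make that failure mode concrete (the file contents below are invented purely for illustration, and missing.ksh is the hypothetical name used earlier for the script in post # 9):
Code:
# "little" contains a line that "big" does not have
printf 'a\nb\n' > big
printf 'z\n'    > little
# ./missing.ksh        # without the exit-status check, the inner loop echoes
                       # a and b, then read -u4 keeps failing, "$line2" stays
                       # empty, never matches "z", and empty lines are printed
                       # forever -- which is why the output file kept growing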
 