Best way to diff two huge directory trees | Unix Linux Forums | Shell Programming and Scripting

Best way to diff two huge directory trees

    #1  
Old 08-12-2008
same1290

Hi

I have a job that will run nightly incremental backups of a large directory tree.

I did the initial backup, now I want to write a script to verify that all the files were transferred correctly. I did something like this which works in principle on small trees:

diff -r -q $src_dir $dst_dir >& diffreport.txt

The problem with this is that it is very slow. The directory I am backing up is about 2 TB.

I also tried using find and sum to dump the checksums to two files, one for the source directory and one for the destination, and then comparing them. This is the command I used:

find $src_dir -type f -print0 | xargs -0 sum > src_dir_checksums.txt
find $dst_dir -type f -print0 | xargs -0 sum > dst_dir_checksums.txt
diff src_dir_checksums.txt dst_dir_checksums.txt

But for some reason this lists the files in a different order for the two directories, which are on different machines.

Any help would be greatly appreciated.

Thanks in advance,
Sam
    #2  
Old 08-12-2008
danmero, Forum Advisor
Try rsync; you can google "rsync incremental backup".
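
For the verification step specifically, rsync can also be run as a read-only comparison. This is only a sketch, not necessarily what was meant above: the function name list_rsync_differences is made up for illustration, and it assumes rsync is installed on both ends. The flags are -r (recurse), -c (compare by checksum), -n (dry run, nothing is copied) and -i (itemize each difference).

```shell
#!/bin/sh
# list_rsync_differences SRC DST  (hypothetical helper name)
# Lists the files rsync would copy from SRC to DST without changing anything.
list_rsync_differences() {
    # -r recurse, -c compare by checksum, -n dry run, -i itemize differences.
    # The trailing slash on SRC means "the contents of SRC".
    rsync -rcni "$1/" "$2/"
}

# Example (paths are placeholders):
#   list_rsync_differences "$src_dir" "backuphost:$dst_dir" > diffreport.txt
```

With -c every file is still read in full on both sides, but only checksums cross the network. Dropping -c falls back to rsync's default size-and-modification-time check, which is much faster but less thorough.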
    #3  
Old 08-12-2008
Ikon, Forum Advisor
What about just comparing the output of du for each tree?


Code:
# cd /path/to/directory
# du
16      ./somedir
7200    ./somedir/1
1200    ./somedir/2
80      ./someotherdir
14512   .

This won't check that the files are exact copies, but it would verify the total size of the files in each directory.
    #4  
Old 08-12-2008
same1290
Hi,
Thanks for your reply.

I actually already compared the sizes using du. They're quite similar but not the same. I think that may be because the sizes of the directory entries are also part of the total, and those differ between the two machines where the trees are stored (that's just a guess).

So I think I need something more reliable.

Sam
    #5  
Old 08-12-2008
Annihilannic, Forum Advisor
Sort the checksum files by filename before you diff them.
    #6  
Old 08-16-2008
drl, Forum Advisor
Hi.

There is a script, cmptree, at Unix Review (The Shell Corner: cmptree) that may be useful. It uses cmp to compare files. cmp reads a file as binary, so non-text files can be compared successfully.

If you are solving this problem essentially once, then my feeling is that reading an entire file to compute a checksum may waste cycles if the differences occur early in the files.

In fact, the method I prefer is first to check the length of the files. This is a low-overhead operation, either with utility stat in Linux or utility ls otherwise. If the lengths are different, then the files are different. If the lengths are the same, then one can use something like cmp to compare the files.
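
The length-first approach could look roughly like the sketch below. It is not the cmptree script; the function name compare_trees and the report wording are made up for illustration. It assumes both trees are reachable from one host (for example over a network mount), reads sizes with wc -c for portability (stat -c %s would do on Linux), and only checks that everything in the source exists in the destination, not the reverse.

```shell
#!/bin/sh
# compare_trees SRC DST  (hypothetical helper name)
# Reports files that are missing from DST, differ in size, or differ in content.
compare_trees() {
    src=$1
    dst=$2
    # List paths relative to SRC; read -r with an empty IFS keeps spaces intact.
    (cd "$src" && find . -type f -print) | while IFS= read -r f; do
        if [ ! -f "$dst/$f" ]; then
            echo "missing: $f"
        elif [ "$(wc -c < "$src/$f")" -ne "$(wc -c < "$dst/$f")" ]; then
            # Cheap check first: different lengths mean different files.
            echo "size differs: $f"
        elif ! cmp -s "$src/$f" "$dst/$f"; then
            # Same length, so compare bytes; cmp stops at the first difference.
            echo "content differs: $f"
        fi
    done
}

# Example:
#   compare_trees "$src_dir" "$dst_dir" > diffreport.txt
```

File names with embedded spaces are handled, but names containing newlines are not.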

The one disadvantage that I saw in cmptree is that it does not handle filenames with embedded whitespace, so if you have such files, then the published version of cmptree will not be useful ... cheers, drl
    #7  
Old 08-18-2008
same1290
Hi drl
Thanks for that feedback. That's a good idea. Getting the lengths is probably enough of a check and a lot quicker, and speed is my main problem (with terabytes of data to check).
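
Since the two trees are on different machines, one way to do a lengths-only check is to write a size listing on each machine and diff the two listings. This is a sketch under assumptions: the helper name size_list is made up for illustration, and sizes are read with wc -c for portability (GNU find users could print sizes with -printf instead).

```shell
#!/bin/sh
# size_list DIR  (hypothetical helper name)
# Prints "<bytes><TAB>./relative/path" for every regular file, sorted by path.
size_list() {
    (cd "$1" && find . -type f -exec sh -c '
        for f do
            n=$(wc -c < "$f")
            # $(($n)) strips any padding some wc implementations add
            printf "%s\t%s\n" "$(($n))" "$f"
        done' sh {} + | LC_ALL=C sort -t "$(printf '\t')" -k 2)
}

# Example (run each line on the machine that holds that tree):
#   size_list "$src_dir" > src_dir_sizes.txt
#   size_list "$dst_dir" > dst_dir_sizes.txt
#   diff src_dir_sizes.txt dst_dir_sizes.txt
```

A size listing cannot detect a file that was corrupted without changing length, so it trades thoroughness for speed.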

Sam