Best way to diff two huge directory trees

#1  same1290, 08-12-2008

Hi

I have a job that will be running nightly incremental backups of a large directory tree.

I did the initial backup, now I want to write a script to verify that all the files were transferred correctly. I did something like this which works in principle on small trees:

diff -r -q $src_dir $dst_dir >& diffreport.txt

The problem with this is that it is very slow. The directory I am backing up is about 2 TB.

I also tried using find and sum to dump the checksums to two files, one for the source directory and one for the destination, and then comparing them. These are the commands I used:

find $src_dir -type f -print0 | xargs -0 sum > src_dir_checksums.txt
find $dst_dir -type f -print0 | xargs -0 sum > dst_dir_checksums.txt
diff src_dir_checksums.txt dst_dir_checksums.txt

But for some reason this produces a different search order for the two directories which are on different machines.

Any help would be greatly appreciated.

Thanks in advance,
Sam

#2  danmero, 08-12-2008
Try rsync; you can search for "rsync incremental backup".
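
rsync can also be used for the verification itself, not only for the backup, by doing a checksum dry run. The sketch below assumes rsync is installed on both machines; verify_with_rsync is a hypothetical helper name, not something from this thread. Note that -c makes rsync read every file on both sides, so this will still take a long time on 2 TB.

```shell
#!/bin/sh
# verify_with_rsync SRC DST  (hypothetical helper name)
# Lists what rsync would change to make DST match SRC, without changing anything.
#   -r                 recurse into directories
#   -n                 dry run: report only, transfer nothing
#   -c                 decide by checksum instead of size and modification time
#   --itemize-changes  print one line per item that differs
#   --delete           also report files that exist only in DST ("*deleting")
# The trailing slashes make rsync compare the contents of the two directories.
verify_with_rsync() {
    rsync -rnc --itemize-changes --delete "$1/" "$2/"
}
```

It could be called as verify_with_rsync "$src_dir" "$dst_dir" > diffreport.txt, and the destination can also be a remote path of the form host:/path. An empty report should mean that rsync found nothing to transfer or delete.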

#3  Ikon, 08-12-2008
What about just comparing the output of du for each tree?


Code:
# cd /path/to/directory
# du
16      ./somedir
7200    ./somedir/1
1200    ./somedir/2
80      ./someotherdir
14512   .

This won't check that the files are exact copies, but it would verify the total size of the files in each directory.

#4  same1290, 08-12-2008
Hi,
Thanks for your reply.

I actually already compared the sizes using du. They're quite similar but not the same. I think that may be because the sizes of the directory entries are also part of the total, and those differ between the machines where the two trees are stored (that's just a guess).

So I think I need something more reliable.

Sam

#5  Annihilannic, 08-12-2008
Sort the checksum files by filename before you diff them.
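
A minimal sketch of that suggestion follows. It makes three assumptions that go beyond the thread: it swaps sum for cksum (POSIX specifies that cksum prints the pathname whenever a file operand is given, while the output format of sum varies between systems), it changes into each tree first so that both lists contain the same relative paths, and it forces the C locale so that both machines sort identically. checksum_tree is a hypothetical helper name. Filenames containing newlines would break the line-based comparison.

```shell
#!/bin/sh
# checksum_tree DIR OUTFILE  (hypothetical helper name)
# Writes one "checksum size ./relative/path" line per regular file in DIR,
# sorted by path, to OUTFILE.
checksum_tree() {
    # The subshell keeps the cd from affecting the caller; the redirection
    # is outside it, so a relative OUTFILE is resolved from the caller's
    # working directory.
    (
        cd "$1" || exit 1
        # cksum prints: CRC, byte count, path. Sort on field 3 (the path)
        # through to the end of the line.
        find . -type f -exec cksum {} + | LC_ALL=C sort -k 3
    ) > "$2"
}
```

Run checksum_tree "$src_dir" src_dir_checksums.txt on the source machine and checksum_tree "$dst_dir" dst_dir_checksums.txt on the destination machine, copy one file next to the other, and then diff src_dir_checksums.txt dst_dir_checksums.txt as before.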

#6  drl, 08-16-2008
Hi.

There is a script, cmptree, at Unix Review > The Shell Corner: cmptree, which may be useful. It uses cmp to compare files. The cmp utility reads a file as binary, so non-text files can be compared successfully.

If you are solving this problem essentially once, then my feeling is that reading each entire file to compute a checksum may waste cycles when the differences occur early in the files.

In fact, the method I prefer is first to check the length of the files. This is a low-overhead operation, either with utility stat in Linux or utility ls otherwise. If the lengths are different, then the files are different. If the lengths are the same, then one can use something like cmp to compare the files.

The one disadvantage that I saw in cmptree is that it does not handle filenames with embedded whitespace, so if you have such files, then the published version of cmptree will not be useful ... cheers, drl
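
The length-first approach described above could be sketched as follows. This is not the cmptree script; compare_trees is a hypothetical helper name, and it assumes both trees are reachable from one machine (for example over a network mount). It uses wc -c to get the length as a portable stand-in for stat or ls; whether wc -c avoids reading the file depends on the implementation. Filenames with spaces are handled, but filenames containing newlines are not, and files that exist only in the destination are not reported.

```shell
#!/bin/sh
# compare_trees SRC DST  (hypothetical helper name)
# For every regular file under SRC, report if it is missing from DST, has a
# different length, or has the same length but different content.
compare_trees() {
    src=$1
    dst=$2
    # List paths relative to SRC; the cd happens in a subshell, so the
    # loop below still runs in the caller's working directory.
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        if [ ! -f "$dst/$f" ]; then
            echo "missing: $f"
            continue
        fi
        src_size=$(wc -c < "$src/$f")
        dst_size=$(wc -c < "$dst/$f")
        if [ "$src_size" -ne "$dst_size" ]; then
            # Different lengths: the files differ, no need to read them.
            echo "size differs: $f"
        elif ! cmp -s "$src/$f" "$dst/$f"; then
            # Same length: cmp stops at the first differing byte.
            echo "content differs: $f"
        fi
    done
}
```

For example, compare_trees "$src_dir" "$dst_dir" > diffreport.txt leaves an empty report when every source file has an identical copy in the destination.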

#7  same1290, 08-18-2008
Hi drl
Thanks for that feedback. That's a good idea. Getting the lengths is probably enough of a check and a lot quicker, which is my main concern (with terabytes of data to check).

Sam
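
Since the two trees are on different machines, a length-only check could be done by writing a size list on each machine and diffing the two lists. The sketch below assumes GNU find, whose -printf action prints the size in bytes (%s) and the path (%p) without opening the files; size_manifest is a hypothetical helper name. Equal lengths do not prove equal contents, so this only catches missing, extra, truncated, or otherwise resized files.

```shell
#!/bin/sh
# size_manifest DIR OUTFILE  (hypothetical helper name; requires GNU find)
# Writes one "size<TAB>./relative/path" line per regular file in DIR,
# sorted by path in the C locale so both machines use the same order.
size_manifest() {
    (
        cd "$1" || exit 1
        find . -type f -printf '%s\t%p\n' |
            LC_ALL=C sort -t "$(printf '\t')" -k 2
    ) > "$2"
}
```

Run size_manifest "$src_dir" src_sizes.txt on the source machine and size_manifest "$dst_dir" dst_sizes.txt on the destination machine, then diff src_sizes.txt dst_sizes.txt.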