Best way to diff two huge directory trees


 
# 1  
Old 08-12-2008

Hi

I have a job that will run nightly incremental backups of a large directory tree.

I did the initial backup, now I want to write a script to verify that all the files were transferred correctly. I did something like this which works in principle on small trees:

diff -r -q "$src_dir" "$dst_dir" > diffreport.txt 2>&1

The problem with this is that it is very slow. The directory I am backing up is about 2 TB.

I also tried using find and sum to dump the checksums to two files, one for the source directory and one for the destination, and then comparing them. These are the commands I used:

find "$src_dir" -type f -print0 | xargs -0 sum > src_dir_checksums.txt
find "$dst_dir" -type f -print0 | xargs -0 sum > dst_dir_checksums.txt
diff src_dir_checksums.txt dst_dir_checksums.txt

But for some reason this lists the files in a different order for the two directories, which are on different machines.

Any help would be greatly appreciated.

Thanks in advance,
Sam
# 2  
Old 08-12-2008
Try rsync, you can google for rsync incremental backup.
# 3  
Old 08-12-2008
What about just comparing the output of

Code:
# cd /path/to/directory
# du
16      ./somedir
7200    ./somedir/1
1200    ./somedir/2
80      ./someotherdir
14512   .

This won't check that the files are exact copies, but it would verify the size of the files in the directories.
# 4  
Old 08-12-2008
Hi,
Thanks for your reply.

I actually already compared the sizes using du. They're quite similar but not the same. I think that may be because the sizes of the directory entries are also part of the total, and those differ between the machines where the two trees are stored (that's just a guess).

So I think I need something more reliable.

Sam
# 5  
Old 08-12-2008
Sort the checksum files by filename before you diff them.
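A minimal sketch of that suggestion, under some assumptions: the helper name checksum_tree is made up for illustration, cksum is swapped in for sum because it always prints the filename, and find is run from inside each tree so that the paths are relative and match on both machines. Filenames containing newlines would still confuse the line-based sort.

```shell
#!/bin/sh
# checksum_tree DIR (hypothetical helper, not from the thread):
# print "checksum size ./relative/path" for every regular file, sorted by path.
checksum_tree() {
    # cd inside a subshell so the caller's working directory is unchanged
    # and paths are relative to the tree root.
    # LC_ALL=C gives the same byte-wise sort order on every machine.
    ( cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort -k3 )
}

# Usage, with the variable names from the first post:
#   checksum_tree "$src_dir" > src_dir_checksums.txt
#   checksum_tree "$dst_dir" > dst_dir_checksums.txt
#   diff src_dir_checksums.txt dst_dir_checksums.txt
```

Sorting on the third field (the path) rather than on the whole line keeps a changed file on the same row in both reports, so diff shows it as a single changed line.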
# 6  
Old 08-17-2008
Hi.

There is a script, cmptree, at Unix Review (The Shell Corner: cmptree), which may be useful. It uses cmp to compare files. Utility cmp reads a file as binary, so non-text files can be compared successfully.

If you are solving this problem essentially once, then my feeling is that reading an entire file to get the checksum may waste cycles if the differences occur early in the files.

In fact, the method I prefer is first to check the length of the files. This is a low-overhead operation, either with utility stat in Linux or utility ls otherwise. If the lengths are different, then the files are different. If the lengths are the same, then one can use something like cmp to compare the files.

The one disadvantage that I saw in cmptree is that it does not handle filenames with embedded whitespace, so if you have such files, then the published version of cmptree will not be useful ... cheers, drl
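The length-first, then cmp approach described in this post could be sketched roughly as follows. This is not the cmptree script; compare_trees and the MISSING/SIZE/CONTENT labels are hypothetical names chosen for illustration. It assumes both trees are reachable from one machine (for example over a network mount) and that no filename contains a newline; embedded spaces are handled.

```shell
#!/bin/sh
# compare_trees SRC DST (hypothetical helper): report files under SRC that
# are missing from DST, differ in length, or differ in content.
compare_trees() {
    src=$1
    dst=$2
    # List regular files relative to SRC; the loop itself runs in the
    # caller's directory, so relative SRC/DST arguments still resolve.
    ( cd "$src" && find . -type f ) |
    while IFS= read -r rel; do
        if [ ! -f "$dst/$rel" ]; then
            echo "MISSING: $rel"
        # Cheap check first: different lengths mean different files.
        # wc -c is used for portability; on Linux, stat -c %s reads the
        # length from metadata alone.
        elif [ "$(wc -c < "$src/$rel")" -ne "$(wc -c < "$dst/$rel")" ]; then
            echo "SIZE: $rel"
        # Same length: cmp stops at the first differing byte.
        elif ! cmp -s "$src/$rel" "$dst/$rel"; then
            echo "CONTENT: $rel"
        fi
    done
}

# Usage: compare_trees "$src_dir" "$dst_dir" > diffreport.txt
```

Files that exist only in the destination are not reported; running the function a second time with the arguments swapped would list them as MISSING.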
# 7  
Old 08-18-2008
Hi drl
Thanks for that feedback. That's a good idea. Getting the lengths is probably enough of a check and a lot quicker, and speed is my main problem (with terabytes of data to check).

Sam
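For a length-only check like the one settled on here, one possible sketch is a sorted size manifest per tree. It assumes GNU find (the -printf option is not in POSIX), and size_manifest is a made-up helper name. Unlike du totals, it lists only the byte length of each regular file, so directory entry sizes do not affect the result.

```shell
#!/bin/sh
# size_manifest DIR (hypothetical helper): print "<bytes><TAB><relative path>"
# for every regular file under DIR, sorted by path.
# Assumes GNU find for -printf.
size_manifest() {
    tab=$(printf '\t')
    ( cd "$1" && find . -type f -printf '%s\t%p\n' | LC_ALL=C sort -t "$tab" -k2 )
}

# Usage: run once on each machine, then diff the two reports.
#   size_manifest "$src_dir" > src_dir_sizes.txt
#   size_manifest "$dst_dir" > dst_dir_sizes.txt
#   diff src_dir_sizes.txt dst_dir_sizes.txt
```

Matching lengths do not prove matching content, so this only catches truncated, missing, or extra files.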