Simple directory tree diff script


 
# 1  
Old 03-16-2013

I have had some issues with a data drive and have copied all of the data to a new drive. The space used differs between the two drives by about 3GB (less on the new drive). There are millions of files on the data drive, so it is not easy to determine whether some files are missing from the new copy. Is there a simple script I can run that will identify any files that are present on the original drive but missing on the new drive?

I created the copy with cp -Rfp &> logfile, and the logfile did not indicate that any files could not be copied.

I could run rsync in one direction, but there are some issues with the time stamps on the original drive, so I'm not sure how that would work. I'm not looking to correct any discrepancies, just to identify if they exist. I have found some dir diff scripts, but they all seem overcomplicated for what I need.

This is ntfs under windows XP and I am running bash under cygwin.

Thanks for the advice.

LMHmedchem
# 2  
Old 03-16-2013
This will work on cygwin - I just tried it. It will take a loooong time.

Code:
diff <(find /drivea) <(find /driveb) > /tmp/diff.txt

This will speed it up a little

Code:
find /drivea > /tmp/drivea &
find /driveb > /tmp/driveb &
wait
diff /tmp/drivea /tmp/driveb > /tmp/diff.txt

Using /tmp on some architectures really improves performance.

/drivea is the mountpoint of one filesystem, /driveb is the other mountpoint.

This will NOT check file similarity, only existence of names and directories.

[lecture]
And large (in the sense of inodes (UNIX term for file name slots)) directory trees are inherently inefficient, and become prone to errors when free inodes become scarce. i.e., 'millions' of files on a single file system are usually a really terrible idea.

Windows filesystems are not immune to this problem.

Develop a means of archiving off fasta files or whatever you are using. Save last month's files on permanent storage - disk is NOT permanent. You just discovered that, I see.

Then remove them from the disk. Just keep recent data. I realize that research or medical testing means keeping data almost forever. Ask your legal guys how long 'almost forever' is. And maybe learn about off-site archival storage in the neighborhood. Having a defined retention policy is better than trying to keep everything. Less costly, too.
[/lecture]
# 3  
Old 03-17-2013
Thanks for the tip.

On cygwin, I don't see a mount point in the way that you describe it. I usually access a drive as /cygdrive/c/ for C:, etc. I made a script using the code you suggested.

Code:
#!/usr/bin/bash

find /cygdrive/e/_test > /tmp/e_test &
find /cygdrive/i/_test > /tmp/i_test &
wait
sed 's/\/cygdrive\/e//g' /tmp/e_test > /tmp/check_e
sed 's/\/cygdrive\/i//g' /tmp/i_test > /tmp/check_i
diff  /tmp/check_e /tmp/check_i > /tmp/diff.txt

Because each path starts with a different cygdrive, I had to use sed to remove that part of each path. After adding that, it seems to work fine on the test directories I used. I will try with some larger directories and see if there are any issues.
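
As a side note, the prefix removal can be made a little stricter. The sketch below is only an illustration on a made-up two-line list (the second path is hypothetical): it uses `|` as the sed delimiter so the slashes need no escaping, and anchors the pattern with `^` so that only the leading drive prefix is removed, not a later occurrence of the same text inside a path.

```shell
# Hypothetical sample list standing in for /tmp/e_test.
list=$(printf '%s\n' \
    /cygdrive/e/_test \
    /cygdrive/e/_test/backup/cygdrive/e/notes.txt)

# '|' as the delimiter avoids escaping every '/', and '^' limits the
# substitution to the start of each line.
stripped=$(printf '%s\n' "$list" | sed 's|^/cygdrive/e||')

printf '%s\n' "$stripped"
```

With the unanchored `g` form, the second path would lose its inner `/cygdrive/e` as well.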

I will reply to your other comments later when I get this going. You are definitely not going to get a lecture in return. I have had a number of ideas as to what "permanent storage" entails and haven't landed on anything particularly useful in that regard.

LMHmedchem
# 4  
Old 03-17-2013
You don't need to use sed. Just cd and run each find command on the current working directory.
Code:
( cd /cygdrive/e/_test && find . > /tmp/e_test ) &
( cd /cygdrive/i/_test && find . > /tmp/i_test ) &

Or, it could just be done all in the same shell (which would be easier if you wanted to add error handling after each cd):
Code:
cd /cygdrive/e/_test
find . > /tmp/e_test &
cd /cygdrive/i/_test
find . > /tmp/i_test &
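
A minimal sketch of what error handling after each cd might look like. The demo directories and the helper name `list_tree` are made up for illustration; this is one possible arrangement, not the only one. Each listing runs in a background subshell, and `wait` on each PID picks up a failed cd.

```shell
# Hypothetical demo trees; replace with the real directories to compare.
demo=$(mktemp -d)
mkdir -p "$demo/e_test/sub" "$demo/i_test/sub"
touch "$demo/e_test/sub/a.txt" "$demo/i_test/sub/a.txt"

# list_tree DIR OUTFILE: list DIR relative to itself, or report the failure.
list_tree() {
    cd "$1" || { echo "cannot cd to $1" >&2; return 1; }
    find . > "$2"
}

# Run each listing in a background subshell so the cd does not
# change the working directory of the calling shell.
( list_tree "$demo/e_test" "$demo/e_list" ) &
pid_e=$!
( list_tree "$demo/i_test" "$demo/i_list" ) &
pid_i=$!

# wait PID returns that job's exit status, so a failed cd shows up here.
status=0
wait "$pid_e" || status=1
wait "$pid_i" || status=1
echo "status: $status"
```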

Keep in mind that find is not guaranteed to return the members of a directory in any particular order. Especially given the timestamp differences that you mentioned, if in just one directory a pair of subdirectories are visited in different order, diff will generate a LOT of noise even though the contents may be identical.

If that's an issue, sort the output of find and then use comm.
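
Roughly, that could look like the sketch below. The demo trees are made up, and setting `LC_ALL=C` is an extra precaution added here (so that sort and comm agree on ordering), not something the approach strictly requires.

```shell
# Hypothetical demo trees: "new" is missing one file that "old" has.
demo=$(mktemp -d)
mkdir -p "$demo/old/a" "$demo/old/b" "$demo/new/a" "$demo/new/b"
touch "$demo/old/a/1" "$demo/old/b/2" "$demo/old/b/3"
touch "$demo/new/a/1" "$demo/new/b/2"

# List each tree relative to its own root and sort the result.
( cd "$demo/old" && find . | LC_ALL=C sort > "$demo/old.sorted" ) &
( cd "$demo/new" && find . | LC_ALL=C sort > "$demo/new.sorted" ) &
wait

# comm -3 suppresses the lines common to both lists, leaving only names
# that exist in one tree but not the other.
LC_ALL=C comm -3 "$demo/old.sorted" "$demo/new.sorted" > "$demo/comm.txt"
cat "$demo/comm.txt"
```

Because both lists are sorted the same way, the order in which find happened to visit directories no longer matters.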

Regards,
Alister

Last edited by alister; 03-17-2013 at 02:57 PM..
# 5  
Old 03-17-2013
Well I ran this on my main data directory and the diff file is 500MB. This seems far too large for the size discrepancy between the two drives. Is there some particular option I should use with sort? Is there some reason to not use diff on the sorted files and use comm instead?

LMHmedchem
# 6  
Old 03-17-2013
No need to use any options with sort. The default full line lexicographical sort is appropriate.

You could use diff, I suppose. In the rare case that some of your filenames begin with a tab, the diff output will be less ambiguous than comm -3.

Regards,
Alister
# 7  
Old 03-17-2013
Well, the sorted find files differ by ~3000 lines. I take this to mean that there are ~3000 files missing from one directory. The output of comm is 3091728 lines, which is the same number of lines as in the find output for the original directory. I presume this is because column 3 of the comm output lists the files that are in both, which is output I don't need to see. I presume I want comm -3 for the output I want, meaning files that are in one directory tree and not in the other?
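
For reference, a small sketch of how the comm column options behave, using two tiny made-up lists in place of the real sorted find output. Asking for one column at a time (`-23` or `-13`) also avoids the tab-indented second column that `comm -3` produces.

```shell
# Hypothetical sorted lists standing in for the two sorted find outputs.
demo=$(mktemp -d)
printf '%s\n' ./a ./b ./c > "$demo/orig.sorted"
printf '%s\n' ./a ./c ./d > "$demo/copy.sorted"

# -23 hides columns 2 and 3: names only in the first list (missing from the copy).
only_in_orig=$(comm -23 "$demo/orig.sorted" "$demo/copy.sorted")

# -13 hides columns 1 and 3: names only in the second list (extra in the copy).
only_in_copy=$(comm -13 "$demo/orig.sorted" "$demo/copy.sorted")

echo "missing from copy: $only_in_orig"
echo "extra in copy:     $only_in_copy"
```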

LMHmedchem

---------- Post updated at 05:28 PM ---------- Previous update was at 02:43 PM ----------

This is the final script that I used. I have brushed it up a bit so that it is more generalized and checks a few things.

Code:
#!/usr/bin/bash

# accepts path to two directories and compares the file lists in each
# path should begin with /cygdrive/driveletter/

# assign location for output
TMPDIR='/cygdrive/c/cygwin/tmp/dir_compare'

# assign directory trees to compare
TREE1=$1
TREE2=$2

# check arguments, print help if no arguments are passed
if [ $# -eq 0 ]
then
   echo "this script expects two arguments"
   echo "each argument should be the path to a directory"
   echo "each path should start with /cygdrive/, not a relative path"
   echo "the script will compare the list of files in the directory and subdirectories"
   echo "and will report any instance where a file exists in one directory but not the other"
   echo 'output will be printed to '$TMPDIR'/comm.txt'
   exit
fi

# check if TREE1 exists (quoted so empty or space-containing paths are handled)
if [ ! -d "$TREE1" ]
then
   echo " "
   echo "directory $TREE1 not found"
   echo "exiting"
   exit 1
fi
# check if TREE2 exists
if [ ! -d "$TREE2" ]
then
   echo " "
   echo "directory $TREE2 not found"
   echo "exiting"
   exit 1
fi

# clean tmp dir if it contains files
# stop if it cannot be entered, so that rm never runs in the wrong directory
cd "$TMPDIR" || exit 1
rm -f -- *

# echo some information
echo " "
echo "comparing file list of " $TREE1
echo "with file list of " $TREE2
echo " "

# cd to TREE1 and create file list for tree
cd "$TREE1" || exit 1
find . > "$TMPDIR/check_1" &

# cd to TREE2 and create file list for tree
cd "$TREE2" || exit 1
find . > "$TMPDIR/check_2" &

# wait for find to finish
wait

# sort output of find so the file lists from both dir trees are in registration
sort "$TMPDIR/check_1" > "$TMPDIR/check_1_sorted"
sort "$TMPDIR/check_2" > "$TMPDIR/check_2_sorted"

# print the number of lines (files and directories) in each directory tree
wc -l "$TMPDIR/check_1_sorted"
wc -l "$TMPDIR/check_2_sorted"

# compare the two files, only print instances where a file exists in one tree but not the other
comm -3 "$TMPDIR/check_1_sorted" "$TMPDIR/check_2_sorted" > "$TMPDIR/comm.txt"

Running this script indicates that I have 6,189,828 files in each tree and the script does not find any difference in file names. I found that I had one extra directory in one of the trees. This came from some testing I was doing to see if a copy of my files had the same issue with the time stamps as the original. When I deleted this copy directory, the comm file is empty.

The only problem is that I still have a 3GB size discrepancy between the two partitions.

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
E:          879G  502G   378G   58%  /cygdrive/e
I:          831G  499G   332G   61%  /cygdrive/i

The size of the E partition didn't change when I deleted the extra directory, even though the folder was quite large. I expected that to make the sizes the same. I'm not sure what else I can do to check that my copy has all of the data from the original. The results would imply that some of the files exist on both drives, but are not the same size. Is there a reasonable way to check that? It would seem like that would be a non-trivial addition to what I am doing. Is it possible for the exact same files to be on both drives but to take up different amounts of space?
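
One way to check sizes might be to record each file's size in bytes next to its name and compare those lists the same way as the name lists. The sketch below assumes GNU find (for `-printf`, which is what cygwin normally provides); the demo files and the helper name `size_list` are made up. Note that the used space reported by df also depends on allocation unit size and filesystem metadata, so identical files can occupy different amounts of disk space, whereas byte sizes are directly comparable.

```shell
# Hypothetical demo trees: short.txt was truncated in the copy.
demo=$(mktemp -d)
mkdir -p "$demo/orig" "$demo/copy"
printf 'hello\n'         > "$demo/orig/same.txt"
printf 'hello\n'         > "$demo/copy/same.txt"
printf 'full contents\n' > "$demo/orig/short.txt"
printf 'full\n'          > "$demo/copy/short.txt"

# size_list DIR: print "path<TAB>size-in-bytes" for every regular file,
# relative to DIR, in sorted order. Requires GNU find for -printf.
size_list() {
    ( cd "$1" && find . -type f -printf '%p\t%s\n' | LC_ALL=C sort )
}

size_list "$demo/orig" > "$demo/orig.sizes"
size_list "$demo/copy" > "$demo/copy.sizes"

# Lines that differ are files that are missing or have a different size.
LC_ALL=C comm -3 "$demo/orig.sizes" "$demo/copy.sizes" > "$demo/size_diff.txt"
cat "$demo/size_diff.txt"
```

A file whose size differs shows up twice, once per tree, each time with its own size.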

LMHmedchem

---------- Post updated at 05:44 PM ---------- Previous update was at 05:28 PM ----------

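I see I had a typo in the script as I first ran it: the second sort read check_1 instead of check_2, so both sorted lists were identical and I wasn't doing the correct compare. I am running again with the corrected script.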

---------- Post updated at 06:51 PM ---------- Previous update was at 05:44 PM ----------

Running the corrected script, there are a few files that are different, but the total size is not much. I keep my browser profiles here and these are different because one is the browser I am using and one is a copy made yesterday.

There is nothing here that accounts for 3GB of data.

Any suggestions on what to do next? I suppose I could use the sorted find files to do a diff between each file pair, but that wouldn't exactly be speedy. The find files don't differentiate between files and directories and I don't know what happens if you feed diff a pair of directories instead of files.
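
A rough sketch of a pairwise content check, under the assumption that no filename contains a newline. The demo trees are made up. Restricting find to `-type f` keeps directories out of the list, and `cmp -s` is used rather than diff because it only needs to report whether two files differ. GNU diff can also compare whole trees with `diff -rq`, which may be simpler if it is fast enough.

```shell
# Hypothetical demo trees: changed.txt differs by one byte in the copy.
demo=$(mktemp -d)
mkdir -p "$demo/orig/d" "$demo/copy/d"
printf 'same\n' > "$demo/orig/d/ok.txt"
printf 'same\n' > "$demo/copy/d/ok.txt"
printf 'abcd\n' > "$demo/orig/changed.txt"
printf 'abXd\n' > "$demo/copy/changed.txt"

# Walk only regular files, so directories are never handed to cmp.
( cd "$demo/orig" && find . -type f ) | LC_ALL=C sort |
while IFS= read -r f; do
    # Skip files missing from the copy; the name comparison already reports those.
    [ -f "$demo/copy/$f" ] || continue
    # cmp -s prints nothing and exits non-zero when the contents differ.
    cmp -s "$demo/orig/$f" "$demo/copy/$f" || printf '%s\n' "$f"
done > "$demo/content_diff.txt"

cat "$demo/content_diff.txt"
```

Reading every byte of both trees will be slow on millions of files, so it may be worth running the size comparison first and byte-comparing only what remains in doubt.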

LMHmedchem