Simple directory tree diff script

03-16-2013


I have had some issues with a data drive and have copied all of the data to a new drive. The space used is not the same on both drives: there is a 3GB difference (less on the new drive). There are millions of files on the data drive, so it is not easy to determine whether some files are missing from the new copy. Is there a simple script I can run that will identify any files that are present on the original drive but missing on the new drive?

I created the copy with cp -Rfp &> logfile, and the logfile did not indicate that any files could not be copied.

I could run rsync in one direction, but there are some issues with the time stamps on the original drive, so I'm not sure how that would work. I'm not looking to correct any discrepancies, just to identify whether they exist. I have found some dir diff scripts, but they all seem overcomplicated for what I need.

This is NTFS under Windows XP and I am running bash under cygwin.

Thanks for the advice.

LMHmedchem


03-16-2013


This will work on cygwin - I just tried it. It will take a loooong time.

Code:

diff <(find /drivea) <(find /driveb) > /tmp/diff.txt

This will speed it up a little:

Code:

find /drivea > /tmp/drivea &
find /driveb > /tmp/driveb &
wait
diff /tmp/drivea /tmp/driveb > /tmp/diff.txt

Using /tmp on some architectures really improves performance.

/drivea is the mountpoint of one filesystem, /driveb is the other mountpoint.

This will NOT check file similarity, only existence of names and directories.

[lecture]
Large directory trees (large in the sense of inodes, the UNIX term for file name slots) are inherently inefficient and become prone to errors when free inodes become scarce. In other words, 'millions' of files on a single file system is usually a really terrible idea.

Windows filesystems are not immune to this problem.

Develop a means of archiving off fasta files or whatever you are using. Save last month's files on permanent storage - disk is NOT permanent. You just discovered that, I see.

Then remove them from the disk. Just keep recent data. I realize that research or medical testing means keeping data almost forever. Ask your legal guys how long 'almost forever' is. And maybe learn about off-site archival storage in the neighborhood. Having a defined retention policy is better than trying to keep everything. Less costly, too.

[/lecture]

jim mcnamara

03-17-2013


Thanks for the tip.

On cygwin, I don't see a mount point in the way that you describe it. I usually access drives as /cygdrive/c/ for C:, etc. I made a script using the code you suggested.

Code:

#!/usr/bin/bash

# list both trees in parallel
find /cygdrive/e/_test > /tmp/e_test &
find /cygdrive/i/_test > /tmp/i_test &
wait
# strip the drive-specific prefix, anchored to the start of each path
sed 's|^/cygdrive/e||' /tmp/e_test > /tmp/check_e
sed 's|^/cygdrive/i||' /tmp/i_test > /tmp/check_i
diff /tmp/check_e /tmp/check_i > /tmp/diff.txt

Because each path starts with a different cygdrive prefix, I had to use sed to remove that part of each path. After adding that, it seems to work fine on the test directories I used. I will try with some larger directories and see if there are any issues.

I will reply to your other comments later when I get this going. You are definitely not going to get a lecture in return. I have had a number of ideas as to what "permanent storage" entails and haven't landed on anything particularly useful in that regard.

LMHmedchem


03-17-2013


You don't need to use sed. Just cd and run each find command on the current working directory.

Code:

( cd /cygdrive/e/_test && find . > /tmp/e_test ) &
( cd /cygdrive/i/_test && find . > /tmp/i_test ) &

Or, it could just be done all in the same shell (which would be easier if you wanted to add error handling after each cd):

Code:

cd /cygdrive/e/_test
find . > /tmp/e_test &
cd /cygdrive/i/_test
find . > /tmp/i_test &
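
For what it's worth, here is a minimal sketch of what error handling after each cd could look like. The helper name list_tree and the scratch directories are made up for the example (they stand in for /cygdrive/e/_test and /cygdrive/i/_test); the second call deliberately points at a directory that does not exist.

```shell
# Scratch area so the sketch can run anywhere; not part of the original commands.
work=$(mktemp -d)
mkdir -p "$work/e_test/sub"

# list_tree is a hypothetical helper: cd into a tree and start find in the
# background, or report the failure instead of listing the wrong directory.
list_tree() {
    cd "$1" || { echo "cannot cd to $1, skipping" >&2; return 1; }
    find . > "$2" &
}

list_tree "$work/e_test" "$work/e_list"
status_e=$?
list_tree "$work/no_such_dir" "$work/i_list"
status_i=$?

# wait for any background find to finish, then leave the tree
wait
cd "$work" || exit 1

echo "e_test status: $status_e, no_such_dir status: $status_i"
```

Without the check, a failed cd would leave the shell in the previous directory and the second find would quietly list the first tree again.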

Keep in mind that find is not guaranteed to return the members of a directory in any particular order. Especially given the timestamp differences that you mentioned, if in just one directory a pair of subdirectories are visited in a different order, diff will generate a LOT of noise even though the contents may be identical.

If that's an issue, sort the output of find and then use comm.
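
A minimal sketch of the sort-then-comm approach, run against two small scratch trees so it can be tried anywhere (the directory names orig and copy are placeholders, not real paths from this thread). Setting LC_ALL=C is an assumption on my part; the point is only that sort and comm must agree on the collation order.

```shell
# Build two tiny trees; the copy is missing sub/b.
work=$(mktemp -d)
mkdir -p "$work/orig/sub" "$work/copy/sub"
touch "$work/orig/a" "$work/orig/sub/b" "$work/copy/a"

# Use one collation order for both sort and comm.
export LC_ALL=C

# List each tree relative to its own root and sort the list.
( cd "$work/orig" && find . | sort > "$work/orig.sorted" )
( cd "$work/copy" && find . | sort > "$work/copy.sorted" )

# comm -3 drops the lines common to both lists. Names found only in the
# first list start in column 1; names found only in the second list are
# indented by one tab.
comm -3 "$work/orig.sorted" "$work/copy.sorted" > "$work/comm.txt"
cat "$work/comm.txt"
```

Because both lists are sorted the same way, a different traversal order in one tree no longer produces any output.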

Regards,
Alister


03-17-2013


Well I ran this on my main data directory and the diff file is 500MB. This seems far too large for the size discrepancy between the two drives. Is there some particular option I should use with sort? Is there some reason to not use diff on the sorted files and use comm instead?

LMHmedchem


03-17-2013


No need to use any options with sort. The default full line lexicographical sort is appropriate.

You could use diff, I suppose. In the rare case that some of your filenames begin with a tab, the diff output will be less ambiguous than comm -3.

Regards,
Alister


03-17-2013


Well the sorted find files differ by ~3000 lines. I take this to mean that there are ~3000 files that are missing from one of the directories. The output of comm is 3091728 lines, which is the same number of lines as are in the find output for the original directory. I presume this is because the column 3 output of comm lists the files that are in both trees, which is output I don't need to see. I presume I want comm -3 for the output I want, meaning files that are in one directory tree and not in the other?
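
To make the column behaviour concrete, here is a small sketch using two hand-written sorted lists (the file names are invented). comm -3 hides the third column (lines in both files), comm -23 leaves only lines unique to the first file, and comm -13 leaves only lines unique to the second.

```shell
# Two small, already sorted lists standing in for the sorted find output.
work=$(mktemp -d)
export LC_ALL=C
printf '%s\n' ./a ./b ./c > "$work/check_1_sorted"
printf '%s\n' ./a ./c ./d > "$work/check_2_sorted"

# Names only in the first list (present on the original, missing on the copy).
only_in_1=$(comm -23 "$work/check_1_sorted" "$work/check_2_sorted")

# Names only in the second list.
only_in_2=$(comm -13 "$work/check_1_sorted" "$work/check_2_sorted")

# Both kinds of difference in one listing; second-list names are tab-indented.
both_ways=$(comm -3 "$work/check_1_sorted" "$work/check_2_sorted")

echo "only in tree 1: $only_in_1"
echo "only in tree 2: $only_in_2"
```

For the original question (files on the old drive that are missing from the new one), comm -23 with the old drive's list first gives exactly that set with no indentation to interpret.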

LMHmedchem

---------- Post updated at 05:28 PM ---------- Previous update was at 02:43 PM ----------

This is the final script that I used. I have brushed it up a bit so that it is more generalized and checks a few things.

Code:

#!/usr/bin/bash

# accepts path to two directories and compares the file lists in each
# path should begin with /cygdrive/driveletter/

# assign location for output
TMPDIR='/cygdrive/c/cygwin/tmp/dir_compare'

# assign directory trees to compare
TREE1=$1
TREE2=$2

# check arguments, print help unless exactly two arguments are passed
if [ $# -ne 2 ]
then
   echo "this script expects two arguments"
   echo "each argument should be the path to a directory"
   echo "each path should start with /cygdrive/, not a relative path"
   echo "the script will compare the list of files in the directory and subdirectories"
   echo "and will report any instance where a file exists in one directory but not the other"
   echo "output will be printed to $TMPDIR/comm.txt"
   exit 1
fi

# check if TREE1 exists (quoted so paths with spaces work)
if [ ! -d "$TREE1" ]
then
   echo " "
   echo "directory $TREE1 not found"
   echo "exiting"
   exit 1
fi
# check if TREE2 exists
if [ ! -d "$TREE2" ]
then
   echo " "
   echo "directory $TREE2 not found"
   echo "exiting"
   exit 1
fi

# create the tmp dir if needed and remove any files left from a previous run
# (no cd here, so a failed cd can never cause rm to run in the wrong directory)
mkdir -p "$TMPDIR" || exit 1
rm -f "$TMPDIR"/*

# echo some information
echo " "
echo "comparing file list of " $TREE1
echo "with file list of " $TREE2
echo " "

# cd to TREE1 and create file list for tree
cd "$TREE1" || exit 1
find . > "$TMPDIR/check_1" &

# cd to TREE2 and create file list for tree
cd "$TREE2" || exit 1
find . > "$TMPDIR/check_2" &

# wait for find to finish
wait

# sort the output of find so that the file lists from both trees are in the same order
sort "$TMPDIR/check_1" > "$TMPDIR/check_1_sorted"
sort "$TMPDIR/check_2" > "$TMPDIR/check_2_sorted"

# print the number of lines (files) in each directory tree
wc -l $TMPDIR'/check_1_sorted'
wc -l $TMPDIR'/check_2_sorted'

# compare the two files, only print instances where a file exists in one tree but not the other
comm -3  $TMPDIR'/check_1_sorted'  $TMPDIR'/check_2_sorted' > $TMPDIR'/comm.txt'

Running this script indicates that I have 6,189,828 files in each tree and the script does not find any difference in file names. I found that I had one extra directory in one of the trees. This came from some testing I was doing to see if a copy of my files had the same issue with the time stamps as the original. When I deleted this copy directory, the comm file was empty.

The only problem is that I still have a 3GB size discrepancy between the two partitions.

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
E:          879G  502G  378G   58%   /cygdrive/e
I:          831G  499G  332G   61%   /cygdrive/i

The size of the E partition didn't change when I deleted the extra directory, even though the folder was quite large. I expected that to make the sizes the same. I'm not sure what else I can do to check that my copy has all of the data from the original. The results would imply that some of the files exist on both drives, but are not the same size. Is there a reasonable way to check that? It would seem like that would be a non-trivial addition to what I am doing. Is it possible for the same exact files to be on both drives but to take up different amounts of space?
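
One way the size check could be sketched, assuming GNU find (the one cygwin ships), is to record each regular file's length in bytes next to its relative path and compare the two lists the same way as before. The scratch trees and file names below are invented for the example. Note that df reports allocated space rather than the sum of file lengths, so per-file byte counts are the more direct comparison.

```shell
# Two scratch trees; data.txt is shorter in the copy, ok.txt is identical.
work=$(mktemp -d)
mkdir -p "$work/orig" "$work/copy"
printf 'full contents\n' > "$work/orig/data.txt"
printf 'full\n'          > "$work/copy/data.txt"
printf 'same\n'          > "$work/orig/ok.txt"
printf 'same\n'          > "$work/copy/ok.txt"

export LC_ALL=C

# GNU find -printf: %P is the path relative to the starting point,
# %s is the file size in bytes. -type f limits the list to regular files.
find "$work/orig" -type f -printf '%P\t%s\n' | sort > "$work/sizes_1"
find "$work/copy" -type f -printf '%P\t%s\n' | sort > "$work/sizes_2"

# Any file whose size differs shows up once per tree, with its size.
comm -3 "$work/sizes_1" "$work/sizes_2" > "$work/size_diff.txt"
cat "$work/size_diff.txt"
```

This only adds one stat per file to the existing find pass, so it should cost little more than the name-only listing.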

LMHmedchem

---------- Post updated at 05:44 PM ---------- Previous update was at 05:28 PM ----------

I see I had a typo in the script (the second sort read check_1 instead of check_2), so I wasn't doing the correct compare. I am running again with the corrected script.

---------- Post updated at 06:51 PM ---------- Previous update was at 05:44 PM ----------

Running the corrected script, there are a few files that are different, but the total size is not much. I keep my browser profiles here and these are different because one is the browser I am using and one is a copy made yesterday.

There is nothing here that accounts for 3GB of data.

Any suggestions on what to do next? I suppose I could use the sorted find files to do a diff between each file pair, but that wouldn't exactly be speedy. The find files don't differentiate between files and directories and I don't know what happens if you feed diff a pair of directories instead of files.
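
A rough sketch of a per-file content check, under the assumption that no file name contains a newline. It lists only regular files (find -type f), which sidesteps the question of feeding directories to diff, and uses cmp -s, which reports only whether two files differ. The scratch trees are placeholders for the real ones.

```shell
# Two scratch trees; sub/changed.txt has the same size but different bytes.
work=$(mktemp -d)
mkdir -p "$work/orig/sub" "$work/copy/sub"
printf 'alpha\n' > "$work/orig/same.txt"
printf 'alpha\n' > "$work/copy/same.txt"
printf 'beta\n'  > "$work/orig/sub/changed.txt"
printf 'BETA\n'  > "$work/copy/sub/changed.txt"

tree1="$work/orig"
tree2="$work/copy"
export LC_ALL=C

# Walk the regular files of the first tree and compare each one with the
# file of the same relative path in the second tree.
( cd "$tree1" && find . -type f ) | sort |
while IFS= read -r f
do
    # Files missing from tree2 are skipped here; the name comparison finds those.
    if [ -f "$tree2/$f" ] && ! cmp -s "$tree1/$f" "$tree2/$f"
    then
        printf '%s\n' "$f"
    fi
done > "$work/content_diff.txt"

cat "$work/content_diff.txt"
```

With millions of files this reads every byte on both drives, so it will not be quick, but it needs no temporary storage beyond the list of differing names.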

LMHmedchem
