Simple directory tree diff script


 
# 8  
Old 03-17-2013
If you want to determine if there are any differences, the only way is to read every byte of every file and compare it to its counterpart. Otherwise, even if file sizes match, there could still be a discrepancy. cmp may be useful for that task.
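For example, a minimal sketch of a byte-for-byte check on one pair of files (the directory and file names here are placeholders); with -s, cmp stays silent and just sets its exit status:
Code:
# the exit status of cmp -s is 0 only when the two files are byte-for-byte identical
if cmp -s "$dir1/$name" "$dir2/$name"; then
    echo "identical: $name"
else
    echo "DIFFER:    $name"
fi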

du (and df) measure the amount of storage allocated for files. They do not report file sizes. Two identical files may consume different amount of storage on different partitions/filesystems. One factor that may affect the storage allocated to a file is the block size of the file system. Another factor is sparseness.

A very large sparse file may occupy very little space on disk even though ls and stat report a large file size. But, if that file is copied to a filesystem that does not support sparse files, or using a tool that doesn't support sparse files, the disk space consumed will balloon to match the file's size as reported by ls/stat.
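The effect is easy to demonstrate, assuming GNU coreutils (truncate, stat, du, cp); the file names are just examples:
Code:
truncate -s 1G sparse.bin                  # 1 GiB apparent size, no data blocks written
stat -c '%s bytes apparent, %b blocks allocated' sparse.bin
du -h sparse.bin                           # only a few KB of real disk usage
cp --sparse=never sparse.bin full.bin      # force the copy to allocate every block
du -h full.bin                             # now roughly 1 GiB on disk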

Regards,
Alister
# 9  
Old 03-18-2013
I suspect that there is simply an issue with the reported size, since all of the file names match. Both partitions are windows ntfs, and the copy is smaller than the source. The copy was made using cp -Rfp under cygwin, so it may be that cp stored the copies more efficiently than they were stored in the original versions.

If I were to use the two sorted find files as a starting list for cmp, how would I differentiate the entries in the sorted list that are directories from those that are files? Since cmp is for files, will it just throw an error if what you pass to it is a directory?
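One way to tell them apart would be to test each entry before handing it to cmp; a minimal sketch, using placeholder variables $left and $right for the two full paths:
Code:
# skip anything that is not a regular file on both sides
if [ -f "$left" ] && [ -f "$right" ]; then
    cmp "$left" "$right"
fi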

---------- Post updated 03-18-13 at 12:31 AM ---------- Previous update was 03-17-13 at 09:16 PM ----------

I have added the following to the end of my script to check each file pair with cmp. If I get this working, I will add logic to run this part based on an argument.
Code:
# further process sorted find list by checking each file pair with cmp
while read input
do
#  remove leading . from each line in find output
   TEMP=$(echo $input | sed 's/^.//g')
#  escape spaces
   LOCALFILE=$(echo $TEMP | sed 's/ /\\ /g')

   echo $TREE1$LOCALFILE
   echo $TREE2$LOCALFILE

   cmp $TREE1$LOCALFILE  $TREE2$LOCALFILE > $TMPDIR'/byte_compare.txt'

done < $TMPDIR'/check_1_sorted'

This uses one of the sorted find lists to identify each file. If the entry from find is a directory, it seems as if cmp just prints a notification to stderr and moves on. The problem I am having now is that cmp won't accept what I have done above to escape the spaces in file names. I have echoed the path for each file, and it appears correct, but I am getting an error from cmp,

/cygdrive/e/nlite/Presets/Last\ Session_u.ini
/cygdrive/i/nlite/Presets/Last\ Session_u.ini
cmp: invalid --ignore-initial value `/cygdrive/i/nlite/Presets/Last\'

cmp doesn't seem to be seeing anything past the escape. Am I not escaping this properly? If I don't escape the space, I get a similar error indicating that the space is breaking the input.

LMHmedchem
# 10  
Old 03-18-2013
Keep in mind that disk space is always allocated in clusters, usually 4k, so a one-byte file still uses 4k of disk space. This would still mean, though, that (not quite) a million files waste up to 4k of disk space each.
Are the two disks using the same file system type? There are filesystems out there that make intelligent use of inode list space to store incomplete clusters, while others store those incomplete clusters on disk and "waste" the empty part of the cluster.
And, finally, the two disks have different sizes. Not by much, but it may be enough to create a difference in the overhead of the filesystems' managing structures.
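The cluster rounding is easy to see with GNU coreutils (the file name is just an example):
Code:
printf 'x' > tiny.txt                                 # a 1-byte file
stat -c '%s bytes, %b blocks of %B bytes' tiny.txt    # apparent size vs. allocation
du -k tiny.txt                                        # typically reports 4 (one 4 KB cluster)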

# 11  
Old 03-18-2013
Hi.

Observations:

1) I agree with jim mcnamara about "millions of files" (especially if they are in just a few directories -- *nix filesystems can handle that, but at a cost of several indirect lookups), and with RudiC that allocation sizes may be involved.

2) If I were doing this, I would first compare the lengths of the files. The size is easy and fast to obtain with stat, whether from the stat command, perl, C, etc. Then only if a file pair had different sizes would I investigate further.

3) There is some information about backslash-escaping in cygwin at bash - Cygwin: using a path variable containing a windows path (with a space in it) - Stack Overflow -- I attribute that to the use of "\" as a path separator in MS systems -- basically, the advice seems to be to use quotes (a brief illustration is sketched after this list).

4) My vague recollection is that directory sizes are never decreased even when significant numbers of files and sub-directories are removed, at least in *nix. I have no idea if that concept holds in MS systems.

5) This problem may be in a gray area between *nix-like systems and MS systems. About the latter I know very little.
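To illustrate the quoting point in 3), a minimal sketch, assuming bash; the file name is just an example:
Code:
f='Last Session_u.ini'
printf '<%s>\n' $f      # unquoted: word splitting yields <Last> and <Session_u.ini>
printf '<%s>\n' "$f"    # double-quoted: one word, <Last Session_u.ini>

g='Last\ Session_u.ini'
printf '<%s>\n' $g      # a backslash stored in the variable stays literal: <Last\> <Session_u.ini>

That matches the error cmp reported: the path split into several arguments, and cmp tried to interpret the extra operand as a skip count.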

Best wishes ... cheers, drl
# 12  
Old 03-18-2013
My file system started with a single primary data directory. This was primarily for the purposes of data backup, since it simplified the rsync setup. At this point, I have 4 primary data directories. The vast majority of the files on this drive are chemical structures in electronic format (mol files and SMILES strings). Mission critical data, such as src code, exists on DVD and even in hard copy printout at other locations. Electronic structure data defies some such permanent storage solutions, since it is of little or no value when printed on paper. Archiving such data means moving it to another hard drive, or possibly a DVD. I am skeptical about optical storage, since I have a case of CDs downstairs that I purchased with our silk-screened logo on them. That was a few years ago, but they are already unreadable by any software that I can find. When I put one of them into an optical drive, I get a message that the disk is not in a readable format, so I cannot write to them. I think it is understandable that this kind of thing makes me hesitant about storing data on such a medium.

The solution I have taken to is to have every important file on at least 4 hard drives, over at least two locations. This means two internal hard drives, synced with rsync, and two external hard drives (one off site). I replace hard drives every two years and have had good luck with this solution up to this point. On larger drives, some of my newer setups have two partitions with a smaller "working" partition at the outer edge of the drive and a larger "archive" partition for the rest.

This system does not keep things up to date in real time (like a raid1 would), but raid has its issues as well. I have lost many more files through my own stupidity of accidentally deleting things than I ever have through hardware or software failures. Not even a raid array can protect you from being a moron from time to time, oh that it could...

I can certainly spread my data over more directories at higher levels, or even add more partitions. All of these partitions are ntfs if that matters. I am actually getting ready to rebuild this rig, so now would be a good time to make changes. I don't often do searches from higher up directories, since there are individual project folders.

I can move things around to whatever extent would be helpful, but there are still millions of files that need to be kept somewhere (several somewheres for backup). I can dump many of them onto external drives and put them in the firesafe, but I don't know if they would be any better preserved there than in an archive partition. How long can you leave a hard drive sitting in the closet and still expect it to fire up? I guess there might not be much data on that at this point, since 1TB drives are only a few years old.

As far as my current script goes, changing to double quotes seems to work,

cmp "$TREE1$LOCALFILE" "$TREE2$LOCALFILE" > $TMPDIR'/byte_compare.txt'

I know I tried this with quotes, but it must have been single quotes. I removed the code to escape spaces. I also tried stat by doing,

Code:
SIZE1=$(stat -c%s "$TREE1$LOCALFILE")
SIZE2=$(stat -c%s "$TREE2$LOCALFILE")

if [ "$SIZE1" != "$SIZE2" ]; then
   echo "$TREE1$LOCALFILE" >> $TMPDIR'/size_compare.txt'
fi

This takes about 20% longer than doing cmp, so unless I have set this up incorrectly, there doesn't seem to be a performance advantage, especially if you are going to run cmp anyway when you find files of different size (I am using size in bytes synonymously with length, so let me know if that is not correct).

Thanks for all of the help so far.

LMHmedchem
# 13  
Old 03-18-2013
Hi.

I may be comparing apples(MS systems) to oranges (*nix systems), but here is a timing comparison of stat and cmp on a GNU/Linux box, with 2 identical files:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate compare timings for stat and cmp.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C stat cmp

N=${1-10000}

pl " Input data file f1 f2:"
specimen -3 -n f1 f2 | cut -c1-78

pl " Results, time for $N stat calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  s1=$(stat -c%s f1)
  s2=$(stat -c%s f2)
  if [ "$s1" != "$s2" ]
  then 
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

pl " Results, time for $N cmp calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  if ! cmp f1 f2
  then
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

exit 0

Code:
./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
stat (GNU coreutils) 6.10
cmp (GNU diffutils) 2.8.1

-----
 Input data file f1 f2:
Edges: 3:0:3 of 17777 lines in file "f1"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

Edges: 3:0:3 of 17777 lines in file "f2"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

-----
 Results, time for 10000 stat calls:

real	0m39.595s
user	0m10.397s
sys	0m27.694s

-----
 Results, time for 10000 cmp calls:

real	0m55.188s
user	0m27.122s
sys	0m25.958s

So perhaps MS systems require a lot more work to get the size.

For the case of perl, that same amount of work for stat can be done in under 0.1 seconds real time:
Code:
#!/usr/bin/env perl

# @(#) p1	Demonstrate stat on open (and un-opened) files.

use strict;
use warnings;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $f3 );
my ( $s1, $s2, $i, $j, $N );

# Make sure files are around, then close them.
open( $f1, "<", "f1" ) || die " Cannot open file f1\n";
open( $f2, "<", "f2" ) || die " Cannot open file f2\n";
open( $f3, ">", "f3" ) || die " Cannot open file f3 for write\n";
$s1 = ( stat("f1") )[7];
$s2 = ( stat("f2") )[7];
print " Length of f1, f2: $s1, $s2\n";
close $f1;
close $f2;

$j = 0;
$N = 10;
$N = 10000;
for ( $i = 1; $i <= $N; $i++ ) {
  $s1 = rand();
  $s2 = rand();
  $s1 = ( stat("f1") )[7];
  $s2 = ( stat("f2") )[7];
  if ( $s1 != $s2 ) {
    print $f3 " Found mismatch at iteration $i\n";
    $j++;
  }
  print " Length of f1, f2: $s1, $s2\n" if $debug;
}
print STDERR " Called stat $i (-1) times on each file, compared sizes.\n";
if ( $j != 0 ) {
  print STDERR " File f3 was written to $j times.\n";
}

exit(0);

producing:
Code:
time ./p1
 Length of f1, f2: 1205404, 1205404
 Called stat 10001 (-1) times on each file, compared sizes.

real	0m0.091s
user	0m0.036s
sys	0m0.040s

However, as I mentioned, I don't know about MS systems. It does seem odd that obtaining the length of a file (in *nix, just a matter of pulling the length from the inode) would cost anywhere near as much as reading every byte of two files and comparing them (and in your case the difference was on the wrong side, it seems to me).

Best wishes ... cheers, drl
# 14  
Old 03-19-2013
I'm really surprised that cmp is so close a runner-up, as it needs to read and compare every single byte of both files.
One idea to speed things up might be to obtain the sizes for a whole batch of files at once, e.g. for an entire directory, so a new process is not created for every single file...
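One way to do that with GNU find (which cygwin also ships, if I recall correctly) is to let find print the relative path and size of every file in one process per tree, then compare the two lists; a minimal sketch, assuming the tree roots are in $TREE1 and $TREE2 as in the earlier script:
Code:
# %P = path relative to the starting point, %s = size in bytes
find "$TREE1" -type f -printf '%P\t%s\n' | sort > sizes1
find "$TREE2" -type f -printf '%P\t%s\n' | sort > sizes2

# any output indicates a size mismatch or a file missing on one side;
# only those pairs would then need a full cmp
diff sizes1 sizes2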