Simple directory tree diff script


 
# 8  
Old 03-17-2013
If you want to determine if there are any differences, the only way is to read every byte of every file and compare it to its counterpart. Otherwise, even if file sizes match, there could still be a discrepancy. cmp may be useful for that task.
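For example, a minimal sketch of a byte-for-byte check on one pair of files (the directory and file names here are placeholders); with -s, cmp stays silent and just sets its exit status:
Code:
# the exit status of cmp -s is 0 only when the two files are byte-for-byte identical
if cmp -s "$dir1/$name" "$dir2/$name"; then
    echo "identical: $name"
else
    echo "DIFFER:    $name"
fi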

du (and df) measure the amount of storage allocated for files. They do not report file sizes. Two identical files may consume different amount of storage on different partitions/filesystems. One factor that may affect the storage allocated to a file is the block size of the file system. Another factor is sparseness.

A very large sparse file may occupy very little space on disk even though ls and stat report a large file size. But, if that file is copied to a filesystem that does not support sparse files, or using a tool that doesn't support sparse files, the disk space consumed will balloon to match the file's size as reported by ls/stat.
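The effect is easy to demonstrate, assuming GNU coreutils (truncate, stat, du, cp); the file names are just examples:
Code:
truncate -s 1G sparse.bin                  # 1 GiB apparent size, no data blocks written
stat -c '%s bytes apparent, %b blocks allocated' sparse.bin
du -h sparse.bin                           # only a few KB of real disk usage
cp --sparse=never sparse.bin full.bin      # force the copy to allocate every block
du -h full.bin                             # now roughly 1 GiB on disk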

Regards,
Alister
# 9  
Old 03-18-2013
I suspect that there is simply an issue with the reported size, since all of the file names match. Both partitions are windows ntfs, and the copy is smaller than the source. The copy was made using cp -Rfp under cygwin, so it may be that cp stored the copies more efficiently than they were stored in the original versions.

If I were to use the two sorted find files as a starting list for cmp, how would I differentiate the entries in the sorted list that are directories from those that are files? Since cmp is for files, will it just throw an error if what you pass to it is a directory?
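One way to tell them apart would be to test each entry before handing it to cmp; a minimal sketch, using placeholder variables $left and $right for the two full paths:
Code:
# skip anything that is not a regular file on both sides
if [ -f "$left" ] && [ -f "$right" ]; then
    cmp "$left" "$right"
fi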

---------- Post updated 03-18-13 at 12:31 AM ---------- Previous update was 03-17-13 at 09:16 PM ----------

I have added the following to the end of my script to check each file pair with cmp. If I get this working, I will add logic to run this part based on an argument.
Code:
# further process sorted find list by checking each file pair with cmp
while read input
do
#  remove leading . from each line in find output
   TEMP=$(echo $input | sed 's/^.//g')
#  escape spaces
   LOCALFILE=$(echo $TEMP | sed 's/ /\\ /g')

   echo $TREE1$LOCALFILE
   echo $TREE2$LOCALFILE

   cmp $TREE1$LOCALFILE  $TREE2$LOCALFILE > $TMPDIR'/byte_compare.txt'

done < $TMPDIR'/check_1_sorted'

This uses one of the sorted find lists to identify each file. If the entry from find is a directory, it seems as if cmp just prints a notification to stderr and moves on. The problem I am having now is that cmp won't accept what I have done above to escape the spaces in file names. I have echoed the path for each file, and it appears correct, but I am getting an error from cmp,

/cygdrive/e/nlite/Presets/Last\ Session_u.ini
/cygdrive/i/nlite/Presets/Last\ Session_u.ini
cmp: invalid --ignore-initial value `/cygdrive/i/nlite/Presets/Last\'

cmp doesn't seem to be seeing anything past the escape. Am I not escaping this properly? If I don't escape the space, I get a similar error indicating that the space is breaking the input.

LMHmedchem
# 10  
Old 03-18-2013
Keep in mind that disk space is always allocated in clusters, usually 4k, so a one-byte file still uses 4k of disk space. This would still mean, though, that (not quite) a million files waste up to 4k of disk space each.
Are the two disks using the same file system type? There are filesystems out there that make intelligent use of inode list space to store incomplete clusters, while others store those incomplete clusters on disk and "waste" the empty part of the cluster.
And, finally, the two disks have different sizes. Not by much, but it may be enough to create a difference in the overhead of the filesystems' managing structures.
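The cluster rounding is easy to see with GNU coreutils (the file name is just an example):
Code:
printf 'x' > tiny.txt                                 # a 1-byte file
stat -c '%s bytes, %b blocks of %B bytes' tiny.txt    # apparent size vs. allocation
du -k tiny.txt                                        # typically reports 4 (one 4 KB cluster)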

# 11  
Old 03-18-2013
Hi.

Observations:

1) I agree with jim mcnamara about "millions of files" (especially if they are in just a few directories -- *nix filesystems can handle that, but at a cost of several indirect lookups), and with RudiC that allocation sizes may be involved.

2) If I were doing this, I would first compare the lengths of the files. The size is easy and fast to obtain with stat, whether from the stat command, perl, C, etc. Then only if a file pair had different sizes would I investigate further.

3) There is some information about backslash-escaping in cygwin at bash - Cygwin: using a path variable containing a windows path (with a space in it) - Stack Overflow -- I attribute that to the use of "\" as a path separator in MS systems -- basically, the advice seems to be to use quotes (a brief illustration is sketched after this list).

4) My vague recollection is that directory sizes are never decreased even when significant numbers of files and sub-directories are removed, at least in *nix. I have no idea if that concept holds in MS systems.

5) This problem may be in a gray area between *nix-like systems and MS systems. About the latter I know very little.
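To illustrate the quoting point in 3), a minimal sketch, assuming bash; the file name is just an example:
Code:
f='Last Session_u.ini'
printf '<%s>\n' $f      # unquoted: word splitting yields <Last> and <Session_u.ini>
printf '<%s>\n' "$f"    # double-quoted: one word, <Last Session_u.ini>

g='Last\ Session_u.ini'
printf '<%s>\n' $g      # a backslash stored in the variable stays literal: <Last\> <Session_u.ini>

That matches the error cmp reported: the path split into several arguments, and cmp tried to interpret the extra operand as a skip count.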

Best wishes ... cheers, drl
# 12  
Old 03-18-2013
My file system started with a single primary data directory. This was primarily for the purposes of data backup, since it simplified the rsync setup. At this point, I have 4 primary data directories. The vast majority of the files on this drive are chemical structures in electronic format (mol files and SMILES strings). Mission critical data, such as src code, exists on DVD and even in hard copy printout at other locations. Electronic structure data defies some such permanent storage solutions, since it is of little or no value when printed on paper. Archiving such data means moving it to another hard drive, or possibly a DVD. I am skeptical about optical storage, since I have a case of CDs downstairs that I purchased with our silk-screened logo on them. That was a few years ago, but they are already unreadable by any software that I can find. When I put one of them into an optical drive, I get a message that the disk is not in a readable format, so I cannot write to them. I think it is understandable that this kind of thing makes me hesitant about storing data on such a medium.

The solution I have taken to is to have every important file on at least 4 hard drives, over at least two locations. This means two internal hard drives, synced with rsync, and two external hard drives (one off site). I replace hard drives every two years and have had good luck with this solution up to this point. On larger drives, some of my newer setups have two partitions with a smaller "working" partition at the outer edge of the drive and a larger "archive" partition for the rest.

This system does not keep things up to date in real time (like a raid1 would), but raid has its issues as well. I have lost many more files through my own stupidity of accidentally deleting things than I ever have through hardware or software failures. Not even a raid array can protect you from being a moron from time to time, oh that it could...

I can certainly spread my data over more directories at higher levels, or even add more partitions. All of these partitions are ntfs if that matters. I am actually getting ready to rebuild this rig, so now would be a good time to make changes. I don't often do searches from higher up directories, since there are individual project folders.

I can move things around to whatever extent would be helpful, but there are still millions of files that need to be kept somewhere (several somewheres for backup). I can dump many of them onto external drives and put them in the firesafe, but I don't know if they would be any better preserved there than in an archive partition. How long can you leave a hard drive sitting in the closet and still expect it to fire up? I guess there might not be much data on that at this point, since 1TB drives are only a few years old.

As far as my current script goes, changing to double quotes seems to work,

cmp "$TREE1$LOCALFILE" "$TREE2$LOCALFILE" > $TMPDIR'/byte_compare.txt'

I know I tried this with quotes, but it must have been single quotes. I removed the code to escape spaces. I also tried stat by doing,

Code:
SIZE1=$(stat -c%s "$TREE1$LOCALFILE")
SIZE2=$(stat -c%s "$TREE2$LOCALFILE")

if [ "$SIZE1" != "$SIZE2" ]; then
   echo "$TREE1$LOCALFILE" >> $TMPDIR'/size_compare.txt'
fi

This takes about 20% longer than doing cmp, so unless I have set this up incorrectly, there doesn't seem to be a performance advantage, especially if you are going to run cmp anyway when you find files of different size (I am using size in bytes synonymously with length, so let me know if that is not correct).

Thanks for all of the help so far.

LMHmedchem
# 13  
Old 03-18-2013
Hi.

I may be comparing apples(MS systems) to oranges (*nix systems), but here is a timing comparison of stat and cmp on a GNU/Linux box, with 2 identical files:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate compare timings for stat and cmp.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C stat cmp

N=${1-10000}

pl " Input data file f1 f2:"
specimen -3 -n f1 f2 | cut -c1-78

pl " Results, time for $N stat calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  s1=$(stat -c%s f1)
  s2=$(stat -c%s f2)
  if [ "$s1" != "$s2" ]
  then 
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

pl " Results, time for $N cmp calls:"
rm -f f3
time for ((i=1;i<=$N;i++))
do
  if ! cmp f1 f2
  then
    pe "f1" >> f3
  fi
done
if [ -e f3 ]
then
  pe " Lines in f3: $(wc -l <f3)"
fi

exit 0

Code:
./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
stat (GNU coreutils) 6.10
cmp (GNU diffutils) 2.8.1

-----
 Input data file f1 f2:
Edges: 3:0:3 of 17777 lines in file "f1"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

Edges: 3:0:3 of 17777 lines in file "f2"
     1	Preliminary Matter.  
     2	
     3	This text of Melville's Moby-Dick is based on the Hendricks House editi
   ---
 17775	THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWK
 17776	D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AN
 17777	KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RE

-----
 Results, time for 10000 stat calls:

real	0m39.595s
user	0m10.397s
sys	0m27.694s

-----
 Results, time for 10000 cmp calls:

real	0m55.188s
user	0m27.122s
sys	0m25.958s

So perhaps MS systems require a lot more work to get the size.

For the case of perl, that same amount of work for stat can be done in under 0.1 seconds real time:
Code:
#!/usr/bin/env perl

# @(#) p1	Demonstrate stat on open (and un-opened) files.

use strict;
use warnings;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $f3 );
my ( $s1, $s2, $i, $j, $N );

# Make sure files are around, then close them.
open( $f1, "<", "f1" ) || die " Cannot open file f1\n";
open( $f2, "<", "f2" ) || die " Cannot open file f2\n";
open( $f3, ">", "f3" ) || die " Cannot open file f3 for write\n";
$s1 = ( stat("f1") )[7];
$s2 = ( stat("f2") )[7];
print " Length of f1, f2: $s1, $s2\n";
close $f1;
close $f2;

$j = 0;
$N = 10;
$N = 10000;
for ( $i = 1; $i <= $N; $i++ ) {
  $s1 = rand();
  $s2 = rand();
  $s1 = ( stat("f1") )[7];
  $s2 = ( stat("f2") )[7];
  if ( $s1 != $s2 ) {
    print $f3 " Found mismatch at iteration $i\n";
    $j++;
  }
  print " Length of f1, f2: $s1, $s2\n" if $debug;
}
print STDERR " Called stat $i (-1) times on each file, compared sizes.\n";
if ( $j != 0 ) {
  print STDERR " File f3 was written to $j times.\n";
}

exit(0);

producing:
Code:
time ./p1
 Length of f1, f2: 1205404, 1205404
 Called stat 10001 (-1) times on each file, compared sizes.

real	0m0.091s
user	0m0.036s
sys	0m0.040s

However, as I mentioned, I don't know about MS systems. It does seem odd that obtaining the length of a file (in *nix, just a matter of pulling the length from the inode) would cost anywhere near as much as reading every byte of two files and comparing them (and in your case the difference was on the wrong side, it seems to me).

Best wishes ... cheers, drl
# 14  
Old 03-19-2013
I'm really surprised that cmp is so close a runner-up, as it needs to read and compare every single byte of both files.
One idea to speed things up might be to obtain the sizes for a whole batch of files at once, e.g. for an entire directory, so a new process is not created for every single file...
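One way to do that with GNU find (which cygwin also ships, if I recall correctly) is to let find print the relative path and size of every file in one process per tree, then compare the two lists; a minimal sketch, assuming the tree roots are in $TREE1 and $TREE2 as in the earlier script:
Code:
# %P = path relative to the starting point, %s = size in bytes
find "$TREE1" -type f -printf '%P\t%s\n' | sort > sizes1
find "$TREE2" -type f -printf '%P\t%s\n' | sort > sizes2

# any output indicates a size mismatch or a file missing on one side;
# only those pairs would then need a full cmp
diff sizes1 sizes2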