If you want to determine if there are any differences, the only way is to read every byte of every file and compare it to its counterpart. Otherwise, even if file sizes match, there could still be a discrepancy. cmp may be useful for that task.
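For a single pair of files, a minimal sketch of such a check (the two paths are just placeholders) could look like this:

    # cmp -s compares the files byte by byte, prints nothing, and only sets
    # the exit status: 0 = identical, 1 = different, >1 = trouble (e.g. unreadable).
    if cmp -s /path/to/original /path/to/copy; then
        echo "identical"
    else
        echo "different or unreadable"
    fi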
du (and df) measure the amount of storage allocated for files. They do not report file sizes. Two identical files may consume different amounts of storage on different partitions/filesystems. One factor that may affect the storage allocated to a file is the block size of the file system. Another factor is sparseness.
A very large sparse file may occupy very little space on disk even though ls and stat report a large file size. But, if that file is copied to a filesystem that does not support sparse files, or using a tool that doesn't support sparse files, the disk space consumed will balloon to match the file's size as reported by ls/stat.
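A quick way to see the sparse-file effect, assuming GNU coreutils (truncate, du) and a filesystem that supports sparse files:

    truncate -s 100M sparse.bin          # create a 100 MiB file without writing any data
    ls -l sparse.bin                     # apparent size: 104857600 bytes
    du -h sparse.bin                     # allocated size: close to 0
    du -h --apparent-size sparse.bin     # GNU du can report the apparent size instead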
I suspect that there is simply an issue with the reported size, since all of the file names match. Both partitions are Windows NTFS, and the copy is smaller than the source. The copy was made using cp -Rfp under Cygwin, so it may be that cp stored the copies more efficiently than they were stored in the original versions.
If I was to use the two sorted find files as a starting list for cmp, how would I differentiate the entries in the sorted list that are directories from those that are files? Since cmp is for files, will it just throw an exception if what you pass to it is a directory?
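One way around that, sketched here on the assumption that the lists hold one path per line (the list name list_e.txt is just a placeholder), is either to restrict find to regular files when building the list, or to skip directory entries while reading it:

    # Option 1: only put regular files in the list in the first place.
    find /cygdrive/e/nlite -type f | sort > list_e.txt

    # Option 2: skip directories while reading an existing list.
    while IFS= read -r path; do
        [ -d "$path" ] && continue       # skip directory entries
        # ... run cmp on "$path" and its counterpart here ...
    done < list_e.txt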
---------- Post updated 03-18-13 at 12:31 AM ---------- Previous update was 03-17-13 at 09:16 PM ----------
I have added the following to the end of my script to check each file pair with cmp. If I get this working, I will add logic to run this part based on an argument.
This uses one of the sorted find lists to identify each file. If the entry from find is a directory, it seems as if cmp just prints a notification to stderr and moves on. The problem I am having now is that cmp won't accept what I have done above to escape the spaces in file names. I have echoed the path for each file, and it appears correct, but I am getting an error from cmp:
/cygdrive/e/nlite/Presets/Last\ Session_u.ini
/cygdrive/i/nlite/Presets/Last\ Session_u.ini
cmp: invalid --ignore-initial value `/cygdrive/i/nlite/Presets/Last\'
cmp doesn't seem to be seeing anything past the escape. Am I not escaping this properly? If I don't escape the space, I get a similar error indicating that the space is breaking the input.
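A sketch of the quoting approach that avoids the problem (double quotes around the expansions instead of backslash escaping; the counterpart-path construction here is only illustrative):

    src_root=/cygdrive/e/nlite
    dst_root=/cygdrive/i/nlite

    # Double quotes around "$src" and "$dst" preserve embedded spaces;
    # no backslash escaping is needed.
    find "$src_root" -type f | while IFS= read -r src; do
        dst="$dst_root${src#$src_root}"      # swap the source prefix for the destination prefix
        cmp -s "$src" "$dst" || echo "DIFFERS: $src"
    done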
Keep in mind that disk space is always allocated in clusters, usually 4k, so a one-byte file would still use 4k of disk space. That would still mean there are (not quite) a million files wasting 4k of disk space each, though.
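A quick way to see the cluster rounding on a given filesystem, assuming GNU stat and du (both available under Cygwin):

    printf 'x' > tiny.txt                                  # a one-byte file
    stat -c '%s bytes, %b blocks of %B bytes' tiny.txt     # apparent size vs. allocated blocks
    du -B1 tiny.txt                                        # allocated space in bytes (typically one full cluster)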
Are the two disks using the same file system type? There are filesystems out there that make intelligent use of inode list space to store incomplete clusters, while others don't and store those incomplete clusters on disk, "spoiling" the empty part of the cluster.
And, finally, the two disks have slightly different sizes. Not by much, but it may be enough to create a difference in the overhead of the filesystems' management structures.
Hi.
Observations:
1) I agree with jim mcnamara about "millions of files" (especially if they are in just a few directories -- *nix filesystems can handle that, but at a cost of several indirect lookups), and with RudiC that allocation sizes may be involved.
2) If I were doing this, I would first compare the lengths of the files. The size is easy and fast to obtain via stat, whether from the stat command, perl, C, etc. Then only if a file pair had different sizes would I investigate further (a sketch follows this list).
4) My vague recollection is that directory sizes are never decreased even when significant numbers of files and sub-directories are removed, at least in *nix. I have no idea if that concept holds in MS systems.
5) This problem may be in a gray area between *nix-like systems and MS systems. About the latter I know very little.
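A sketch of that size-first idea for one file pair, assuming GNU stat for the -c format (the two path arguments are placeholders):

    src="$1"; dst="$2"
    # Compare lengths first; only read the contents when the lengths match
    # and you still want to be sure the bytes are identical.
    if [ "$(stat -c %s "$src")" -ne "$(stat -c %s "$dst")" ]; then
        echo "SIZE DIFFERS: $src"
    elif ! cmp -s "$src" "$dst"; then
        echo "CONTENT DIFFERS: $src"
    fi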
My file system started with a single primary data directory. This was primarily for the purposes of data backup, since it simplified the rsync setup. At this point, I have 4 primary data directories. The vast majority of the files on this drive are chemical structures in electronic format (mol files and SMILES strings). Mission-critical data, such as source code, exists on DVD and even in hard copy printout at other locations. Electronic structure data defies some such permanent storage solutions, since it is of little or no value when printed on paper. Archiving such data means moving it to another hard drive, or possibly a DVD. I am skeptical about optical storage, since I have a case of CDs downstairs that I purchased with our silk-screened logo on them. That was a few years ago, but they are already unreadable by any software that I can find. When I put one of them into an optical drive, I get a message that the disk is not in a readable format, so I cannot write to them. I think it is understandable that this kind of thing makes me hesitant about storing data on such a medium.
The solution I have taken to is to have every important file on at least 4 hard drives, over at least two locations. This means two internal hard drives, synced with rsync, and two external hard drives (one off site). I replace hard drives every two years and have had good luck with this solution up to this point. On larger drives, some of my newer setups have two partitions with a smaller "working" partition at the outer edge of the drive and a larger "archive" partition for the rest.
This system does not keep things up to date in real time (like a RAID 1 would), but RAID has its issues as well. I have lost many more files through my own stupidity of accidentally deleting things than I ever have through hardware or software failures. Not even a RAID array can protect you from being a moron from time to time, oh that it could...
I can certainly spread my data over more directories at higher levels, or even add more partitions. All of these partitions are ntfs if that matters. I am actually getting ready to rebuild this rig, so now would be a good time to make changes. I don't often do searches from higher up directories, since there are individual project folders.
I can move things around to whatever extent would be helpful, but there are still millions of files that need to be kept somewhere (several somewheres for backup). I can dump many of them onto external drives and put them in the firesafe, but I don't know if they would be any better preserved there than in an archive partition. How long can you leave a hard drive sitting in the closet and still expect it to fire up? I guess there might not be much data on that at this point, since 1TB drives are only a few years old.
As far as my current script goes, changing to double quotes seems to work,
I know I tried this with quotes, but it must have been single quotes. I removed the code that escaped the spaces. I tried with stat by doing,
This takes about 20% longer than doing cmp, so unless I have set this up incorrectly, there doesn't seem to be a performance advantage, especially if you are going to do cmp anyway when you find files of different size (I am using size in bytes synonymously with length, so let me know if that is not correct).
Hi.
I may be comparing apples(MS systems) to oranges (*nix systems), but here is a timing comparison of stat and cmp on a GNU/Linux box, with 2 identical files:
So perhaps MS systems require a lot more work to get the size.
For the case of perl, that same amount of work for stat can be done in under 0.1 seconds real time:
producing:
However, as I mentioned, I don't know about MS systems. It does seem odd that obtaining the length of a file (in *nix, just pulling the length from the inode) and reading every byte of two files to compare them would come out so different -- and on the wrong side, it seems to me.
I'm really surprised that cmp is so close a runner-up, as it needs to read and compare every single byte of both files.
One idea to speed up things might be to run stat once for a couple of files, e.g. for an entire directory, so not creating a new process for every single file...
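A sketch of that batching idea, assuming GNU stat and GNU find (the nlite path is taken from the example above; the output file name is a placeholder):

    # One stat invocation for a whole directory's worth of files:
    stat -c '%s %n' /cygdrive/e/nlite/Presets/*

    # Or let find print the sizes itself, with no extra processes at all:
    find /cygdrive/e/nlite -type f -printf '%s %p\n' > sizes_e.txt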