Over the years I've created a bit of a mess in my directories with duplicate files. I've used fdupes to remove exact duplicates, but there are still files that are almost identical, which fdupes doesn't look for.
These have the same (or very similar) filenames, so I have tried to create a script to look for them and list them like fdupes does (sets of duplicates separated by a blank line). What I have so far is this very inelegant script.
I'm sure there is a better way of doing this. Would the script even work, given that I'm trying to change the file inside the loop that's reading it? My main concern is the efficiency of the algorithm. I tried to skip duplicates that are already accounted for by removing them from the list as the script progresses through it, but I have a feeling this actually makes it less efficient because of the added file operations. Any ideas on how best to approach this problem?
I'm not familiar with Perl, so could you please elaborate on what that Perl script does? It looks like it compares two directories looking for duplicate files rather than duplicate filenames; is that correct?
I have now created two scripts that try to find duplicate filenames, but they are so slow that I really need to optimise the algorithm.
In all the methods I first create a complete file list of the directory with full paths. My only problem is how time-consuming the scripts are. All the methods work, but which is the most time-efficient for long lists?
Method 1
Go through the path list one entry at a time, looking for matching filenames further down the list.
Paths with matching filenames are removed from the list so that the next filename has fewer entries to compare against.
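A rough sketch of what I mean by Method 1 (simplified, not the actual script further down; paths.txt is just an example input file with one full path per line):

#!/bin/bash
# Walk the working list one entry at a time; collect every path that shares
# the first entry's basename, print the set (blank-line separated) if it has
# more than one member, then drop those paths so they are not compared again.
cp paths.txt work.txt
while [ -s work.txt ]; do
    first=$(head -n 1 work.txt)
    name=$(basename "$first")
    matches=$(awk -F/ -v n="$name" '$NF == n' work.txt)
    awk -F/ -v n="$name" '$NF != n' work.txt > rest.txt
    mv rest.txt work.txt
    if [ "$(printf '%s\n' "$matches" | wc -l)" -gt 1 ]; then
        printf '%s\n\n' "$matches"
    fi
done
rm -f work.txt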
Method 2
Create another list in addition to the path list: a list of duplicate filenames built with uniq (list 2). Filter the path list with grep, using these duplicate filenames (list 2), to get a smaller path list (list 1).
Go through each duplicate filename (in list 2), looking for the matching paths in the path list (list 1).
Remove the matching paths so that the next duplicate filename has fewer entries to compare against.
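And a rough sketch of Method 2 (again simplified, not the actual script; paths.txt is the same example input):

#!/bin/bash
# List 2: basenames that occur more than once in the path list.
awk -F/ '{ print $NF }' paths.txt | sort | uniq -d > dupnames.txt
# List 1: the path list roughly filtered down to those duplicate basenames.
grep -F -f dupnames.txt paths.txt > dup_paths.txt
# For each duplicate basename, print its matching paths as one set,
# separated from the next set by a blank line.
while IFS= read -r name; do
    awk -F/ -v n="$name" '$NF == n' dup_paths.txt
    echo
    # drop the paths just reported so the next name has fewer entries to scan
    awk -F/ -v n="$name" '$NF != n' dup_paths.txt > tmp.txt && mv tmp.txt dup_paths.txt
done < dupnames.txt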
The questions are: 1) Is the extra file operation needed to remove previously matched paths worth it?
2) Which algorithm is better in terms of speed: method 1, method 2, or some other approach?
3) I'd like to add a progress bar, but I do not want it on stdout since that would interfere with the actual output of duplicates. How do I do this? Should I use stderr?
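For question 3, I'm guessing something along these lines would keep the progress output off stdout (just a guess on my part, not sure if it's the usual way):

total=$(wc -l < paths.txt)
count=0
while IFS= read -r path; do
    count=$((count + 1))
    # progress goes to stderr, so stdout still only carries the duplicate sets
    printf 'Processed %d of %d\r' "$count" "$total" >&2
    # ... duplicate handling writes to stdout here ...
done < paths.txt
printf '\n' >&2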
The scripts
The scripts for both methods are below, and they both work, but directories with many files (I tested with 25,000) take considerable time. I'd really like to speed the scripts up.
If you want to test either one, you can create a simple test text file with example paths to duplicate files, then use
./scriptname.sh -f List_of_file_paths.txt
If you want to actually look for duplicate filenames in a directory, just run the script and it will look for duplicates in the current working directory; for another directory, use:
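For example, a test input file could look something like this (these paths are just made-up examples, one full path per line):

/home/user/docs/report.txt
/home/user/backup/report.txt
/home/user/music/song.mp3
/home/user/old/docs/report.txt
/home/user/notes/todo.txt
/home/user/backup/todo.txt

With that input, report.txt and todo.txt should each come out as a set of duplicates, and song.mp3 should not be listed at all.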
Thanks for creating that, thegeek, but isn't that a content comparison? Can I ask how this differs from fdupes? The thing with fdupes is that it does a byte-for-byte content comparison. I used it to remove duplicate files (i.e. files that are exactly the same). However, files that differed only slightly it would not list as "duplicates", and rightly so. For example, my filing system is in such a mess that I have multiple versions of the same file in different directories, where I might have added something to the newer one. The files are probably 90% the same, but they were not exact duplicates, so fdupes did not list them. I do not know of any tools (or how) to list files that are almost the same. Can this be done in finddup? If so, that would be great.
This is why I'm comparing filenames instead, since I assume I probably didn't rename the files.
I've now solved the efficiency problem too, if anybody is interested. The extra file operations were not worth it, and the "grep -f" line was extremely taxing. So I moved the grep into the loop and avoided the extra iterations of the loop as well. The script before took hours to go through 25,000 files; this one takes less than 5 minutes. Forgive the unnecessary use of cat; file redirection gave me some trouble for some reason.
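In outline, the faster version does something like this (a simplified sketch of the idea, not the full script):

#!/bin/bash
# Build the list of duplicate basenames once, then look up each name's paths
# inside the loop instead of pre-filtering everything with the costly "grep -f".
# The unnecessary cat is kept here only to mirror what I did in the script.
cat paths.txt | awk -F/ '{ print $NF }' | sort | uniq -d > dupnames.txt
while IFS= read -r name; do
    # the grep-style lookup, now done per name inside the loop
    awk -F/ -v n="$name" '$NF == n' paths.txt
    echo    # blank line between duplicate sets
done < dupnames.txt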
Yes, as already stated, the previous Perl solutions compare the content of the files.
Could you try this Perl code and compare its performance with your shell script?