Fastest way to delete duplicates from a large filelist.....
OK
I have two filelists......
The first is formatted like this....
/path/to/the/actual/file/location/filename.jpg
and has up to a million records
The second list shows filename.jpg where there is more than one instance
and has maybe up to 65,000 records
I want to copy files only (i.e. not retaining the full path) from the first filelist as long as that filename does not appear in the second list.
At the moment I have a script that roughly does this....
As you can see, the script pulls up a record from the "with the path" filelist and does an inverse grep to see if the filename is in the duplicate list; if it isn't, it outputs that filename with its path to /ListWithoutDups.txt. In the actual script it also does some copies and other actions on the file.
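The script itself didn't survive in this copy of the thread; a minimal reconstruction of the loop described above might look like this (the file names filewithpath.txt and dups.txt are assumptions, and sample data stands in for the real lists):

```shell
# Sample data standing in for the real lists (names assumed).
printf '%s\n' /img/2011/one.jpg /img/2011/two.jpg /img/2012/one.jpg > filewithpath.txt
printf '%s\n' one.jpg > dups.txt

# Inefficient approach: one grep over the whole duplicate list per record.
> ListWithoutDups.txt
while IFS= read -r filepath; do
    fname=${filepath##*/}                    # strip the path, keep filename.jpg
    if ! grep -qxF "$fname" dups.txt; then   # inverse lookup in the duplicate list
        printf '%s\n' "$filepath" >> ListWithoutDups.txt
    fi
done < filewithpath.txt
```

With a million records this forks one grep per line, each scanning the whole duplicate list, which is exactly the cost the poster complains about below.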
This is a pretty inefficient way of doing it IMHO, as it has to pull in each record individually and then check to see if it's in the duplicates list (and that could mean 1 million records * 65,000 duplicate checks).
Can anyone suggest a better/more efficient way to code this to achieve the same result?
Thanks
Last edited by Bashingaway; 07-08-2011 at 02:52 AM..
First off, it is wise to avoid 'loading' variables with cat -- in your case, with a million filenames/pathnames, you are likely to exceed the amount that can be stuffed into a variable. Something like this would allow you to do the same thing without issues:
That said, you are correct your approach isn't efficient. I interpreted your requirements to be that you need a list of files from 'filewithpath.txt' that are NOT listed in the duplicate list file. If that is the case, this should work for you:
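The comm pipeline itself is missing from this copy of the thread; a sketch of the approach described (file names assumed, sample data for illustration):

```shell
printf '%s\n' /img/a/one.jpg /img/b/two.jpg /img/c/three.jpg > filewithpath.txt
printf '%s\n' two.jpg > dups.txt

# comm requires sorted input. -2 suppresses lines unique to the duplicate
# list, -3 suppresses lines common to both, leaving only names that appear
# solely in the first file.
sed 's|.*/||' filewithpath.txt | sort -u > names.sorted
sort -u dups.txt > dups.sorted
comm -23 names.sorted dups.sorted > ListWithoutDups.txt
```

Note that the sed strips the path before comparing, which is what prompts the follow-up question below.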
I have always found the options to comm to be difficult to understand and have to read the man page nearly every time I use it. In this case, comm reads both files in parallel (thus they must be sorted) and keeps the records that are unique to the first file (not listed in the second file).
Last edited by agama; 07-07-2011 at 10:54 PM..
Reason: added comments
Thanks for this but I'm not sure it'll work for me. My original example is the problem, sorry.
When you strip the pathname on the first line, doesn't that remove the location of the file that I want to perform the processes on?
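For what it's worth, one way to filter on the basename while keeping the full path is a single awk pass over both files (a sketch, not from the thread; file names assumed, sample data for illustration):

```shell
printf '%s\n' /img/a/one.jpg /img/b/two.jpg > filewithpath.txt
printf '%s\n' two.jpg > dups.txt

# First pass (NR==FNR) hashes the duplicate names; the second pass splits
# each pathed record on "/" and prints it, path intact, when its basename
# is not in the hash. One pass per file instead of a grep per record.
awk 'NR==FNR { dup[$0]; next }
     { n = split($0, a, "/"); if (!(a[n] in dup)) print }' \
    dups.txt filewithpath.txt > ListWithoutDups.txt
```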