Shell Programming and Scripting
Posted by Bashingaway on 07-07-2011, 08:15 PM
Fastest way to delete duplicates from a large filelist.....

OK

I have two filelists......

The first is formatted like this....

/path/to/the/actual/file/location/filename.jpg

and has up to a million records

The second list shows filename.jpg where there is more than one instance,

and has maybe up to 65,000 records

I want to copy files only (i.e. not retaining the full path) from the first filelist as long as that filename does not appear in the second list.
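
To make that concrete (these names are made up purely for illustration):

Code:
# /FileWithPath.txt                    # /DuplicateFiles.txt
/archive/2010/holiday/img_0001.jpg     img_0002.jpg
/archive/2011/work/img_0002.jpg
/archive/2011/work/img_0003.jpg

# img_0002.jpg is in the duplicate list, so only img_0001.jpg and
# img_0003.jpg should end up being copied (without their paths).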

At the moment I have a script that roughly does this....

Code:
# Pull both lists into shell variables (this is the slow, word-splitting
# approach I currently have).
FULLPATHFILENAME=`cat /FileWithPath.txt`
DUPLICATESLIST=`cat /DuplicateFiles.txt`

for REMOVEDUP in $FULLPATHFILENAME ; do

  ISITADUP=0

  for THISDUP in $DUPLICATESLIST ; do

         # Does this record's filename match the current duplicate name?
         if echo "$REMOVEDUP" | grep -q "$THISDUP" ; then
                ISITADUP=1
                break
         fi

  done

  # Only keep the record if its filename never appeared in the duplicates list
  if [ $ISITADUP -eq 0 ] ; then
         echo "$REMOVEDUP" >> /ListWithoutDups.txt
  fi

done

As you can see, the script pulls up each record from the "with the path" filelist and greps it against the duplicate list; only if the filename never matches does it write that record, path and all, to /ListWithoutDups.txt. In the actual script it also does some copies and other actions on the file.

This is a pretty inefficient way of doing it IMHO, as it has to pull in each record individually and then check it against the whole duplicates list (and that could mean 1,000,000 records * 65,000 duplicate checks).
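
The only alternative I've come up with myself is to load the 65,000 duplicate names into awk once and then make a single pass over the big list, something like this (assuming the duplicate list really does hold one bare filename per line), but I'm not sure it's the right way to go:

Code:
awk 'NR==FNR { dup[$0]; next }          # pass 1: remember every duplicate filename
     {
       n = split($0, part, "/")         # pass 2: split the path on "/"
       if (!(part[n] in dup)) print     # keep it only if the basename is not a duplicate
     }' /DuplicateFiles.txt /FileWithPath.txt > /ListWithoutDups.txt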

Can anyone suggest a better/more efficient way to code this to achieve the same result?
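
(If it matters, the copy step afterwards is nothing fancier than the below; /target/dir is just standing in for the real destination.)

Code:
# copy each surviving file, dropping its directory part
while IFS= read -r f ; do
        cp "$f" /target/dir/
done < /ListWithoutDups.txt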

Thanks

 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

fastest way to remove duplicates.

I have searched the FAQ - using sort, duplicates, etc. - but I didn't get any articles or results on it. Currently, I am using: sort -u file1 > file2 to remove duplicates. For a file size of approx. 1 gigabyte, the time taken to remove duplicates is 1hr 21 mins. Is there any other faster way... (15 Replies)
Discussion started by: radhika
15 Replies

2. UNIX for Dummies Questions & Answers

Fastest way to traverse through large directories

Hi! I have thousands of sub-directories, and hundreds of thousands of files in them. What is the fastest way to find out which files are older than a certain date? Is the "find" command the fastest? Or is there some other way? Right now I have a C script that traverses through and checks... (5 Replies)
Discussion started by: sreedharange
5 Replies

3. Shell Programming and Scripting

how to delete/remove directory in fastest way

Hello, I need help to remove a directory. The directory is not empty; it contains several sub-directories and files. The total number of files in one directory is 12,24,446. rm -rf doesn't work: it is prompting for every file. I want to delete without prompting and... (6 Replies)
Discussion started by: getdpg
6 Replies

4. Shell Programming and Scripting

An interactive way to delete duplicates

1) I am trying to write a script that works interactively: it lists duplicated records on a certain field/column, asks the user to delete one or more, and finally deletes all the records the user has asked for. I have an idea to store those line numbers in an array, not sure how to do this in... (3 Replies)
Discussion started by: chvs2000
3 Replies

5. Shell Programming and Scripting

how can I delete duplicates in the log?

I have a log file and I am trying to run a script against it to search for key issues such as invalid users, errors, etc. In one part, I grep for "session closed" and get a lot of the same thing, i.e. root username etc. I want to remove the multiple root lines and just have it do a count, like wc -l ... (5 Replies)
Discussion started by: taekwondo
5 Replies

6. Shell Programming and Scripting

Fastest way to delete line

I have a 5 GB text file (log/debug). I want to delete all lines containing 'TRACE'. Command used: sed -i '/TRACE/d' mylog.txt Is there any other faster way to do this? (1 Reply)
Discussion started by: johnbach
1 Replies

7. Shell Programming and Scripting

Delete duplicates via script?

Hello, I have the following problem: there are two folders with a lot of files. Example: FolderA contains AAA, BBB, CCC; FolderB contains DDD, EEE, AAA. How can I, via script, identify AAA as a duplicate in FolderB and delete it there, so that only DDD and EEE remain in FolderB? Thank you... (16 Replies)
Discussion started by: Y-T
16 Replies

8. Shell Programming and Scripting

Delete duplicates in CA bundle

I have a big CA bundle certificate file, and each time I get a request to add a new certificate to the existing bundle I need to make sure it is not already present. How can I validate the duplicates? The alignment of the certificates within the bundle seems to be different. Example: Cert 1... (7 Replies)
Discussion started by: diva_thilak
7 Replies

9. Shell Programming and Scripting

Delete only if duplicates found in each record

Hi, I have another problem. I have been trying to solve it by myself but failed. inputfile ;; ID T08578 NAME T08578 SBASE 30696 EBASE 32083 TYPE P func just test func chronology func cholesterol func null INT 30765-37333 INT 37154-37318 Link 5546 Link 8142 (4 Replies)
Discussion started by: redse171
4 Replies

10. Shell Programming and Scripting

To Delete the duplicates using Part of File Name

I am using the below script to delete duplicate files, but it is not working for directories with more than 10k files: "Argument list too long" is what I get for ls -t. Tried to replace ls -t with find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg | sed -r 's/* //' | awk... (8 Replies)
Discussion started by: gold2k8
8 Replies