Post 302537329 by Bashingaway on Thursday 7th of July 2011 08:15:29 PM
Fastest way to delete duplicates from a large filelist.....

OK

I have two filelists......

The first is formatted like this....

/path/to/the/actual/file/location/filename.jpg

and has up to a million records

The second list shows filename.jpg where there is more than one instance.

and has maybe up to 65,000 records

I want to copy files only (i.e. not retaining the full path) from the first filelist as long as that filename does not appear in the second list.

At the moment I have a script that roughly does this....

Code:
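FULLPATHFILENAME=$(cat /FileWithPath.txt)
DUPLICATESLIST=$(cat /DuplicateFiles.txt)

# Every path is compared against every duplicate name.
for REMOVEDUP in $FULLPATHFILENAME ; do

  KEEP=yes
  for THISDUP in $DUPLICATESLIST ; do

         # The inverse grep prints nothing when the path contains the duplicate name.
         ISITADUP=$(echo "$REMOVEDUP" | grep -vF -- "$THISDUP")

         if [ -z "$ISITADUP" ] ; then KEEP=no ; break ; fi

  done

  # Write the path only once, and only if no duplicate name matched it.
  if [ "$KEEP" = yes ] ; then echo "$REMOVEDUP" >> /ListWithoutDups.txt ; fi

done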

As you can see, the script pulls up a record from the "with the path" filelist and does an inverse grep to see whether the filename is in the duplicate list. If it isn't, it outputs that filename with its path to /ListWithoutDups.txt. The actual script also does some copies and other actions on the file.

This is a pretty inefficient way of doing it IMHO, as it has to pull in each record individually and then check whether it's in the duplicates list (and that could mean 1 million records × 65,000 duplicate checks, i.e. up to 65 billion grep calls).
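
One way to avoid the nested loop, offered as a rough sketch rather than a drop-in fix: load the duplicate names into an awk array once, then check each path's filename against that array with a single lookup. This assumes one entry per line in both lists and that the duplicates list holds bare filenames exactly as they appear at the end of each path. The function name filter_dups is made up for illustration.

```shell
#!/bin/sh
# filter_dups DUPLICATES_FILE PATHS_FILE   (hypothetical helper name)
# Prints every path from PATHS_FILE whose last component (the filename)
# is not listed in DUPLICATES_FILE. Assumes one entry per line.
filter_dups() {
  awk -v dups="$1" '
    BEGIN {
      # Read the duplicates list once and store each name as an array key.
      while ((getline line < dups) > 0) dup[line] = 1
      close(dups)
    }
    {
      # Strip everything up to the last "/" to get the bare filename.
      name = $0
      sub(/.*\//, "", name)
      # Print the full path only if the filename is not a known duplicate.
      if (!(name in dup)) print
    }
  ' "$2"
}
```

Called as filter_dups /DuplicateFiles.txt /FileWithPath.txt > /ListWithoutDups.txt, this reads each list once, so the work grows with the combined length of the two lists rather than with their product. Unlike the grep version it compares whole filenames, so a duplicate named filename.jpg would not also knock out myfilename.jpg.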

Can anyone suggest a better/more efficient way to code this to achieve the same result?

Thanks

Last edited by Bashingaway; 07-08-2011 at 02:52 AM..
 
