Fastest way to delete duplicates from a large filelist..... Post 302537331 by agama on Thursday 7th of July 2011 09:52:24 PM
First off, it is wise to avoid 'loading' variables with cat -- in your case, with a million filenames/pathnames, you are likely to exceed the amount that can be stuffed into a variable. Something like this would allow you to do the same thing without issues:

Code:
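while IFS= read -r filename     # IFS= and -r keep leading whitespace and backslashes intact
do
    printf '%s\n' "$filename"   # quoted, so names with spaces or wildcards are printed as-is
done <file-list-file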

That said, you are correct that your approach isn't efficient. I interpreted your requirements to mean that you need a list of the files from 'FileWithPath.txt' that are NOT listed in the duplicate list file. If that is the case, this should work for you:

Code:
sed 's!.*/!!' FileWithPath.txt | sort -u >/tmp/f1    # strip pathname and sort, removing any dups
sort -u DuplicateFiles.txt >/tmp/f2                  # both files must be sorted for comm; remove dups just in case
comm -23 /tmp/f1 /tmp/f2 >ListWithoutDups.txt        # keep only the names that appear in f1 but not in f2
rm /tmp/f1 /tmp/f2

I have always found the options to comm difficult to understand, and I have to read the man page nearly every time I use it. In this case, comm reads both files in parallel (which is why they must be sorted). The -2 option suppresses the lines found only in the second file and -3 suppresses the lines common to both, so only the records unique to the first file (those not listed in the second file) are kept.
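
If it helps to see the -23 behaviour on something small, here is a minimal sketch. The sample names and the demo_comm function are made up for illustration and are not taken from your files; it also assumes mktemp is available on your system.

```shell
#!/bin/sh
# Hypothetical sample data: four "all files" entries and three "duplicate" entries.
demo_comm() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' a.txt b.txt c.txt d.txt >"$dir/all"    # first list, already sorted
    printf '%s\n' b.txt d.txt e.txt >"$dir/dups"         # second list, already sorted
    # -2 drops lines found only in the second file, -3 drops lines common to both,
    # which leaves just the lines unique to the first file
    comm -23 "$dir/all" "$dir/dups"
    rm -rf "$dir"
}

demo_comm
# prints:
# a.txt
# c.txt
```

Note that e.txt, which appears only in the second list, is not printed, and neither are b.txt and d.txt, which appear in both.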

Last edited by agama; 07-07-2011 at 10:54 PM.. Reason: added comments
 
