Full Discussion: List Duplicate
Post 302123979 by aigles in Shell Programming and Scripting, Thursday 28th of June 2007, 03:32:53 AM
When I read the question, I had in mind an array-based solution like that of vgersh99.
Then I wanted to see whether it was easy to do without arrays, and that is the solution I posted.
vgersh99's solution is simpler and more readable.

I wanted to compare the performance of the two solutions on a large volume of data.
To do that, I adapted both solutions to count the duplicate file names on my system.

I built a file containing the list of all files on the system (field 1: directory path, field 2: file name).
The resulting file contains approximately 64,000 duplicate file names.

Code:
# find / | sed 's!/\([^/]*\)$!/ \1!' > files.txt
# wc files.txt
  534733 1069473 34359804 files.txt
# head -10 files.txt
/ 
/ lost+found
/ home
/home/ lost+found
/home/ guest
/home/guest/ .sh_history
/home/ gseyjr
/home/gseyjr/ .profile
/home/ usertest
/home/usertest/ .profile
#
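The sed substitution above splits each path into directory and base name by inserting a space before the last `/`-delimited component. As a minimal sketch with a hypothetical path (not taken from the listing above):

```shell
# s!/\([^/]*\)$!/ \1! captures everything after the final slash
# and re-emits it preceded by "/ ", separating directory from name.
echo '/home/guest/.sh_history' | sed 's!/\([^/]*\)$!/ \1!'
# Should print: /home/guest/ .sh_history
```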

The solution with arrays :
Code:
$ cat dup1.sh
awk '
   { 
      Files[$2] = ($2 in Files) ? Files[$2] ORS $0 : $0; 
      FilesCnt[$2]++ 
   }
   END { 
      for (f in Files) {
         if (FilesCnt[f] > 1) {
            print Files[f];
            duplicates++;
         }
      }
      print "\nDuplicates : " duplicates;
   }
' files.txt
$ time dup1.sh > /dev/null
real    0m27.22s
user    0m26.74s
sys     0m0.40s
$

The solution without arrays :

The -T option of the sort command was required because there wasn't sufficient space available for work files on the current filesystem.
Code:
$ cat dup2.sh
sort -T /refiea/tmp -k2,2 files.txt |
awk '
   BEGIN { first_duplicate = 1 }
   {
     file = $2;
     if (file == prv_file) {
         if (first_duplicate) {
            print prv_rec;
            duplicates++
         }
         print $0;
         first_duplicate = 0;
     } else {
        prv_file = file;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
   END {
      print "Duplicates : " duplicates;
   }
'
$ time dup2.sh > /dev/null
real    0m39.85s
user    0m2.92s
sys     0m0.10s
$

In fact, the sort alone takes more time to run than the complete solution with arrays.
Code:
$ time sort -T /refiea/tmp -k2,2 files.txt > /dev/null
real   33.06
user   32.28
sys    0.73
$

Conclusion:

The arrays win the contest.

Awk arrays are your friends.
They are easy to use and powerful.
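As a minimal illustration of the same associative-array idea, on hypothetical data rather than the benchmark file above, a single pass can flag any file name seen more than once:

```shell
# seen[$2]++ is 0 (false) on the first occurrence of a name
# and non-zero afterwards, so the action fires only on repeats.
printf '%s\n' '/home/a/ .profile' '/home/b/ .profile' '/home/c/ notes.txt' |
awk 'seen[$2]++ { print $2 " is duplicated" }'
# Should print: .profile is duplicated
```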
 
