finding duplicates with perl
# 1  
Old 01-27-2003

I have a huge file (over 30 MB) that I am processing with Perl. I am pulling out a list of filenames and placing them in an array called @reports.
I am fine up to this point. What I then want to do is go through the array and find any duplicates. If there is a duplicate, I want to print it to the screen, but only once per filename: after I have found one duplicate of a filename, I want to move on and look for a duplicate of the next filename.
Thanks!
# 2  
Old 01-27-2003
Without more specifics about your problem, I think a hash might be more appropriate than an array. Then you can keep a count of how many times each filename is called out, or a list of the callouts, or whatever. If you need to preserve the order of the filenames, store, say, the record number where each filename was first found, then sort on that record number. But a hash is a fundamental Perl idiom for detecting duplicates. It'll work in Ruby, too, BTW.
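A minimal sketch of that idiom (untested; %count and %first_rec are made-up names, and @reports is your array from the original post):

Code:
# %count tallies how many times each filename appears;
# %first_rec remembers the record number of its first appearance,
# so the original order can be recovered later by sorting on it.
my (%count, %first_rec);
my $recno = 0;
foreach my $name (@reports) {
    $recno++;
    $first_rec{$name} = $recno unless exists $first_rec{$name};
    print "$name\n" if ++$count{$name} == 2;   # report each duplicate exactly once
}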
# 3  
Old 01-27-2003
I'm not really sure what exactly that would entail. I will post what I have thought of (using an array), although it is not working.


Code:
open(DUPS, ">duplicates.txt") or die "Can't open duplicates.txt: $!";
for $i (0 .. $#reports) {
    for $j (0 .. $#reports) {
        # Compare every pair of entries (O(n^2)).  Note: this prints a
        # duplicated filename once for every occurrence, not once per name.
        if ($i != $j && $reports[$i] eq $reports[$j]) {
            print DUPS "\n $reports[$i]";
            last;
        }
    }
}
close(DUPS);
# 4  
Old 01-28-2003
Okay, if you need the list of @reports in an array for some other reason, the next best thing is a hash just to gather the duplicates. You can destroy the hash later if you need to. Warning: untested code
Code:
%h = ();
foreach $r (@reports) {
    if (!exists($h{$r})) {
        # First time we've seen this one
        $h{$r} = 0;
    } elsif ($h{$r}) {
        # We've seen this one before and already reported it
        $h{$r}++;
    } else {
        # Second time, so report the duplicate
        print DUPS "\n $r";
        $h{$r} = 1;
    }
}
%h = ();   # Destroy %h if you're done with it

Now %h contains the number of "extra" copies of each member of @reports, i.e., one less than the number that's actually there. If you don't need @reports for anything else, you can embed this logic into the loop that's reading your large data file and save some memory.
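For instance, a sketch of embedding it in the read loop (untested; "bigfile.txt" is a stand-in for your real input, and the line that extracts the filename is a placeholder for whatever parsing you already do):

Code:
open(IN, "bigfile.txt") or die "Can't open bigfile.txt: $!";
open(DUPS, ">duplicates.txt") or die "Can't open duplicates.txt: $!";
my %seen;
while (<IN>) {
    chomp;
    my $name = $_;   # placeholder: pull the filename out of the record here
    print DUPS "$name\n" if ++$seen{$name} == 2;   # report on the second sighting only
}
close(IN);
close(DUPS);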

Another option is to defer the reporting of duplicates to a second loop after the one shown above, so you can report the number of times each filename is found in @reports.
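An untested sketch of that variant (assumes DUPS is already open, as in the earlier snippets):

Code:
my %count;
$count{$_}++ foreach @reports;            # first pass: tally every filename
foreach my $name (sort keys %count) {     # second pass: report the duplicates
    print DUPS "$name found $count{$name} times\n" if $count{$name} > 1;
}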