Help in modifying a PERL script to sort Singletons and Duplicates


 
# 1  
Old 07-03-2015

I have a large database which has the following structure
Code:
a=b
where a is a word in one language, b is its gloss in the other language, and = is the delimiter

Since the data is linguistic, homographs occur, i.e. the same word on the left-hand side can map, in two different entries, to two different glosses on the right-hand side, as in the example below.
Code:
a=b
a=c
b=p
b=q
c=d
r=s

I had written the following script, which does give results when the database is small, but when the database grows larger (around 100,000+ entries) it does not perform very well. The expected output, in a single file, would be as under:
Code:
SINGLETONS
c=d
r=s
DUPES
a=b
a=c
b=p
b=q

The Perl script I wrote is given below:
Code:
#!/usr/bin/perl

$dupes = $singletons = "";		# This goes at the head of the file

do {
    $dupefound = 0;			# These go at the head of the loop
    $text = $line = $prevline = $name = $prevname = "";
    do {
	$line = <>;
	$line =~ /^(.+)\=.+$/ and $name = $1;
	$prevline =~ /^(.+)\=.+$/ and $prevname = $1;
	if ($name eq $prevname) { $dupefound += 1 }
	$text .= $line;
	$prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
    $singletons .= $text;
} until eof;
print "SINGLETONS\n$singletons\n\DUPES\n$dupes";

I feel that the problem lies in the data not being sorted beforehand, since the script gives better (but still not perfect) results when the input is sorted.
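For what it is worth, the pre-sort I tried was on the left-hand field, with something like this (the file names here are just examples):
Code:
sort -t '=' -k1,1 data.txt > sorted.txt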
Could someone please provide a better solution? If the code is modified, could the changes be commented, since I am still learning the intricacies of Perl; that also explains the comments I added to the script.
Many thanks for your help
# 2  
Old 07-03-2015
It doesn't perform very well because you're keeping 100,000 things in one string and using regular expressions to find them. This is never going to be efficient -- you're scanning through all 100,000 things every time you add a single one.

Use a hash to organize the data into many small buckets instead of one big string of 100,000 entries, and to avoid needing regexes to find them.

Code:
#!/usr/bin/perl

use strict;
use warnings;

my %hg;         # glosses seen so far, keyed by left-hand word
my @s;          # left-hand words that turn out to be singletons

while(my $line = <STDIN>) {
        my ($word, $gloss) = split(/[=\n]/, $line);

        if(!defined($hg{$word})) { $hg{$word} = ""; }

        # Append gloss to hash{word}; because of the leading space,
        # a later split on " " yields an empty first element
        $hg{$word} = $hg{$word} . " " . $gloss;
}

print "DUPES\n";
foreach my $word (keys %hg) {
        my @l = split(/ /, $hg{$word});

        # save singletons for later ($l[0] is the empty element,
        # so exactly one gloss means $#l == 1)
        if($#l == 1) { push(@s, $word); }
        else {
                for(my $n = 1; $n <= $#l; $n++) {
                        printf("%s=%s\n", $word, $l[$n]);
                }
        }
}

print "SINGLETONS\n";

foreach my $word (@s) {
        # substr(..., 1) drops the leading space added above
        printf("%s=%s\n", $word, substr($hg{$word}, 1));
}
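Since it reads standard input, run it with the data redirected in, e.g. (the script name is just a placeholder):
Code:
perl dupes.pl < datafile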

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 07-03-2015
Use this as a starting point.

Code:
#!/usr/bin/perl

# segregate.pl

use strict;
use warnings;

# data structure as a hash
my %db;
# populate a data structure based on first field
while(my $line = <>) {
    # remove new line
    chomp $line;
    # get field one
    my ($id) = split /=/, $line;
    # add field1 => line to the structure
    push @{$db{$id}}, $line;
}

# sorted keys for singletons
my @single;
# sorted keys for duplicates
my @dup;
# iterate through db hash using sorted keys
for my $key (sort keys %db) {
    if(scalar @{$db{$key}} == 1) {
        # add to singles list if one item
        push @single, $key;
    }
    else {
        # more than one item, add to dup list
        push @dup, $key;
    }
}

# Display lists
print "SINGLETONS:\n";
# iterate only through sorted singles
for my $single (@single) {
    print @{$db{$single}}, "\n";
}

print "DUPES:\n";
# iterate only through sorted duplicates
for my $dup (@dup) {
    print join "\n", @{$db{$dup}};
    print "\n";
}

Usage:
perl segregate.pl datafile
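With the sample data from the first post, the output should look like this (keys are printed in sorted order):
Code:
SINGLETONS:
c=d
r=s
DUPES:
a=b
a=c
b=p
b=q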
This User Gave Thanks to Aia For This Post:
# 4  
Old 07-03-2015
Many thanks for your help and, above all, for taking the pains to comment the code. Walking through it helped me understand what was wrong with my approach. The script worked, was fast, and neatly segregated the two types.

---------- Post updated at 07:23 PM ---------- Previous update was at 07:18 PM ----------

Many thanks. I walked through the code and your comments, and I now understand why my script kept giving so few results. Most importantly, I learnt how to hash data in Perl. Aia's script was a valuable add-on. Thanks to the forum members for their help and, above all, for the pains they take to correct and comment on code written by people still struggling with it.