Help in modifying existing Perl Script to produce report of dupes


 
# 1  
Old 04-25-2012

Hello,
I have a large amount of data with the following structure:
Word=Transliterated word
I have written a Perl script (reproduced below) which goes through the full file and identifies all dupes, i.e. cases where the same word on the left-hand side appears with more than one transliteration. It successfully creates a new file with two headers: Singletons and Dupes.
I would like to modify the script to additionally produce a report listing the frequency count of each dupe. Thus, in the sample provided, I would like to know how many times the dupe Albert has been transliterated in different ways. I am providing pseudo-data since the original data is in a foreign script.
Quote:
Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey
The script should give me a report, in a separate output file, with the following structure:
Quote:
Albert,3,albt,albut,albat
Mary,4,mari,meri,merry,marey
The final output would thus consist of two files:
1. The output file listing Singletons and Dupes
2. The report listing the dupes along with their frequency
I am not very good at generating reports in Perl, hence this request.
The Perl script follows.
Many thanks for the excellent help and advice given.
Code:
#!/usr/bin/perl

$dupes = $singletons = "";              # accumulators for the two sections of the output

do {
    $dupefound = 0;                     # reset for each block of lines
    $text = $line = $prevline = $name = $prevname = "";
    do {
        $line = <>;
        $line =~ /^(.+)\=.+$/ and $name = $1;               # headword of the current line
        $prevline =~ /^(.+)\=.+$/ and $prevname = $1;       # headword of the previous line
        if ($name eq $prevname) { $dupefound += 1 }
        $text .= $line;
        $prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }    # peel the run of dupes off the block
    $singletons .= $text;
} until eof;
print "SINGLETONS\n$singletons\nDUPES\n$dupes";


# 2  
Old 04-26-2012
Code:
[user@cygwin ~]$ cat input.txt
Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey
[user@cygwin ~]$
[user@cygwin ~]$ perl -F= -ane 'BEGIN {open O, "> output.txt"}
chomp $F[1]; $x{$F[0]} .= "$F[1],"; $y{$F[0]}++;
END {
    for (sort keys %x) {
        $x{$_} =~ s/,$//;
        print O "$_,$y{$_},$x{$_}\n";
    }
    close O;
}' input.txt
[user@cygwin ~]$
[user@cygwin ~]$ cat output.txt
Albert,3,albt,albut,albat
Mary,4,mari,meri,merry,marey
[user@cygwin ~]$
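Briefly: -F= makes -a split each line on "=" into @F, -n wraps the code in a read loop over the input file, and -e supplies the program. %x collects the comma-separated transliterations per headword, %y counts them, and the END block writes one "word,count,variants" line per headword to output.txt.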

# 3  
Old 04-26-2012
Hi,
Many thanks.
Unfortunately I work under Windows, and the "cat" command does not work under this OS.
I copied the relevant snippet of the code and tried it, but it would not work.
Many thanks for the help all the same.
# 4  
Old 04-26-2012
Code:
#! C:\Perl\bin\perl.exe
use strict;
use warnings;

my (@F, %x, %y);

# Read the word=transliteration pairs and group them by headword.
open I, "< input.txt" or die "Cannot open input.txt: $!";
for (<I>) {
    chomp;
    @F = split /=/;
    $x{$F[0]} .= "$F[1],";     # collect the transliterations, comma-separated
    $y{$F[0]}++;               # count how many times the headword occurs
}
close I;

# Write the report: headword,count,variant1,variant2,...
open O, "> output.txt" or die "Cannot open output.txt: $!";
for (sort keys %x) {
    $x{$_} =~ s/,$//;          # drop the trailing comma
    print O "$_,$y{$_},$x{$_}\n";
}
close O;
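
If you would rather stay with the one-liner, the same program should also run from cmd.exe if you put it in double quotes and use qq() in place of the inner double quotes. I have not tested this on Windows, so treat it as a sketch:
Code:
perl -F= -ane "BEGIN {open O, '> output.txt'} chomp $F[1]; $x{$F[0]} .= qq($F[1],); $y{$F[0]}++; END {for (sort keys %x) {$x{$_} =~ s/,$//; print O qq($_,$y{$_},$x{$_}\n)} close O}" input.txt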

# 5  
Old 04-26-2012
How about this?
Code:
#!/usr/bin/perl
use strict;
use warnings;

my %hashWord;

# Group the transliterations into a hash of arrays keyed on the headword.
while (<DATA>) {
        chomp;
        my ($word, $meaning) = split /=/;
        push @{ $hashWord{$word} }, $meaning;
}

# Print one line per headword: word,count,variant1,variant2,...
foreach my $KeyWord (sort keys %hashWord) {
        printf "%s,%d", $KeyWord, scalar @{ $hashWord{$KeyWord} };
        foreach (@{ $hashWord{$KeyWord} }) {
                printf ",%s", $_;
        }
        print "\n";
}

__DATA__
Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey
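
The idea is a hash of array references keyed on the headword: each transliteration is pushed onto the headword's array, so the frequency is simply the length of that array and no separate counter is needed.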

# 6  
Old 04-26-2012
Many thanks. The script runs like a charm, sorting the entries and identifying the dupes along with their counts.