Perl: filtering lines based on duplicate values in a column


# 1  
Old 09-22-2011
Perl: filtering lines based on duplicate values in a column

Hi, I have a file like this. I need to eliminate the lines whose first-column value is repeated 10 times.
Code:
13 18 1 + chromosome 1, 122638287 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 128904080 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 14, 13627938 CAACCGCGACCATACTCT
13 18 1 + chromosome 1, 187172197 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 38407155 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 13503259 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 2, 105480832 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 49045535 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 178729626 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 55081462 CAACCGCGACCATACTCT
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT

The value 13 in the first column is repeated 10 times in consecutive lines, so all of those lines should be eliminated from the output.

So the desired output will be:
Code:
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT

Thank you very much in advance. If possible, code in Perl would be much appreciated.
# 2  
Old 09-23-2011
Perl

Hi,

Try this code:

Code:
#! /usr/local/bin/perl
open(FILE,"<File1") or die("unable to open file");
my @mContent = <FILE>;
my %mFinal = ();
foreach ( @mContent )
{
   my $mLine = $_;
   chomp ( $mLine );
   my $mField = (split(/ /,$mLine,999))[0];
   $mFinal{$mField}{"count"}=$mFinal{$mField}{"count"}+1;
   $mFinal{$mField}{"content"}=$mLine;
}
foreach my $mField ( keys %mFinal )
{
   my $mCount = $mFinal{$mField}{"count"};
   if ( $mCount != 10 )
   {
      print "$mFinal{$mField}{'content'}\n";
   }
}


Cheers,
Ranga
# 3  
Old 09-23-2011
Thanks for the reply. Even though it's not printing the repetitive values, it's also not printing all of the remaining lines.

The output was:

Code:
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC

# 4  
Old 09-23-2011
Hi polsum,

Here is another Perl solution:
Code:
$ cat File1
13 18 1 + chromosome 1, 122638287 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 128904080 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 14, 13627938 CAACCGCGACCATACTCT
13 18 1 + chromosome 1, 187172197 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 38407155 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 13503259 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome 2, 105480832 CAACCGCGACCATACTCT
13 18 1 + chromosome 9, 49045535 AGAGTATGGTCGCGGTTG
13 18 1 + chromosome 1, 178729626 AGAGTATGGTCGCGGTTG
13 18 1 - chromosome X, 55081462 CAACCGCGACCATACTCT
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT
$ cat polsum.pl
use warnings;
use strict;

@ARGV == 1 or die qq[Usage: perl $0 input-file\n];

my ($number, @block_lines, $prev, @f);

while ( <> ) {
        next if /\A\s*\z/;
        chomp;
        @f = split;

        if ( $. == 1 ) {
                ++$number;
                push @block_lines, $_;
                next;
        }


        if ( $prev == $f[0] ) {
                ++$number;
        }
        else {
                if ( $number != 10 ) {
                        printf "%s\n", join qq[\n], @block_lines;
                }
                $number = 1;
                @block_lines = ();
        }

        push @block_lines, $_;
}
continue {
        $prev = $f[0];
        if ( eof() && $number != 10 ) {
                printf "%s\n", join qq[\n], @block_lines;
        }
}
$ perl polsum.pl File1 
9 17 2 + chromosome 10, 101398385 GCCAGTTCTACAGTCCG
9 17 2 - chromosome 3, 103818009 CGGACTGTAGAACTGGC
9 17 2 - chromosome 16, 94552245 CGGACTGTAGAACTGGC
4 18 1 - chromosome 18, 70056996 TACCCAACAACACATAGT

Regards,
Birei

---------- Post updated at 02:07 ---------- Previous update was at 01:59 ----------

rangarasan's code also works for me with these changes:
Code:
#! /usr/local/bin/perl
open(FILE,"<File1") or die("unable to open file");
my @mContent = <FILE>;
my %mFinal = ();
foreach ( @mContent )
{
   my $mLine = $_;
#   chomp ( $mLine );
   my $mField = (split(/ /,$mLine,999))[0];
   $mFinal{$mField}{"count"}=$mFinal{$mField}{"count"}+1;
   $mFinal{$mField}{"content"}.=$mLine;   # '.' for concatenate strings.
}
foreach my $mField ( keys %mFinal )
{
   my $mCount = $mFinal{$mField}{"count"};
   if ( $mCount != 10 )
   {
#      print "$mFinal{$mField}{'content'}\n";
      print "$mFinal{$mField}{'content'}";
   }
}

Regards,
Birei
# 5  
Old 09-23-2011
Assuming you don't want lines when the first field repeats N times:
Code:
awk -v N=10 '
$1 != prev {
  if (c != N) for (i=1; i<=c; i++) print a[i]
  c = 0
}
{
  a[++c] = $0;
  prev = $1;
}           
END {         
  if (c != N) for (i=1; i<=c; i++) print a[i] 
}' INPUTFILE

# 6  
Old 09-24-2011
Thank you very much, everyone. After a few hours of head-banging, I came up with my own code, which seems to be working fine. Yay!
Code:
#! /usr/local/bin/perl
use warnings;
use strict;

my %hash;
my %dup;

while (<>) {
    chomp;
    my ($x) = split;
    $hash{$_} = $x;                     # map each line to its first field
}

$dup{$hash{$_}}++ for keys %hash;       # count lines per first-field value

foreach my $line (keys %hash) {
    print "$line\n" if $dup{$hash{$line}} != 10;
}

