How to remove duplicates from a file based on many conditions


 
# 1  
Old 02-04-2010

Hi friends, I have a huge set of data stored in a file, as shown below.
a.dat:
Code:
 RAO   1869 12 19  0  0  0.00  17.9000  82.3000  10.0   0  0.00   0  3.70  0.00  0.00   0  0.00  3.70   4   NULL
 LEE   1870  4 11  1  0  0.00  30.0000  99.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  6.75  6.75   9   NULL
 SIG   1870  4 11  1  0  0.00  30.0000  99.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  0.00  6.75   9   NULL
 SIG   1870  4 11  1  0  0.00  30.0000  99.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  6.70  6.70   0   NULL
 RAO   1870 10 19  0  0  0.00  17.7000  83.4000  10.0   0  0.00   0  3.70  0.00  0.00   0  0.00  3.70   4   NULL
 SSR   1896  3  4  5  5  0.00  37.0000  76.0000  40.0   0  0.00   0  7.10  0.00  0.00   0  0.00  7.10   8   NULL
 SSR   1896  6 17 12  0  0.00  37.0000  68.0000  15.0   0  0.00   0  5.20  0.00  0.00   0  0.00  5.20   7   2.23e+23
 SIG   1899  9 23 23 24  0.00  37.0000  71.0000 160.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   6   NULL
 SSR   1899  9 23 23 20  0.00  37.0000  71.0000 160.0   0  0.00   0  7.50  0.00  0.00   0  0.00  7.50   6   NULL
 SIG   1902  8 30 21 50  0.00  37.0000  71.0000 200.0   0  0.00   0  0.00  0.00  0.00   0  7.70  7.70   7   NULL
 SSR   1902  8 30 21 50  0.00  37.0000  71.0000 200.0   0  0.00   0  7.70  0.00  6.90   0  0.00  7.70   7   NULL
 BDA   1905  4  4  2 50  0.00  33.0000  76.0000  60.0   0  0.00   0  5.00  8.00  0.00   0  8.60  8.60   0   NULL
 G-R   1905  4  4  0 50  0.00  33.0000  76.0000  25.0   0  0.00   0  5.00  8.00  0.00   0  8.60  8.60   0   1.23e+11
 SIG   1905  4  4  2 50  0.00  33.0000  76.0000  25.0   0  0.00   0  0.00  0.00  0.00   0  8.60  8.60   0   NULL
 SIG   1950  8 15  0  0  0.00  28.5000  96.7000   0.0   0  0.00   0  0.00  0.00  0.00   0  8.60  8.60   0   NULL
 BDA   1950  8 15 14  9 30.00  28.5000  96.5000  60.0   0  0.00   0  0.00  0.00  0.00   0  8.70  8.70   0   NULL
 G-R   1913  3  6  2  9  0.00  30.0000  83.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  6.20  6.20   0   NULL
The output for this file should look like b.dat:
Code:
 RAO   1869 12 19  0  0  0.00  17.9000  82.3000  10.0   0  0.00   0  3.70  0.00  0.00   0  0.00  3.70   4   NULL
 LEE   1870  4 11  1  0  0.00  30.0000  99.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  6.75  6.75   9   NULL
 RAO   1870 10 19  0  0  0.00  17.7000  83.4000  10.0   0  0.00   0  3.70  0.00  0.00   0  0.00  3.70   4   NULL
 SSR   1896  3  4  5  5  0.00  37.0000  76.0000  40.0   0  0.00   0  7.10  0.00  0.00   0  0.00  7.10   8   NULL
 SSR   1896  6 17 12  0  0.00  37.0000  68.0000  15.0   0  0.00   0  5.20  0.00  0.00   0  0.00  5.20   7   NULL
 SIG   1899  9 23 23 24  0.00  37.0000  71.0000 160.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   6   NULL
 SSR   1902  8 30 21 50  0.00  37.0000  71.0000 200.0   0  0.00   0  7.70  0.00  6.90   0  0.00  7.70   7   NULL
 BDA   1905  4  4  2 50  0.00  33.0000  76.0000  60.0   0  0.00   0  5.00  8.00  0.00   0  8.60  8.60   0   NULL
 BDA   1950  8 15 14  9 30.00  28.5000  96.5000  60.0   0  0.00   0  0.00  0.00  0.00   0  8.70  8.70   0   NULL
 G-R   1913  3  6  2  9  0.00  30.0000  83.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  6.20  6.20   0   NULL
Now I have to remove duplicate lines from this file based on the following conditions. We check column 19:
1) If its value is between 0 and 7.00, we compare columns 2, 3, 4, 5 and 6. If they are the same, we remove all but one of the duplicate rows and retain the row that has the largest value in column 19 and, on a tie, the most filled columns.
2) If its value is between 7.00 and 8.00, we compare columns 2, 3, 4 and 5, and keep one row the same way.
3) If its value is between 8.00 and 9.00, we compare columns 2, 3 and 4, and keep one row the same way. For example, the three 1905 4 4 rows all have 8.60 in column 19, so only columns 2-4 are compared and only the BDA row survives.
Help me out!
# 2  
Old 02-04-2010

Write a Perl script like this:
Code:
#!/usr/bin/perl -w

use strict;

open (IN, "<data.in") || die "Cannot open data.in: $!\n";
my @lines = <IN>;
close (IN);

my @new; # filtered data items

OUTER_LOOP:
foreach my $line ( @lines )
{
  chomp $line;

  # split on runs of whitespace; split(' ', ...) also discards
  # the leading blanks, which split(/\s+/, ...) would not
  my @item = split (' ', $line);

  # apply your conditions
  if ( $item[18] > 0.00 &&
       $item[18] < 7.00 )
  {
     # check if we have that item already
     # note that $have is an array reference
     foreach my $have ( @new )
     {
        if ( $item[1] == $have->[1] &&
             $item[2] == $have->[2] &&
             $item[3] == $have->[3] &&
             $item[4] == $have->[4] &&
             $item[5] == $have->[5] )
        {
           # found it.  So, replace the stored entry
           # with the one with the larger #19
           if ( $item[18] > $have->[18] )
           {
               # foreach aliases the array elements, so we can
               # simply swap in the new item
               $have = \@item;
           }

           # we should not find that item again, so skip
           # straight to the next input line
           next OUTER_LOOP;
        }
     }

     # no duplicate found: keep this line
     push @new, \@item;
  }
  # add similar tests for the other conditions below
  elsif ( ... )
  {
  }
} # all lines

open (OUT, ">data.out") || die "Cannot open data.out: $!\n";
foreach my $out ( @new )
{
  # print data tab separated
  print OUT join ("\t", @{$out});
  print OUT "\n";
}
close (OUT);

The above is not tested and may still have typos, and indentation is hard to do in a web form (sorry) - but I hope you get the idea. If you intend to go that route, let me know if you have trouble with the code.
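To try it out, save the script as, say, dedup.pl (the name is just an example), put the input rows in a file called data.in in the same directory, and run perl dedup.pl; the filtered rows end up tab-separated in data.out.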

Best, Andre
# 3  
Old 02-04-2010

No, I am not getting the exact output. I don't know much Perl. Can you tell me how to do the same using awk?
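Since no awk version was posted in this thread, here is a minimal, untested awk sketch of the rules from post #1. Two points are guesses, because the original wording leaves them open: the boundary values 7.00 and 8.00 are assigned to the lower bracket, and ties in column 19 are broken on the number of filled (non-zero, non-NULL) fields.
Code:
#!/usr/bin/awk -f
# Sketch only -- not tested beyond the sample rows above.
# Key: columns 2-6 when col 19 <= 7.00, columns 2-5 when <= 8.00,
# columns 2-4 otherwise.  Keep the row with the largest col 19;
# break ties on the number of filled (non-zero, non-NULL) fields.

function filled(    i, f) {
    f = 0
    for (i = 1; i <= NF; i++)
        if ($i != 0 && $i != "NULL")
            f++
    return f
}

{
    m = $19 + 0
    if      (m <= 7.00) key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
    else if (m <= 8.00) key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
    else                key = $2 SUBSEP $3 SUBSEP $4

    if (!(key in mag))              # remember first-seen order of keys
        order[++nkeys] = key

    f = filled()
    if (!(key in mag) || m > mag[key] || (m == mag[key] && f > cnt[key])) {
        mag[key]  = m
        cnt[key]  = f
        best[key] = $0
    }
}

END {
    for (i = 1; i <= nkeys; i++)
        print best[order[i]]
}
Run it as awk -f dedup.awk a.dat > b.dat (dedup.awk is just an example name). On the sample rows above this appears to reproduce b.dat, except that, as the next post points out, the 1896 6 17 line keeps its 2.23e+23 value instead of NULL.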
# 4  
Old 02-04-2010

After completing the script, I still can't reproduce your output: it seems wrong. For example, I can't find the line ending not in NULL but in 2.23e+23 -- that value is simply gone, yet according to your filter algorithm that line should remain, I think.

Otherwise the attached script should now be able to reproduce the data you show. It reads a data.in file and writes a data.out.
 