How to remove duplicated based on longest row & largest value in a column

08-27-2009

Registered User

41, 0

Join Date: Jul 2009

Last Activity: 2 June 2010, 7:15 AM EDT

Posts: 41

Thanks Given: 0

Thanked 0 Times in 0 Posts

How to remove duplicated based on longest row & largest value in a column

Hii i have a file with data as shown below. Here i need to remove duplicates of the rows in such a way that
it just checks for 2,3,4,5 column for duplicates.When deleting duplicates,retain largest row i.e with many columns with values should be selected.Then it must remove duplicates such that by checking for the largest value in a specific column say 19 from the given below data.

HTML Code:

SSR 1901  12 1  0  0  0.00  40.0000  71.2000  14   12 3.00   0 4.60 4.00 0.00   0 0.00 8.60  0   
SSR 1901  12 1  0 10  3.00  40.0000  71.0000  30   0 0.00   0 5.80 0.00  5.90   0 5.70 5.90  0   
SSR 1902  8 22  3  7  4.40  40.0000  68.5000  35   0 0.00   0 6.00 0.00 6.20   0 5.90 6.20  0   aaaa 
BDA 1902  8 22  3  0  0.00  40.0000  77.0000  60   0 8.70   0 0.00 0.00 8.00   0 8.60 8.60  0   cccc
CFR 1903  8 22  3  0  0.00  40.0000  77.0000  25   0 0.00   0 0.00 0.00 0.00   0 8.60 8.60  0 bbbb  
RAO 1906  8 16  17  0  0.00  24.4000  72.7000  10   0 0.00   0 4.30 0.00 0.00   0 0.00 4.30  0 
RAO 1906  8 16  17  6  0.00  24.4000  72.7000  10   0 0.00   0 4.30 6.00 0.00   0 0.00 4.30  0 
LEE 1912  8 22  3  0  0.00  40.0000  76.5000   0   0 0.00   0 0.00 0.00 0.00   0 8.20 8.20  0 ffff 
LEE 1912  8 22  3  0  0.00  40.0000  76.5000   0   0 0.00   0 0.00 0.00 0.00   0 8.20 8.20  0 ffff

The output should be like

HTML Code:

SSR 1901  12 1  0  0  0.00  40.0000  71.2000  14   12 3.00   0 4.60 4.00 0.00   0 0.00 8.60  0   
BDA 1902  8 22  3  0  0.00  40.0000  77.0000  60   0 8.70   0 0.00 0.00 8.00   0 8.60 8.60  0 cccc 
CFR 1903  8 22  3  0  0.00  40.0000  77.0000  25   0 0.00   0 0.00 0.00 0.00   0 8.60 8.60  0 bbbb 
RAO 1906  8 16  17  6  0.00  24.4000  72.7000  10   0 0.00   0 4.30 6.00 0.00   0 0.00 4.30  0 
LEE 1912  8 22  3  0  0.00  40.0000  76.5000   0   0 0.00   0 0.00 0.00 0.00   0 8.20 8.20  0 ffff

Here we are removing duplicates rows based on 2 criteria i.e
1)we check for 2,3,4,5 columns if they are same if so then remove one of the duplicate row.
2)Retain the row which has its largest value in column 19 & which has large set of columns with values in that row.
Help me out if any one has an idea also..i am trying this out from past one week...

Thanks in advance..

reva

View Public Profile for reva

Find all posts by reva

08-30-2009

Registered User

380, 91

Join Date: Aug 2009

Last Activity: 15 March 2013, 10:40 AM EDT

Location: New Jersey

Posts: 380

Thanks Given: 7

Thanked 91 Times in 75 Posts

Code:

sort -k2,5n -k19,19rn -k1r filename |sort -mu -k2,5n

binlib

View Public Profile for binlib

Find all posts by binlib

08-31-2009

Registered User

41, 0

Join Date: Jul 2009

Last Activity: 2 June 2010, 7:15 AM EDT

Posts: 41

Thanks Given: 0

Thanked 0 Times in 0 Posts

Thanks a lot lot its working some what correctly but deletes most of my data.For above data its working.
but if i hav data like given below included with the data as above the output i want is shown.
The data included with above data is

Code:

  SIG   2007  3 24  4 35 45.80   5.2600  94.3100  58   0 5.20   0 0.00 5.00 0.00   0 0.00 5.20   0
  SSS  2007  3 24  9  3 37.40  36.5600  71.4800 152   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0
  SIG   2008  3 25 18 29 33.15   1.7700  99.3400 163   0 4.60   0 0.00 0.00 0.00   0 0.00 4.60   0
  SEG  2008  3 25 18 27 35.06   1.7700  99.3400  89   0 5.00   0 0.00 0.00 0.00   0 0.00 5.00   0
PDE-Q 2009  7  2 22 36 45.17  37.4800  71.7400  20   0 4.60   0 0.00 0.00 0.00   0 0.00 4.60   0 rrr
PDE-Q 2009  7  2 23 50 49.20  37.4800  71.7400 108   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0 tttt
PDE-Q 2009  7  3  4 42 32.83  34.4600  24.1200  41   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0 yyy
PDE-Q 2009  7  5  9 45 48.77  36.4600  71.0700 248   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.60   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0
PDE-Q 2009  7  6 16  0 38.96   3.0400  93.3500  34   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  7  0 32 47.11  34.1600  25.5100  13   0 0.00   0 0.00 0.00 0.00   0 0.00 0.00   0
PDE-Q 2009  7  7  1  2  0.48  34.1600  25.5100  25   0 4.80   0 0.00 0.00 0.00   0 3.00 4.80   0

THe output should be like

Code:

   SIG  2007  3 24  4 35 45.80   5.2600  94.3100  58   0 5.20   0 0.00 5.00 0.00   0 0.00 5.20   0
  SSS  2007  3 24  9  3 37.40  36.5600  71.4800 152   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0
  SEG  2008  3 25 18 27 35.06   1.7700  99.3400  89   0 5.00   0 0.00 0.00 0.00   0 0.00 5.00   0
PDE-Q 2009  7  2 23 50 49.20  37.4800  71.7400 108   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0 tttt
PDE-Q 2009  7  3  4 42 32.83  34.4600  24.1200  41   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0 yyy
PDE-Q 2009  7  5  9 45 48.77  36.4600  71.0700 248   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.60   0
PDE-Q 2009  7  6 16  0 38.96   3.0400  93.3500  34   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  7  1  2  0.48  34.1600  25.5100  25   0 4.80   0 0.00 0.00 0.00   0 3.00 4.80   0

Help me out if anyone knows...
Its like its checking for duplicates in columns 2,3,4 together & largest value in column 19 & longest row i mean with more number of values in that row.

But as per your sort command it deletes all the rows of 2009 or any other number in column 2 & just retains one row of column2. I want to check duplicates for 2,3,4 column together not seperately just column2.

Last edited by reva; 08-31-2009 at 04:32 AM..

reva

View Public Profile for reva

Find all posts by reva

09-11-2009

Registered User

380, 91

Join Date: Aug 2009

Last Activity: 15 March 2013, 10:40 AM EDT

Location: New Jersey

Posts: 380

Thanks Given: 7

Thanked 91 Times in 75 Posts

I mistakenly thought -k2,5n was equivalent to -k2,2n -k3,3n -k4,4n -k5,n. Please replace the two instances with the later. By the way your desired result doesn't match the description of the second scenario.

binlib

View Public Profile for binlib

Find all posts by binlib

09-11-2009

Registered User

21, 0

Join Date: Dec 2007

Last Activity: 2 November 2010, 2:29 AM EDT

Location: Chennai

Posts: 21

Thanks Given: 0

Thanked 0 Times in 0 Posts

you are telling this is your out put but i am confused by seeing some records of your output

SIG 2007 3 24 4 35 45.80 5.2600 94.3100 58 0 5.20 0 0.00 5.00 0.00 0 0.00 5.20 0
SSS 2007 3 24 9 3 37.40 36.5600 71.4800 152 0 4.70 0 0.00 0.00 0.00 0 0.00 4.70 0

PDE-Q 2009 7 5 9 45 48.77 36.4600 71.0700 248 0 4.90 0 0.00 0.00 0.00 0 0.00 4.90 0
PDE-Q 2009 7 5 12 25 37.44 1.3300 99.7800 185 0 4.50 0 0.00 0.00 0.00 0 0.00 4.60 0

but in this the 2nd,3rd,4th fields are same .(please write properly your requirement,it is really confusing).

---------- Post updated at 07:57 AM ---------- Previous update was at 07:57 AM ----------

you are telling this is your out put but i am confused by seeing some records of your output

SIG 2007 3 24 4 35 45.80 5.2600 94.3100 58 0 5.20 0 0.00 5.00 0.00 0 0.00 5.20 0
SSS 2007 3 24 9 3 37.40 36.5600 71.4800 152 0 4.70 0 0.00 0.00 0.00 0 0.00 4.70 0

PDE-Q 2009 7 5 9 45 48.77 36.4600 71.0700 248 0 4.90 0 0.00 0.00 0.00 0 0.00 4.90 0
PDE-Q 2009 7 5 12 25 37.44 1.3300 99.7800 185 0 4.50 0 0.00 0.00 0.00 0 0.00 4.60 0

but in this the 2nd,3rd,4th fields are same .(please write properly your requirement,it is really confusing).

Manabhanjan

View Public Profile for Manabhanjan

Find all posts by Manabhanjan

09-12-2009

Registered User

98, 2

Join Date: Dec 2008

Last Activity: 1 June 2017, 6:59 AM EDT

Location: India,Bangalore

Posts: 98

Thanks Given: 0

Thanked 2 Times in 2 Posts

Hi Reva,
you can use the follwoing code: Hope it will work fine for you.

sort -k 2,4 -k 19r|awk 'a!=$2$3$4 {a=$2$3$4;print $0}'

I got the out put as:

SIG 2007 3 24 4 35 45.80 5.2600 94.3100 58 0 5.20 0 0.00 5.00 0.00 0 0.00 5.20 0
SEG 2008 3 25 18 27 35.06 1.7700 99.3400 89 0 5.00 0 0.00 0.00 0.00 0 0.00 5.00 0
PDE-Q 2009 7 2 23 50 49.20 37.4800 71.7400 108 0 4.70 0 0.00 0.00 0.00 0 0.00 4.70 0 tttt
PDE-Q 2009 7 3 4 42 32.83 34.4600 24.1200 41 0 4.50 0 0.00 0.00 0.00 0 0.00 4.50 0 yyy
PDE-Q 2009 7 5 9 45 48.77 36.4600 71.0700 248 0 4.90 0 0.00 0.00 0.00 0 0.00 4.90 0
PDE-Q 2009 7 6 16 0 38.96 3.0400 93.3500 34 0 4.90 0 0.00 0.00 0.00 0 0.00 4.90 0
PDE-Q 2009 7 7 1 2 0.48 34.1600 25.5100 25 0 4.80 0 0.00 0.00 0.00 0 3.00 4.80 0

sanjay.login

View Public Profile for sanjay.login

Find all posts by sanjay.login

09-12-2009

Registered User

24, 0

Join Date: Sep 2009

Last Activity: 15 December 2012, 9:17 AM EST

Location: chennai

Posts: 24

Thanks Given: 0

Thanked 0 Times in 0 Posts

Hi Revathy,
"binlib" has given an excelent code... just modifyin his code will get you your desired result:

code:
sort -k2,5 -k19,19r -k1r filename |sort -mu -k2,5

kindly try this one and check if this is your desired result

know d unknown

View Public Profile for know d unknown

Find all posts by know d unknown

UNIX for Dummies Questions & Answers

How to remove duplicated based on longest row & largest value in a column

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

How to remove duplicated column in a text file?

Discussion started by: huiyee1

2. Shell Programming and Scripting

Trying to remove duplicates based on field and row

Discussion started by: newbie2010

3. Shell Programming and Scripting

Find smallest & largest in every column

Discussion started by: attila

4. Shell Programming and Scripting

Remove duplicates within row and separate column

Discussion started by: manigrover

5. Shell Programming and Scripting

Deleting a row based on fetched value of column

Discussion started by: swasid

6. Shell Programming and Scripting

Sort a the file & refine data column & row format

Discussion started by: ckaramsetty

7. Shell Programming and Scripting

need to remove duplicates based on key in first column and pattern in last column

Discussion started by: script_op2a

8. Shell Programming and Scripting

duplicate row based on single column

Discussion started by: mitr

9. Shell Programming and Scripting

How to print column based on row number

Discussion started by: Surabhi_so_mh

10. Shell Programming and Scripting

ITERATION: remove row based on string value

Discussion started by: asanjuan