How to remove duplicates based on longest row & largest value in a column

09-12-2009

Thanks a lot. The code is working now.

reva

12-14-2009

Hi friends. There is an error: the code does not give the correct output for my large data set. Take a look at this kind of file.
I have a file like this,
a.dat:

HTML Code:

   ISC 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.20  79  0.00  5.90  6.10   0  0.00  6.20   0      7.99e+2
   PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
  HFS 1984  5  6 15 18 20.00  18.9000  99.2000   0.0   0  6.00   0  0.00  0.00  0.00   0  0.00  6.00   0   NULL
   ISC 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  5.70  85  0.00  6.00  5.60  14  5.80  6.00   0     1.19e+25
   MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
   NAO 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  5.60  85  0.00  6.00  5.60  14  0.00  6.00   0     1.19e+25
   ISC 1986 11  1  5 45  4.82  27.1726  96.3983  82.0   9  4.10   2  0.00  0.00  0.00   0  0.00  4.10   0   NULL
   MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
   NAO 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.10  68  0.00  5.20  5.00   5  0.00  5.20   0     6.96e+23
   NDI 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.30  68  0.00  5.20  5.00   5  0.00  5.30   0     6.96e+23
   HFS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.20  89  6.10  5.80  5.80  20  5.90  6.20   0       6.7e+2
   ISC 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  5.80  89  0.00  5.80  5.80  20  5.80  5.80   0     6.7e+24
   MOS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.10  89  0.00  5.80  5.80  20  5.70  6.20   0     6.7e+24

The output I must get is
b.dat:

HTML Code:

   PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
   MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
   MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
   HFS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.20  89  6.10  5.80  5.80  20  5.90  6.20   0       6.7e+2

It must check whether columns 2, 3, 4 and 5 are the same and, among those duplicates, keep only the longest row (the one with the most values) with the largest 19th column.

reva

12-14-2009

Use this code:

Code:

sort -k 2,5 -k 19r file_name|awk 'a!=$2$3$4 {a=$2$3$4;print $0}'

Output:

Code:

PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
MOS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.10  89  0.00  5.80  5.80  20  5.70  6.20   0     6.7e+24

regards,
Sanjay

sanjay.login
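
A note on the one-liner above: for the 1988 group its posted output has MOS, where the requested b.dat has HFS. Both rows carry 6.20 in column 19, and nothing in the pipeline breaks that tie by how full the row is. Below is a minimal single-pass awk sketch of one way to do it. It assumes that "longest row with values" means the row with the most non-zero numeric fields, which the thread does not confirm, and the function name `dedupe_best` is made up for illustration.

```shell
# dedupe_best FILE...
# Hypothetical helper: for each group of rows sharing columns 2-5,
# keep the row with the largest column 19; on a tie keep the row with
# the most non-zero numeric fields (assumed meaning of "longest row").
dedupe_best() {
    awk '
    {
        key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5

        # Count fields that evaluate to a non-zero number
        # (text such as NULL evaluates to 0 and is not counted).
        filled = 0
        for (i = 2; i <= NF; i++)
            if ($i + 0 != 0) filled++

        if (!(key in best)) {
            order[++n] = key          # remember first-seen order of groups
        } else if (!($19 + 0 > mag[key] ||
                     ($19 + 0 == mag[key] && filled > cnt[key]))) {
            next                      # current best row stays
        }
        best[key] = $0
        mag[key]  = $19 + 0
        cnt[key]  = filled
    }
    END {
        for (i = 1; i <= n; i++) print best[order[i]]
    }
    ' "$@"
}
```

It would be called as `dedupe_best a.dat > b.dat`. Groups are printed in the order they first appear in the input, so no separate sort is needed.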

06-02-2010

It is not working if the file contains data like this,
a.dat:

HTML Code:

 BDA 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   G-R 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 10 23 20 14  0.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL
   SSR 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  7.60   0  6.90  0.00  6.10   0  6.80  7.60   0 NULL
   BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   G-R 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   G-R 1908 12 12 12 54 54.00  26.5000  97.0000 100.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL
   SIG 1909  7  7 21 39  0.00  36.5000  70.5000  60.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL

The output should be

HTML Code:

   SSR 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  7.60   0  6.90  0.00  6.10   0  6.80  7.60   0 NULL
  BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL

If I use
sort -k 2,5 -k 19r file_name|awk 'a!=$2$3$4 {a=$2$3$4;print $0}'
I get this output instead:

HTML Code:

   SIG 1908 10 23 20 14  0.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL
   BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL

Help me out

reva
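
A possible explanation for the result above: SIG and SSR both have 7.60 in column 19, so the sort keys tie, and GNU sort then orders tied lines by a last-resort comparison of the whole line, which puts SIG ahead of SSR. The sketch below adds explicit tie-break keys instead. It rests on several assumptions that the thread does not confirm:

- Rows are grouped on columns 2 to 4. This is what the awk part of the one-liner and the expected output above imply, although the earlier post asks for columns 2 to 5.
- Ties on column 19 go to the row with the most non-zero values in columns 12 to 18, then to the row that comes first in the file. That range was chosen only because it reproduces the expected output above.
- The function name `dedupe_sorted` is hypothetical.

```shell
# dedupe_sorted FILE...
# Hypothetical helper using decorate / sort / undecorate.
dedupe_sorted() {
    # Decorate: prefix each row with its count of non-zero values in
    # columns 12-18 and its input line number. Original column N is
    # therefore column N+2 in the sort step.
    awk '{
        c = 0
        for (i = 12; i <= 18; i++)
            if ($i + 0 != 0) c++
        print c, NR, $0
    }' "$@" |
    # Sort by year, month, day (original columns 2-4), then column 19
    # descending, then the count descending, then input order.
    LC_ALL=C sort -k4,4n -k5,5n -k6,6n -k21,21nr -k1,1nr -k2,2n |
    # Undecorate: print only the first row of each group, without the
    # two helper columns.
    awk '{
        key = $4 FS $5 FS $6
        if (key != prev) {
            prev = key
            sub(/^[0-9]+ [0-9]+ /, "")
            print
        }
    }'
}
```

It would be called as `dedupe_sorted a.dat`. Every row is given a unique line number as the final key, so the result does not depend on how sort treats rows that tie on the other keys.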

10-05-2010

I have this sample text file:

E643E32D00AB58B49926B3C9628793E5,907 ,9999,5/1/2004,867 ,12/31/2006,ACT,1,0,1
CA589E9EC9CDBABA560EE6BF77AA4DBE,907 ,8741,7/1/2006,867 ,7/31/2007,ACT,1,0,1
5DBD6FF7877F5F38C62658DA5E460E64,907 ,5141,10/1/2003,867 ,9/30/2008,ACT,1,0,1
DB392456D01E0BDEE374C7BD62C9301F,907 ,4213,7/1/2009,867 ,12/31/9999,ACT,1,0,1
E1D08EF15E28E729D354B2484DDF5DFB,907 ,1014,6/15/2010,809 ,6/15/2010,DEL,500001,0,500001
86487F19E6275AFAC66279077B94FDE3,907 ,1542,6/1/2009,867 ,12/31/9999,ACT,1,0,1
E45B7371EEC0D1AB00E1750B5BC661F7,907 ,5211,1/1/2004,867 ,12/31/2006,ACT,1,0,1
FCBAFE572C5E4BA29B3F8030BD480A94,907 ,6531,1/1/2003,867 ,12/31/2005,ACT,1,0,1
2345AD5D2BFB29C821C1BC3DE8B746A7,907 ,2711,1/1/2004,827 ,1/31/2305,ACT,1,0,1
2345AD5D2BFB29C821C1BC3DE8B746A7,907 ,2711,1/1/2004,867 ,1/31/2005,ACT,1,0,1
F30641D0918E6BD2BA0B13903B3EA012,907 ,1541,5/1/2007,867 ,8/31/2007,ACT,1,0,1
F30641D0918E6BD2BA0B13903B3EA012,907 ,1541,5/1/2007,867 ,8/31/2007,ACT,1,0,1

The last two lines are exact duplicates, and the two lines before them are duplicates only on my keys, which are columns 1 and 2.

When I tried the code provided above, modified like this:

sort -k1,2 f1.txt |sort -mu -k1,2

it only removes the duplicate line for the key F30641D0918E6BD2BA0B13903B3EA012,907

but the duplicate for the key 2345AD5D2BFB29C821C1BC3DE8B746A7,907 is not removed.

I do not want to use awk, since I would not be able to reuse it.

The keys might not be fixed; I will be passing them in as a variable.

My reusable code might look like
pk=1,2
sort -k$pk f1.txt|sort -mu -k$pk

Please help.

Last edited by gpsridhar; 10-05-2010 at 12:25 PM.. Reason: Additional information provided

gpsridhar
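
A likely cause of the behaviour described above: without `-t`, sort splits fields on blanks, not commas, so `-k1,2` is not comparing the first two comma-separated columns. The two 2345... lines differ (827 versus 867) inside what sort treats as the key, so both survive, while the two F306... lines are identical and collapse. Here is a minimal awk-free sketch that sets the delimiter and takes the key spec as a parameter. The function name `dedupe_by_key` is made up for illustration.

```shell
# dedupe_by_key KEYSPEC FILE...
# Hypothetical helper: keep one line per distinct key, where KEYSPEC is
# a sort key on comma-separated fields, for example 1,2.
dedupe_by_key() {
    pk=$1
    shift
    # -t, makes the comma the field separator; -u keeps a single line
    # from each set of lines whose keys compare equal.
    sort -t, -u -k"$pk" "$@"
}
```

With the sample file it would be called as `dedupe_by_key 1,2 f1.txt`. When several lines share a key but differ elsewhere, which of them is kept should not be relied on.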
