How to remove duplicates based on longest row & largest value in a column


 
# 8  
Old 09-12-2009
Thanks a lot, the code is working now.
# 9  
Old 12-14-2009
Hi friends, there is an error occurring: the code is not giving the correct output for my huge data set. Here is the type of file I have.
a.dat:
Code:
   ISC 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.20  79  0.00  5.90  6.10   0  0.00  6.20   0      7.99e+2
   PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
  HFS 1984  5  6 15 18 20.00  18.9000  99.2000   0.0   0  6.00   0  0.00  0.00  0.00   0  0.00  6.00   0   NULL
   ISC 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  5.70  85  0.00  6.00  5.60  14  5.80  6.00   0     1.19e+25
   MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
   NAO 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  5.60  85  0.00  6.00  5.60  14  0.00  6.00   0     1.19e+25
   ISC 1986 11  1  5 45  4.82  27.1726  96.3983  82.0   9  4.10   2  0.00  0.00  0.00   0  0.00  4.10   0   NULL
   MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
   NAO 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.10  68  0.00  5.20  5.00   5  0.00  5.20   0     6.96e+23
   NDI 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.30  68  0.00  5.20  5.00   5  0.00  5.30   0     6.96e+23
   HFS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.20  89  6.10  5.80  5.80  20  5.90  6.20   0       6.7e+2
   ISC 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  5.80  89  0.00  5.80  5.80  20  5.80  5.80   0     6.7e+24
   MOS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.10  89  0.00  5.80  5.80  20  5.70  6.20   0     6.7e+24
The output I must get is
b.dat:
Code:
   PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
   MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
   MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
   HFS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.20  89  6.10  5.80  5.80  20  5.90  6.20   0       6.7e+2
It must check whether columns 2, 3, 4 and 5 are the same, and among such duplicates keep only the longest row with values and the largest 19th column.
# 10  
Old 12-14-2009
Use this code:
Code:
sort -k 2,5 -k 19r file_name|awk 'a!=$2$3$4 {a=$2$3$4;print $0}'

Output:
Code:
PDE 1976  8 12 23 26 47.09  26.6967  97.0421  31.0 326  6.40  79  0.00  5.90  6.10   0  0.00  6.40   0      7.99e+2
MOS 1984  5  6 15 19 11.32  24.2152  93.5256  32.0 480  6.20  85  0.00  6.00  5.60  14  5.60  6.20   0     1.19e+25
MOS 1986 11  1  5  2 40.27  26.8483  96.3965  11.0 335  5.60  68  0.00  5.20  5.00   5  5.00  5.60   0      6.96e+2
MOS 1988  2  6 14 50 45.38  24.6677  91.5619  33.0 496  6.10  89  0.00  5.80  5.80  20  5.70  6.20   0     6.7e+24
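Note that -k 19r compares field 19 as plain text, so values like 6.7e+24 and 7.99e+2 are not ordered by magnitude. A possible variant, as a sketch only, assuming GNU sort (whose -g flag understands scientific notation) and keying on fields 2 through 5 in both the sort and the awk filter:
Code:
# Assumes GNU sort: -g (general numeric) understands values like 6.7e+24,
# so -k19,19gr puts the largest magnitude first within each group.
# awk then prints only the first line of every fields-2..5 group.
sort -k2,5 -k19,19gr file_name | awk '{ k = $2 FS $3 FS $4 FS $5; if (k != prev) { print; prev = k } }'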

regards,
Sanjay
# 11  
Old 06-02-2010
It's not working if the file contains data like this:
a.dat:
Code:
 BDA 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   G-R 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 10 23 20 14  0.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL
   SSR 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  7.60   0  6.90  0.00  6.10   0  6.80  7.60   0 NULL
   BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   G-R 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   G-R 1908 12 12 12 54 54.00  26.5000  97.0000 100.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL
   SIG 1909  7  7 21 39  0.00  36.5000  70.5000  60.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL
The output should be:
Code:
   SSR 1908 10 23 20 14  6.00  36.5000  70.5000 220.0   0  7.60   0  6.90  0.00  6.10   0  6.80  7.60   0 NULL
  BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL
When I use
Code:
sort -k 2,5 -k 19r file_name|awk 'a!=$2$3$4 {a=$2$3$4;print $0}'
I get this output instead:
Code:
   SIG 1908 10 23 20 14  0.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.60  7.60   0 NULL
   BDA 1908 10 24 21 16 36.00  36.5000  70.5000 220.0   0  0.00   0  0.00  0.00  0.00   0  7.00  7.00   0 NULL
   SIG 1908 12 12  0  0  0.00  26.5000  97.0000   0.0   0  0.00   0  0.00  0.00  0.00   0  7.50  7.50   0 NULL
   SIG 1909  7  7  0  0  0.00  36.5000  70.5000 230.0   0  0.00   0  0.00  0.00  0.00   0  7.80  7.80   0 NULL
Please help me out.
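Would something along these lines be the right direction? This is a sketch only: it assumes the event is identified by the date fields 2-4, that the largest 19th field wins, and that ties go to the row with the most non-zero fields (my reading of "longest row with values"); whether that tie-break matches every line of the expected output needs checking.
Code:
# Sketch only -- the field numbers and the tie-break rule are assumptions.
# Groups rows by fields 2-4 (year month day), keeps the row with the
# largest field 19, and breaks ties by the count of non-zero fields.
awk '
{
    key = $2 FS $3 FS $4
    nz = 0
    for (i = 7; i <= NF; i++) if ($i + 0 != 0) nz++
    if (!(key in best) || $19 + 0 > mag[key] || ($19 + 0 == mag[key] && nz > cnt[key])) {
        best[key] = $0; mag[key] = $19 + 0; cnt[key] = nz
    }
}
END { for (k in best) print best[k] }' a.dat
The END loop prints in no particular order, so a final sort on the date fields (for example sort -k2,2n -k3,3n -k4,4n) may be needed if the original ordering matters.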
# 12  
Old 10-05-2010
I have this sample text file:

E643E32D00AB58B49926B3C9628793E5,907 ,9999,5/1/2004,867 ,12/31/2006,ACT,1,0,1
CA589E9EC9CDBABA560EE6BF77AA4DBE,907 ,8741,7/1/2006,867 ,7/31/2007,ACT,1,0,1
5DBD6FF7877F5F38C62658DA5E460E64,907 ,5141,10/1/2003,867 ,9/30/2008,ACT,1,0,1
DB392456D01E0BDEE374C7BD62C9301F,907 ,4213,7/1/2009,867 ,12/31/9999,ACT,1,0,1
E1D08EF15E28E729D354B2484DDF5DFB,907 ,1014,6/15/2010,809 ,6/15/2010,DEL,500001,0,500001
86487F19E6275AFAC66279077B94FDE3,907 ,1542,6/1/2009,867 ,12/31/9999,ACT,1,0,1
E45B7371EEC0D1AB00E1750B5BC661F7,907 ,5211,1/1/2004,867 ,12/31/2006,ACT,1,0,1
FCBAFE572C5E4BA29B3F8030BD480A94,907 ,6531,1/1/2003,867 ,12/31/2005,ACT,1,0,1
2345AD5D2BFB29C821C1BC3DE8B746A7,907 ,2711,1/1/2004,827 ,1/31/2305,ACT,1,0,1
2345AD5D2BFB29C821C1BC3DE8B746A7,907 ,2711,1/1/2004,867 ,1/31/2005,ACT,1,0,1
F30641D0918E6BD2BA0B13903B3EA012,907 ,1541,5/1/2007,867 ,8/31/2007,ACT,1,0,1
F30641D0918E6BD2BA0B13903B3EA012,907 ,1541,5/1/2007,867 ,8/31/2007,ACT,1,0,1


The last two lines are exact duplicates, and the penultimate two lines are duplicates only for my keys, which are columns 1 and 2.

When I tried the code provided above, modified like this:

sort -k1,2 f1.txt |sort -mu -k1,2

it removes the duplicate line for the key F30641D0918E6BD2BA0B13903B3EA012,907

but the lines for the key 2345AD5D2BFB29C821C1BC3DE8B746A7,907 are not removed.

I do not want to use awk, since I will not be able to reuse it.

The keys might not be fixed; I will be passing them as a variable.

My reusable code might look like this:
pk=1,2
sort -k1$pk f1.txt|sort -mu -k$pk
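Would something along these lines do what I want? A sketch, assuming the comma can be handed to sort as the field separator; with -u, sort keeps one line per distinct key, although which of the duplicate lines survives is not guaranteed:

# assumption: the comma is the real delimiter, so pass it to sort with -t,
# -u then keeps one line per distinct key built from fields 1 through 2
pk=1,2
sort -t, -u -k"$pk" f1.txt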

Please help.
