Remove duplicates based on the two key columns

10-21-2010

Hi All,
I need to fetch unique records based on a key column (the first column), keeping for each key the record with the maximum value in column 2, with the output sorted. The duplicates should be stored in a separate output file.

Input :

Input.txt
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l

Desired Output:

Duplicates.txt
1234,0,x
5678,9,l

Uniqrecords.txt
1234,1,y
5678,10,z
9999,10,k

Regards,
MuniSekhar

Thanks in Advance....

kmsekhar

Hi, what have you tried so far?

Scrutinizer

Earlier I was removing every occurrence of a duplicated key in column 1, storing the duplicates and the unique records in separate files. For that I used the code below:

sort -t\| -k1 input1.txt | awk '{
    x[$1]++
    y[NR] = $0
} END {
    for (i = 1; i <= NR; i++) {
        tmp = y[i]
        split(tmp, z)
        print tmp > ((x[z[1]] > 1) ? "output.txt" : "output2.txt")
    }
}' SUBSEP="|" FS="|"

kmsekhar

Well, you certainly had the right idea. I noticed that you are using | as the field separator for sort, while the file is comma-delimited. I also think you should sort on the 2nd field if you want to keep the max value. If you sort in reverse order, the max value for each key comes first, so the rest can go in the duplicates bin.

For the awk part, the field separator (FS) should be set to ',' rather than '|'; then you do not need the split function. SUBSEP is used for multi-dimensional array subscripts and is not required here.
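
As a rough sketch of that correction applied to the earlier approach (not the final suggestion): with -F, the key is already available as $1, so it can be stored next to each line instead of being recovered with split. This assumes the sample Input.txt from the first post stands in for input1.txt; the array names and the scratch directory are illustrative only.

```shell
# Work in a scratch directory (an assumption, to avoid overwriting real files).
cd "$(mktemp -d)" || exit 1

# Sample data from the first post.
cat > Input.txt <<'EOF'
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l
EOF

# Count each key and remember every line together with its key, then route
# the lines: keys seen more than once go to output.txt, the rest to output2.txt.
awk -F, '
{ count[$1]++; line[NR] = $0; key[NR] = $1 }
END {
    for (i = 1; i <= NR; i++)
        print line[i] > ((count[key[i]] > 1) ? "output.txt" : "output2.txt")
}' Input.txt

cat output.txt
cat output2.txt
```

With the sample data, 9999 is the only key that occurs once, so it alone should land in output2.txt.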

I would suggest something like this:

Code:

sort -t, -k2,2rn input.txt |
awk -F, '{print > ((A[$1]++)?"Duplicates.txt":"Uniqrecords.txt")}'
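
The pipeline above writes both files in descending order of column 2, not in key order. Here is a hedged end-to-end sketch on the sample data from the first post, assuming that "in sorted manner" means sorted by the key column; the scratch directory and the array name seen are illustrative choices.

```shell
# Work in a scratch directory (an assumption, to avoid overwriting real files).
cd "$(mktemp -d)" || exit 1

# Sample data from the first post.
cat > Input.txt <<'EOF'
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l
EOF

# Sort on column 2, numeric and descending, so the max row of each key comes
# first. The first row seen for a key goes to Uniqrecords.txt, later rows for
# the same key go to Duplicates.txt.
sort -t, -k2,2rn Input.txt |
awk -F, '{ print > ((seen[$1]++) ? "Duplicates.txt" : "Uniqrecords.txt") }'

# Assumption: the final files should be ordered by the key column.
sort -t, -k1,1n -o Uniqrecords.txt Uniqrecords.txt
sort -t, -k1,1n -o Duplicates.txt Duplicates.txt

cat Uniqrecords.txt
cat Duplicates.txt
```

If two rows share both the key and the maximum value, which of them ends up as the unique record depends on how sort breaks the tie.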

Last edited by Scrutinizer; 10-21-2010 at 06:50 AM.. Reason: Had duplicates and uniqrecords reversed

Scrutinizer

OMG, this is a nice, tricky one!
OK, so in awk, when I see a question about duplicates, I should think of using ++.

... Question: in awk, is "" + 1 = 0 ????

Indeed:

Code:

# sort -t, -k2rn infile
5678,10,z
9999,10,k
5678,9,l
1234,1,y
1234,0,x
# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]}'





# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]++}'
0
0
1
0
1
#

ctsgnb

Not exactly. Uninitialized variables hold the empty string, which becomes zero when converted to a number.
With the postfix form, the ++ is applied after the value has been printed. Compare:

Code:

$ sort -t, -k2,2rn input.txt | nawk -F, '{print A[$1]++,A[$1]}'
0 1
0 1
1 2
0 1
1 2

and

Code:

$ sort -t, -k2,2rn input.txt | nawk -F, '{print ++A[$1]}'
1
1
2
1
2
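
To address the "" + 1 question directly, here is a small sketch (the variable name x is arbitrary) suggesting that an uninitialized awk variable behaves as both the empty string and zero, and that adding 1 to it gives 1; only the post-increment expression reports the old value.

```shell
out=$(awk 'BEGIN {
    print (x == "")   # 1: an uninitialized variable equals the empty string
    print (x == 0)    # 1: and it also equals zero in a numeric comparison
    print x + 1       # 1: the empty string converts to 0, so 0 + 1 is 1
    print x++         # 0: post-increment yields the value before incrementing
    print x           # 1: the increment has now taken effect
}')
printf '%s\n' "$out"
```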

Last edited by Scrutinizer; 10-21-2010 at 07:45 AM..

Scrutinizer

@Scrutinizer

Damn! I misunderstood (once more...). Thanks for clearing up my fuzzy brain again, dude!

Just tell me when you get tired of my questions and I will just stfu...

ctsgnb
