Remove duplicates based on the two key columns


 
# 1  
Old 10-21-2010

Hi All,
I need to fetch unique records based on a key column (the first column), keeping for each key the record that has the maximum value in column 2, with the output sorted. The duplicates have to be stored in a separate output file.

Input :

Input.txt
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l

Desired Output:

Duplicates.txt
1234,0,x
5678,9,l

Uniqrecords.txt
1234,1,y
5678,10,z
9999,10,k

Regards,
MuniSekhar

Thanks in Advance....
# 2  
Old 10-21-2010
Hi, what have you tried so far?
# 3  
Old 10-21-2010
Earlier I was removing all occurrences of a duplicated key in column 1 and storing the duplicate and unique records in separate files. For that I used the code below:

sort -t\| -k1 input1.txt | awk '{
    x[$1]++
    y[NR] = $0
} END {
    for (i = 1; i <= NR; i++) {
        tmp = y[i]
        split(tmp, z)
        print tmp > ((x[z[1]] > 1) ? "output.txt" : "output2.txt")
    }
}' SUBSEP="|" FS="|"
# 4  
Old 10-21-2010
Well, you certainly had the right idea. I noticed that you are using | as the field separator for sort, while the file is comma-delimited. Also, I think you should sort on the 2nd field if you want to keep the max value. If you reverse sort, the max value for each key comes first, so the rest can go in the duplicates bin.

For the awk part, the field separator (FS) should be set to ','; then you do not need to use the split function. SUBSEP is used for multi-dimensional array subscripts and is not required here.

I would suggest something like this:
Code:
sort -t, -k2,2rn input.txt |
awk -F, '{print > ((A[$1]++)?"Duplicates.txt":"Uniqrecords.txt")}'
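
For anyone who wants to try it, here is a minimal sketch that runs the same sort | awk idea on the sample data from post #1 and then puts both result files back into key order, as the desired output shows. The scratch directory, the *.raw file names and the array name seen are only illustrative, and it assumes mktemp is available:

```shell
#!/bin/sh
# Sketch only: the scratch directory and the *.raw names are illustrative.
workdir=$(mktemp -d)   # assumption: mktemp -d exists on this system
cd "$workdir" || exit 1

# Sample data from post #1.
cat > input.txt <<'EOF'
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l
EOF

# Create both intermediate files up front so the later sorts never fail
# when one category happens to be empty.
: > Uniqrecords.raw
: > Duplicates.raw

# Highest column-2 value first, so the first line seen for each key is its
# maximum; every later line with the same key is a duplicate.
sort -t, -k2,2rn input.txt |
awk -F, '{ print > ((seen[$1]++) ? "Duplicates.raw" : "Uniqrecords.raw") }'

# Restore key order in each result file.
sort -t, -k1,1n Uniqrecords.raw > Uniqrecords.txt
sort -t, -k1,1n Duplicates.raw  > Duplicates.txt

echo "Uniqrecords.txt:"; cat Uniqrecords.txt
echo "Duplicates.txt:";  cat Duplicates.txt
```

With the sample input, Uniqrecords.txt should contain 1234,1,y / 5678,10,z / 9999,10,k and Duplicates.txt should contain 1234,0,x / 5678,9,l.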


Last edited by Scrutinizer; 10-21-2010 at 06:50 AM.. Reason: Had duplicates and uniqrecords reversed
# 5  
Old 10-21-2010
OMG, this is a nice tricky one!
OK, so in awk, when I see a question about duplicates, I should think of using ++.


... Question: in awk, does "" + 1 = 0 ????

Indeed:

Code:
# sort -t, -k2rn infile
5678,10,z
9999,10,k
5678,9,l
1234,1,y
1234,0,x
# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]}'





# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]++}'
0
0
1
0
1
#

# 6  
Old 10-21-2010
Not exactly. Variables are initialized as the empty string, which becomes zero when converted to a number.
With A[$1]++, the increment is done after the value has been handed to print. Compare:
Code:
$ sort -t, -k2,2rn input.txt | nawk -F, '{print A[$1]++,A[$1]}'
0 1
0 1
1 2
0 1
1 2

and
Code:
$ sort -t, -k2,2rn input.txt | nawk -F, '{print ++A[$1]}'
1
1
2
1
2
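
The same two points can be checked without any input file. This is just a small sketch; the variable names u, n, m, old and new are made up for illustration:

```shell
out=$(awk 'BEGIN {
    # An unset variable compares equal to both the empty string and zero.
    is_empty = (u == "")
    is_zero  = (u == 0)
    print is_empty, is_zero

    # In arithmetic the empty string counts as 0, so this yields 1, not 0.
    print ("" + 1)

    # Post-increment: the expression yields the old value, then n is bumped.
    old = n++
    print old, n

    # Pre-increment: m is bumped first, and the expression yields the new value.
    new = ++m
    print new, m
}')
printf '%s\n' "$out"
```

This should print four lines: "1 1", "1", "0 1" and "1 1". So "" + 1 is 1; the 0 in the earlier output comes from the post-increment returning the value from before the increment.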


Last edited by Scrutinizer; 10-21-2010 at 07:45 AM..
# 7  
Old 10-21-2010
@Scrutinizer

Damn! I misunderstood (once more...). Thanks for clearing up my fuzzy brain again, dude!

Just tell me when you get tired of my questions and I will just stfu...
