Removing duplicates from delimited file based on 2 columns

08-12-2014


Hi guys, I'm in a bit of a bind. I'm looking to remove duplicates from a pipe-delimited file, but to do so based on 2 columns. Sounds easy enough, but here's the kicker...
Column #1 is a simple ID, which is used to identify the duplicate.
Once duplicates are identified, I need to keep only the one with the latest date, which is in column #4, in mm/dd/yyyy format. Of course, rows that don't have duplicates would remain as-is.
Example input.txt:

Code:

9300617000372|Skittles|Candy|5/1/2013|12
4381472200131|M&Ms|Chocolate|9/20/2013|39
9414789515104|Jif|Peanut Butter|11/8/2013|14
4381472200131|Reese's|Peanut Butter|5/20/2014|61
4381472200131|Reese's|Chocolate|2/20/2014|36

In that scenario, the output would be rows 1, 3, and 4, since rows 2 and 5 are duplicates based on the ID and are older than the one in row 4 based on date.
The other kicker is...
The file I'm doing this with is 400,000 rows. So, I need the method to be extremely efficient and as quick as possible. I can't afford for this to take hours.

One last note: this is running on a Windows machine with GnuWin utils.
I am definitely not enough of an expert to make this work, especially efficiently, so I'm hoping someone can help. Many thanks in advance.

Last edited by Don Cragun; 08-13-2014 at 12:29 AM.. Reason: Remove FONT tags; add CODE and ICODE tags.

kevinprood


08-13-2014


Code:

awk -F '|' '{split($4, d, "/"); dt=d[3] d[1] d[2]
  if($1 in a) {
    if(b[$1] < dt) {b[$1] = dt; a[$1] = $0}}
  else {a[$1] = $0; b[$1] = dt}}
END {for(x in a) {print a[x]}}' file

SriniShoo


08-13-2014


Quote:

Originally Posted by SriniShoo

Code:

awk -F '|' '{split($4, d, "/"); dt=d[3] d[1] d[2]
  if($1 in a) {
    if(b[$1] < dt) {b[$1] = dt; a[$1] = $0}}
  else {a[$1] = $0; b[$1] = dt}}
END {for(x in a) {print a[x]}}' file

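Although this looks like it will work for the small sample given, it won't work reliably with the date format being used, which has no leading zeros on the month and day fields. For example, 9/20/2013 becomes 2013920 while 11/8/2013 becomes 2013118, so the September date would compare as later than the November one. I think you need to change: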

Code:

dt=d[3] d[1] d[2]

to:

Code:

dt=sprintf("%d%02d%02d", d[3], d[1], d[2])
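Putting the pieces together, the whole thing might look like the sketch below. The `dedupe_latest` shell function and the array names `best` and `date` are just names picked for illustration; the logic is the script quoted above with the zero-padded date key swapped in. The trailing `+ 0` is an extra precaution, not part of the change suggested above, that forces a numeric comparison of the keys. It assumes every row has a valid month/day/year in column 4.

```shell
# Keep, for each ID in column 1, only the row with the latest date in column 4.
# dedupe_latest is a hypothetical wrapper name; pass the input file as $1.
dedupe_latest() {
  awk -F '|' '
  {
    split($4, d, "/")                                  # d[1]=month, d[2]=day, d[3]=year
    dt = sprintf("%d%02d%02d", d[3], d[1], d[2]) + 0   # yyyymmdd as a number
    if (!($1 in best) || dt > date[$1]) {              # first time seen, or a newer date
      best[$1] = $0
      date[$1] = dt
    }
  }
  END { for (id in best) print best[id] }              # output order is unspecified
  ' "$1"
}
```

Run it as `dedupe_latest input.txt > output.txt`. The file is read once and only one row per distinct ID is held in memory, so 400,000 rows should not be a problem. The output rows come out in no particular order, so pipe the result through `sort` if order matters. I have not tried this with GnuWin; if the single-quoted program gives trouble at a Windows command prompt, putting the awk code in its own file and running it with `awk -F "|" -f` is a possible workaround.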

Don Cragun
