Removing duplicate rows & selecting only latest date

06-06-2011

Gurus,

I need to remove duplicate rows from a file based on the first column. When a key appears more than once, I need to keep the row with the latest date in the 13th column.

Ex:

Input File:

Quote:

001519831030101654000||||||||||||||||||||||||||20090609|20090609|20090609
001519831030101654999||||||||||||||||||||||||||20090609|20090609|20090609
0015198310301016542R1|0015|001519831030101654||2|2||F|GBP|20050905||20151003|20091103|||0|1,000000|1,000000||5,340000|5,340000|||0,000000|||20090609|20090609|20090609
0015198310301016542R1|0015|001519831030101654||2|2||F|GBP|20050905||20151003|20151103|||0|1,000000|1,000000||5,340000|5,340000|||0,000000|||20090609|20090609|20090609
0015198310301016542R1|0015|001519831030101654||2|2||F|GBP|20050905||20151003||||0|1,000000|1,000000||5,340000|5,340000|||0,000000|||20090609|20090609|20090609
0015198310301016543E1|0015|001519831030101654||2|2||V|GBP|20040923||20170903||||0|1,000000|1,000000||1,500000|1,500000|||0,000000|||20090609|20090609|20090609
0015198310301016543E1|0015|001519831030101654||2|2||V|GBP|20040923||20170903||||0|1,000000|1,000000||1,500000|1,500000|||0,000000|||20090609|20090609|20090609

Output File:

Quote:

I know how to remove the duplicates, but I can't figure out how to select the latest date from the 13th column.

Can you please help me?

Thanks
Shash

shash

06-06-2011

A few rows don't have a value in the 13th column. What should happen with those?

kumaran_5555

06-06-2011

Try this..

Code:

 
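# Column 13 is $F[12] (zero-based). Keep the row with the greatest date per key; output order is arbitrary.
perl -F'\|' -lane 'if (!exists $row{$F[0]} || $F[12] gt $date{$F[0]}) { $row{$F[0]} = $_; $date{$F[0]} = $F[12] } END { print for values %row }' input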

getmmg

06-06-2011

If the column is present in every row, first extract it with awk, for example awk -F'|' '{print $13}', then sort the values and pick the latest one with tail -1 or head -1, depending on the sort order.
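
A minimal sketch of how the sort-then-pick idea could be applied per key rather than to the column alone. The function name latest_by_sort is made up for illustration, and the sketch assumes a | delimiter, the key in column 1, and a YYYYMMDD date in column 13, so that string order matches date order.

```shell
# Hypothetical helper: keep, for each key in column 1, the row whose
# column 13 (YYYYMMDD) sorts highest. Within a key, rows with an empty
# column 13 sort last.
latest_by_sort() {
    # -t'|'    : field separator
    # -k1,1    : primary key, column 1 ascending
    # -k13,13r : secondary key, column 13 descending (latest first)
    LC_ALL=C sort -t'|' -k1,1 -k13,13r "$1" |
        awk -F'|' '!seen[$1]++'   # print only the first row per key
}
```

Called as latest_by_sort input, this prints one row per key, ordered by key. Sorting costs more than a single awk pass, but the awk part stays trivial.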

subodh.thakar

06-06-2011

Let's say a file has rows with two columns and is sorted, so that rows with the same value in column 1 are adjacent. The command below looks for duplicates based on column 1. If a key has no duplicates, its row is kept as is. If duplicates are found, column 2 is compared and the row with the largest value in column 2 is kept.

Quote:

inputfile.txt

001|20090609
001|20090609
001|20080609
0015|20090609
00151|20090609
00151|20080609

awk -F"|" '{if (!($1 in a)) {a[$1]=$2; b[++i]=$0} else if ($2 > a[$1]) {a[$1]=$2; b[i]=$0}} END {for (j=1; j<=i; j++) print b[j]}' inputfile.txt

output:
001|20090609
0015|20090609
00151|20090609

Try this on your file after changing the column numbers.
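
For the file in the original question, a variant along these lines might work. It is only a sketch: the function name latest_by_key is hypothetical, and it assumes a | delimiter, the key in column 1, and a YYYYMMDD date (possibly empty) in column 13. Unlike the one-liner above, it does not need the input to be sorted, because each key's best row is stored under the key itself.

```shell
# Hypothetical helper: one pass, no sorting required.
# Keeps, for each key in column 1, the row with the greatest column 13,
# and prints keys in the order they first appeared.
latest_by_key() {
    awk -F'|' '
        !($1 in best) {            # first time this key is seen
            order[++n] = $1        # remember first-seen order
            best[$1] = $13 ""      # store the date as a string
            row[$1]  = $0
            next
        }
        ($13 "") > best[$1] {      # string comparison; empty sorts lowest
            best[$1] = $13 ""
            row[$1]  = $0
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}
```

Called as latest_by_key input, rows with an empty 13th column are kept only when no row with the same key carries a date. When two rows tie on the date, the first one read is kept.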

Sheel

06-08-2011

Thanks Sheel!

If I use your code as is, I can get rid of the duplicates; however, it does not select the latest date. Can you please let me know what change is required to select the latest date in the 13th column?

Thanks
Shash

---------- Post updated at 10:34 AM ---------- Previous update was at 10:11 AM ----------

Just figured it out. Thanks a lot for the help.

Last edited by shash; 06-08-2011 at 12:20 PM.. Reason: Got the code to work a bit

shash
