Removing Duplicate Rows in a file
# 1  
Old 08-21-2014
Removing Duplicate Rows in a file

Hello

I have a file with contents like this...

Part1 Field2 Field3 Field4 (line1)
Part2 Field2 Field3 Field4 (line2)
Part3 Field2 Field3 Field4 (line3)
Part1 Field2 Field3 Field4 (line4)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)
...

Lines are appended throughout the day by various programs, so the file is ordered by timestamp. At the end of the day I want to remove the older entries, since they are superseded by newer ones. In the example above I want to get rid of lines 1, 2 and 4, because more recent rows exist for those Parts. Any empty rows left behind by the deletion should be removed as well, leaving:

Part3 Field2 Field3 Field4 (line3)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)

Any help will be greatly appreciated.
# 2  
Old 08-21-2014
I assume the (line numbers) were added for demonstration and are not in the real file?
Then it can be done with awk:
Code:
awk '
 {s[$0]=NR}  # remember the last line number of each distinct line
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}  # replay in order of last occurrence
' file

For big files, the END section should sort on the line numbers instead of scanning repeatedly. With perl it becomes
Code:
perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file
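The repeated scan can also be avoided entirely. A minimal sketch of a linear variant (plain awk; the thread's sample data is inlined here purely for illustration): store every row by line number, and in END print a row only when it is the last occurrence of its value:

```shell
# Sample data as in the thread (line markers omitted).
cat > /tmp/parts.txt <<'EOF'
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4
EOF

# last[$0] holds the final line number of each distinct line;
# row[NR] holds every line, so END can replay them in order and
# keep only last occurrences -- one linear pass instead of a nested scan.
awk '
 {last[$0] = NR; row[NR] = $0}
 END {for (i = 1; i <= NR; i++) if (last[row[i]] == i) print row[i]}
' /tmp/parts.txt
```

This trades the nested loop for a second array, so it uses more memory but far less time on big files.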

# 3  
Old 08-21-2014
Yes, the line numbers at the end were added for demonstration purposes.

---------- Post updated at 05:10 PM ---------- Previous update was at 02:52 PM ----------

Quote:
Originally Posted by MadeInGermany
I assume the (line numbers) were added for demonstration and are not in the real file?
Then it can be done with awk:
Code:
awk '
 {s[$0]=NR}  # remember the last line number of each distinct line
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}  # replay in order of last occurrence
' file

For big files, the END section should sort on the line numbers instead of scanning repeatedly. With perl it becomes
Code:
perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file

I tried it, but it just returned the original values.
# 4  
Old 08-21-2014
It works with this file:
Code:
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4

# 5  
Old 08-21-2014
OK, I see it works only when the entire line is duplicated.

Is there a way to check just the first column rather than the entire row?

Thank you so much for sharing your experience and expertise.
# 6  
Old 08-22-2014
Use s[$1] instead of s[$0] in awk.
# 7  
Old 08-22-2014
s[$1] stores only the key (column 1) as the array index, so the rest of the row (or the entire row) must be stored separately. Keyed by line number:
Code:
awk '
 {s[$1]=NR; row[NR]=$0}  # last line number per key; every row stored by line number
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[i]}
' file

Or keyed by column 1:
Code:
awk '
 {s[$1]=NR; row[$1]=$0}  # last line number and latest row, both keyed by column 1
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[j]}
' file

I wonder which one consumes less memory?
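On memory: the second variant stores only one row per distinct key, so it should consume less memory whenever keys repeat often. As a side note, a sketch that needs just one array entry per key (assuming GNU tac is available; BSD systems can use tail -r instead): reverse the file so awk's keep-first-occurrence idiom keeps what was originally the last occurrence per Part, then reverse back:

```shell
# Sample data with duplicate Part keys in column 1 (for illustration).
cat > /tmp/parts2.txt <<'EOF'
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4
EOF

# Reverse the file, keep the first occurrence of each column-1 key
# (i.e. the last one in the original order), then reverse back.
tac /tmp/parts2.txt | awk '!seen[$1]++' | tac
```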
