Removing Duplicate Rows in a file
# 1  
Old 08-21-2014
Removing Duplicate Rows in a file

Hello

I have a file with contents like this...

Part1 Field2 Field3 Field4 (line1)
Part2 Field2 Field3 Field4 (line2)
Part3 Field2 Field3 Field4 (line3)
Part1 Field2 Field3 Field4 (line4)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)
...

Lines are appended throughout the day by various programs, so the file is ordered by timestamp. At the end of the day I want to remove the older entries, since they are superseded by newer ones. In the example above I want to get rid of lines 1, 2 and 4, because more recent rows exist for those Parts. Any empty rows left behind by the deletion should be removed as well, leaving:

Part3 Field2 Field3 Field4 (line3)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)

Any help will be greatly appreciated.
# 2  
Old 08-21-2014
I assume the (line numbers) were added for demonstration and are not in the real file?
Then it can be done with awk:
Code:
awk '
 {s[$0]=NR}  # remember the last line number of each distinct line
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}  # replay in order of last occurrence
' file

For big files, the END section should sort on the line numbers instead of scanning repeatedly. With perl it becomes
Code:
perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file
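The repeated scan can also be avoided entirely. A minimal sketch of a linear variant (plain awk; the thread's sample data is inlined here purely for illustration): store every row by line number, and in END print a row only when it is the last occurrence of its value:

```shell
# Sample data as in the thread (line markers omitted).
cat > /tmp/parts.txt <<'EOF'
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4
EOF

# last[$0] holds the final line number of each distinct line;
# row[NR] holds every line, so END can replay them in order and
# keep only last occurrences -- one linear pass instead of a nested scan.
awk '
 {last[$0] = NR; row[NR] = $0}
 END {for (i = 1; i <= NR; i++) if (last[row[i]] == i) print row[i]}
' /tmp/parts.txt
```

This trades the nested loop for a second array, so it uses more memory but far less time on big files.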

# 3  
Old 08-21-2014
Yes, the line numbers at the end were added for demonstration purposes.

---------- Post updated at 05:10 PM ---------- Previous update was at 02:52 PM ----------

Quote:
Originally Posted by MadeInGermany
I assume the (line numbers) were added for demonstration and are not in the real file?
Then it can be done with awk:
Code:
awk '
 {s[$0]=NR}  # remember the last line number of each distinct line
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}  # replay in order of last occurrence
' file

For big files, the END section should sort on the line numbers instead of scanning repeatedly. With perl it becomes
Code:
perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file

I tried it, but it just returned the original values.
# 4  
Old 08-21-2014
It works with this file:
Code:
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4

# 5  
Old 08-21-2014
OK, I see it works only when the entire line is duplicated.

Is there a way to check just the first column rather than the entire row?

Thank you so much for sharing your experience and expertise.
# 6  
Old 08-22-2014
Use s[$1] instead of s[$0] in awk.
# 7  
Old 08-22-2014
s[$1] stores only the key (column 1) as the array index, so the rest of the row (or the entire row) must be stored separately. Keyed by line number:
Code:
awk '
 {s[$1]=NR; row[NR]=$0}  # last line number per key; every row stored by line number
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[i]}
' file

Or keyed by column 1:
Code:
awk '
 {s[$1]=NR; row[$1]=$0}  # last line number and latest row, both keyed by column 1
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[j]}
' file

I wonder which one consumes less memory?
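On memory: the second variant stores only one row per distinct key, so it should consume less memory whenever keys repeat often. As a side note, a sketch that needs just one array entry per key (assuming GNU tac is available; BSD systems can use tail -r instead): reverse the file so awk's keep-first-occurrence idiom keeps what was originally the last occurrence per Part, then reverse back:

```shell
# Sample data with duplicate Part keys in column 1 (for illustration).
cat > /tmp/parts2.txt <<'EOF'
Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4
EOF

# Reverse the file, keep the first occurrence of each column-1 key
# (i.e. the last one in the original order), then reverse back.
tac /tmp/parts2.txt | awk '!seen[$1]++' | tac
```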
