Removing duplicates from a file


 
# 1  
Old 07-23-2013

Hi All,

I am merging files coming from 2 different systems, and while doing that I am getting duplicate entries in the merged file:

Code:
I,01,000131,764,2,4.00
I,01,000131,765,2,4.00
I,01,000131,772,2,4.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00
I,01,000131,764,2,5.00
I,01,000131,765,2,5.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00


I tried sorting it uniquely with the sort -u command:

Code:
sort -k 2,2 -k 3,3 -k 4,4 sample.txt | sort -u

But it is not returning the correct result; it gives output like this:

Code:
I,01,000131,764,2,4.00
I,01,000131,764,2,5.00
I,01,000131,765,2,4.00
I,01,000131,765,2,5.00
I,01,000131,772,2,4.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00

Is there a way I can sort the file uniquely, or remove duplicates, using the 2nd, 3rd, and 4th columns as key fields?

Thanks
Sri

# 2  
Old 07-23-2013
You need to specify a comma as the field separator: -t,
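For example, applied to the command from post #1 (a sketch; note that -u is folded into the same invocation here, so that it sees the comma separator and the keys):

Code:
sort -t, -u -k 2,2 -k 3,3 -k 4,4 sample.txt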
# 3  
Old 07-23-2013
Even after that, I am getting the same output:

Code:
-bash-3.2$ sort -t, -k 2,2 -k 3,3 -k 4,4 sample.txt |sort -u

Code:
I,01,000131,764,2,4.00
I,01,000131,764,2,5.00
I,01,000131,765,2,4.00
I,01,000131,765,2,5.00
I,01,000131,772,2,4.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00


# 4  
Old 07-23-2013
That is because the second sort in the pipeline is using the default field separator and has no key specification, so its -u compares entire lines; lines that differ only in the last field are therefore not treated as duplicates. Is this what you are looking for?
Code:
sort -u -t, -k 2,4 sample.txt

Here -k 2,4 builds a single sort key running from field 2 through field 4, and -u keeps one line per distinct key. Otherwise, what is your expected output?
# 5  
Old 07-23-2013
Thanks,

It's working. But just out of curiosity, is there any other way of doing it, comparing just the 2nd, 3rd, and 4th fields as a key to find duplicates in a file? It would be great if there is one in Unix.
# 6  
Old 07-24-2013
There are lots of ways to do this. If you want the output sorted by the key you're using to determine duplication, sort -u is the most logical choice. If you want to give preference to the 1st entry found with a given key and have output order match input order, awk provides a simple way to do it:
Code:
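# !a[$2,$3,$4]++ is true only the first time a given field-2,3,4 key is seen,
# so each key's first line is printed and later duplicates are suppressed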
awk -F, '!a[$2,$3,$4]++' input1 input2 > mergeNoDup
awk -F, '!a[$2,$3,$4]++' mergeWithDups > mergeNoDup

Use the 1st line if you still have the separate files from server 1 and server 2; use the 2nd line if you have already created a merged file and want to remove duplicates from it. Both will work with any number of input files, as long as you don't hit the ARG_MAX limit on the argument list you feed to awk.

As always, if you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk rather than /bin/awk or /usr/bin/awk.

Other ways to do this include writing a program in C (or another high-level language), using perl, using the associative arrays that are available in some shells, and an endless number of much less efficient combinations of read (to get a list of keys), grep (to get a count of lines containing a key), and an editor (to remove all but one occurrence of duplicated keys).
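For instance, here is a minimal sketch of the shell associative-array approach (it assumes bash 4 or later, so the bash 3.2 session shown in post #3 would need the awk or sort solutions instead). Like the awk one-liner, it keeps the first line seen for each field-2,3,4 key and preserves input order:

Code:
#!/bin/bash
# Deduplicate on fields 2-4 of a comma-separated file (requires bash 4+).
declare -A seen
while IFS= read -r line; do
    # Split the line on commas; f2..f4 are the key fields, rest holds the remainder.
    IFS=, read -r f1 f2 f3 f4 rest <<< "$line"
    key=$f2,$f3,$f4
    if [[ -z ${seen[$key]} ]]; then
        seen[$key]=1            # remember the key...
        printf '%s\n' "$line"   # ...and print only its first occurrence
    fi
done < mergeWithDups > mergeNoDup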
 