Removing duplicates from a file | Unix Linux Forums | UNIX for Dummies Questions & Answers

#1  07-23-2013, Sri3001 (Registered User)

Removing duplicates from a file

Hi All,

I am merging files coming from two different systems, and while doing that I am getting duplicate entries in the merged file:


Code:
I,01,000131,764,2,4.00
I,01,000131,765,2,4.00
I,01,000131,772,2,4.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00
I,01,000131,764,2,5.00
I,01,000131,765,2,5.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00


I tried using the sort -u command to sort it uniquely, using


Code:
sort -k 2,2 -k 3,3 -k 4,4 sample.txt | sort -u

but it is not returning the correct result; it gives output like


Code:
I,01,000131,764,2,4.00
I,01,000131,764,2,5.00
I,01,000131,765,2,4.00
I,01,000131,765,2,5.00
I,01,000131,772,2,4.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00

Is there a way I can sort the file uniquely, or remove duplicates, using the 2nd, 3rd and 4th columns as the key fields?

Thanks
Sri

#2  07-23-2013, Scrutinizer (Moderator)
You need to specify a comma as the field separator: -t,
#3  07-23-2013, Sri3001 (Registered User)
Even after that I am getting the same output:


Code:
-bash-3.2$ sort -t, -k 2,2 -k 3,3 -k 4,4 sample.txt |sort -u


Code:
I,01,000131,764,2,4.00
I,01,000131,764,2,5.00
I,01,000131,765,2,4.00
I,01,000131,765,2,5.00
I,01,000131,772,2,4.00
I,01,000131,772,2,6.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00
I,01,000622,761,6,14.64
I,01,000622,762,6,14.64
I,01,000622,763,6,14.64
I,01,000684,767,2,10.00


#4  07-23-2013, Scrutinizer (Moderator)
That is because the second sort in the pipeline is given no separator or key specification, so its -u compares entire lines, and lines that differ only in the last field are all kept. Is this what you are looking for?

Code:
sort -u -t, -k 2,4 sample.txt

Otherwise, what is your expected output?
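To see the effect on a couple of the duplicated lines from post #1, the single-command version can be sketched like this (sample.txt is a hypothetical scratch file):

Code:
```shell
# A few of the lines from post #1; two of them share the
# field-2/3/4 key 01,000131,764 and differ only in the price.
cat > sample.txt <<'EOF'
I,01,000131,764,2,4.00
I,01,000131,765,2,4.00
I,01,000131,764,2,5.00
I,01,000168,762,2,2.00
EOF

# One sort pass: comma separator, a single key spanning fields 2-4.
# The two lines with the same key collapse to one; which of the two
# survives (4.00 or 5.00) is implementation-defined.
sort -u -t, -k2,4 sample.txt
```

The point is that -t, and the key range are given to the same sort invocation that carries -u, so uniqueness is judged on the key, not on the whole line.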
#5  07-23-2013, Sri3001 (Registered User)
Thanks,

It's working... But just out of curiosity, is there any other way of doing it, comparing just the 2nd, 3rd and 4th fields as the key to find duplicates in a file? It would be great if there is one in Unix.
#6  07-23-2013, Don Cragun (Moderator)
There are lots of ways to do this. If you want the output sorted by the key you're using to determine duplication, sort -u is the most logical choice. If you want to give preference to the 1st entry found with a given key and have output order match input order, awk provides a simple way to do it:

Code:
awk -F, '!a[$2,$3,$4]++' input1 input2 > mergeNoDup
awk -F, '!a[$2,$3,$4]++' mergeWithDups > mergeNoDup

Use the 1st line if you have your separate files from server 1 and server 2; use the 2nd line if you have already created a merged file and want to remove duplicates from the merged file. Both of these will work with an unlimited number of input files as long as you don't reach ARG_MAX limitations on the number of input files you're feeding into awk.
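On a small subset of the data from post #1 (merged.txt is a hypothetical scratch file), the first-occurrence behavior can be seen directly:

Code:
```shell
# Sample lines; the duplicate key 01,000131,764 differs only in the price.
cat > merged.txt <<'EOF'
I,01,000131,764,2,4.00
I,01,000131,765,2,4.00
I,01,000131,764,2,5.00
I,01,000168,762,2,2.00
EOF

# Keep only the first line seen for each (field2, field3, field4) key;
# input order is preserved, so the later 5.00 duplicate is dropped.
awk -F, '!a[$2,$3,$4]++' merged.txt
# -> I,01,000131,764,2,4.00
#    I,01,000131,765,2,4.00
#    I,01,000168,762,2,2.00
```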

As always, if you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk rather than /bin/awk or /usr/bin/awk.

Other ways to do this include writing a program in C (or another high-level language) or perl, using the associative arrays that are available in some shells, and any number of much less efficient combinations of read (to get a list of keys), grep (to count lines containing a key), and an editor (to remove all but one occurrence of duplicated keys).
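As an illustration of the shell associative-array approach (a sketch only: it assumes bash 4+, the file name dedup.sh is made up, and it is far slower than awk on large files):

Code:
```shell
# dedup.sh: first-occurrence-wins filter keyed on fields 2-4,
# using a bash 4+ associative array.
cat > dedup.sh <<'EOF'
#!/bin/bash
declare -A seen
while IFS= read -r line; do
    IFS=, read -r _ f2 f3 f4 _ <<< "$line"   # pull out the key fields
    key=$f2,$f3,$f4
    if [[ -z ${seen[$key]} ]]; then          # key not seen yet?
        seen[$key]=1
        printf '%s\n' "$line"                # keep the first occurrence
    fi
done
EOF

# Try it on a few of the lines from post #1:
printf '%s\n' \
    'I,01,000131,764,2,4.00' \
    'I,01,000131,765,2,4.00' \
    'I,01,000131,764,2,5.00' \
    'I,01,000168,762,2,2.00' | bash dedup.sh
# -> I,01,000131,764,2,4.00
#    I,01,000131,765,2,4.00
#    I,01,000168,762,2,2.00
```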