The UNIX and Linux Forums  

  #1  
Old 06-23-2005
Registered User
 

Join Date: Apr 2005
Posts: 51
fastest way to remove duplicates.

I have searched the FAQ using terms such as sort, duplicates, etc., but I didn't find any articles or results on this.

Currently, I am using:
sort -u file1 > file2 to remove duplicates. For a file of approximately 1 gigabyte, removing the duplicates takes 1 hr 21 min.

Is there a faster way to remove duplicates? Our files could grow to 10 to 12 gigabytes.

Appreciate any pointers.
Thanks,
Radhika.
  #2  
Old 06-23-2005
vino
Supporter (in vino veritas)
 

Join Date: Feb 2005
Location: Bangalore, India
Posts: 2,698
Just a thought.

Why not use a divide-and-conquer approach?

Vino

Last edited by vino; 06-24-2005 at 12:46 AM.
  #3  
Old 06-24-2005
Registered User
 

Join Date: Jun 2005
Location: Ireland
Posts: 61
That's about 200KB/s. Pretty crap.
I presume you're thrashing swap?

One thing to check: if you don't need multibyte sorting,
then prefix the sort command with LANG=C (or LC_ALL=C, which overrides the other locale settings).

Sounds like you need a database (indexes) to be honest.

If the output is a small % of the input, then
explicitly partitioning the input would be beneficial,
i.e. run sort -u on each chunk, then pipe the combined results through a final sort -u.
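
A rough sketch of that partitioning idea, in case it helps. The function name `dedupe_chunked`, the default chunk size and the scratch directory are assumptions for illustration, not something from this thread. Note that sort already spills to temporary files on its own, so whether this wins anything depends on how many duplicates each chunk removes.

```shell
# Sketch: de-duplicate a large file chunk by chunk, then merge.
# Usage: dedupe_chunked INPUT OUTPUT [LINES_PER_CHUNK]
dedupe_chunked() {
    input=$1
    output=$2
    lines=${3:-1000000}               # assumed default, tune to available RAM

    # Empty input: nothing to split, just create an empty output.
    [ -s "$input" ] || { : > "$output"; return 0; }

    workdir=$(mktemp -d) || return 1  # scratch space for the chunks

    # 1. Split the input into pieces (chunk.aa, chunk.ab, ...).
    split -l "$lines" "$input" "$workdir/chunk."

    # 2. Sort and de-duplicate each piece in place (byte-wise comparison).
    for chunk in "$workdir"/chunk.*; do
        LC_ALL=C sort -u -o "$chunk" "$chunk"
    done

    # 3. Merge the already-sorted pieces, dropping duplicates across pieces.
    LC_ALL=C sort -m -u -o "$output" "$workdir"/chunk.*

    rm -rf "$workdir"
}
```

For the files in the first post this would be called as `dedupe_chunked file1 file2`. The result comes out sorted, the same as with plain sort -u.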
  #4  
Old 06-24-2005
Registered User
 

Join Date: Jun 2005
Location: Bangalore , INDIA
Posts: 28
Cool

Try out this one...

sed '$!N; /^\(.*\)\n\1$/!P; D'

# Only the first line of each set of duplicates is kept; the rest are deleted.

I have tested this with a file of around 1 GB.

It took about 13 minutes to process that file, much faster than the sort command.


Last edited by amit_sapre; 06-24-2005 at 06:53 AM.
  #5  
Old 06-24-2005
vino
Supporter (in vino veritas)
 

Join Date: Feb 2005
Location: Bangalore, India
Posts: 2,698
Quote:
Originally Posted by amit_sapre
Try out this one...

sed '$!N; /^\(.*\)\n\1$/!P; D'

# Only the first line of each set of duplicates is kept; the rest are deleted.

Hope this will work faster than sort command.

I haven't tried on large files.
Haven't tried your sed. But doesn't it assume that all the entries are already sorted before it removes the duplicates?

and/or

If the file is unsorted, then only duplicate entries based on the first line are removed, since sed makes just one pass through the file.

Or did I get it wrong ?

vino
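
The sorted-or-not question is easy to check on a tiny input. A minimal sketch follows; the wrapper name `dedupe_adjacent` and the three-line sample are made up for the test, while the sed command is the one quoted above.

```shell
# Wrap the sed one-liner from the thread so it can be reused on stdin.
dedupe_adjacent() {
    sed '$!N; /^\(.*\)\n\1$/!P; D'
}

# Unsorted input: the two "a" lines are not next to each other.
unsorted_result=$(printf 'a\nb\na\n' | dedupe_adjacent)

# Same input, sorted first, so the duplicates become adjacent.
sorted_result=$(printf 'a\nb\na\n' | LC_ALL=C sort | dedupe_adjacent)

printf 'unsorted input:\n%s\n' "$unsorted_result"   # a, b, a (duplicate survives)
printf 'sorted input:\n%s\n' "$sorted_result"       # a, b
```

On this sample only neighbouring duplicates are dropped, which is how uniq behaves too, so the input would still need sorting first if the duplicates can be anywhere in the file.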
  #6  
Old 06-24-2005
Registered User
 

Join Date: Jun 2005
Location: Bangalore , INDIA
Posts: 28
Hi Vino,

This command keeps the first entry as it is and deletes the other entries, irrespective of whether the file is sorted or not.

It makes no prior assumptions about the input.
  #7  
Old 06-24-2005
Registered User
 

Join Date: Apr 2005
Posts: 51
Hi Amit,


>>
sed '$!N; /^\(.*\)\n\1$/!P; D'

Could you explain the command bit by bit, if you don't mind?

Thanks!
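
For reference, here is one way to read the command piece by piece, written as the same sed script spread over several lines with comments. The function name `uniq_sed` is only a label for this sketch, and the reading follows the standard meaning of the N, P and D commands rather than anything confirmed in this thread.

```shell
uniq_sed() {
    sed '
        # $!N : unless this is the last line ($!), append the next input
        #       line to the pattern space, separated by a newline (N).
        $!N
        # /^\(.*\)\n\1$/ : matches when the pattern space holds two
        #       identical lines; \(.*\) captures the first line and \1
        #       must then equal the second.
        # !P  : when that does NOT match, print the pattern space up to
        #       the first newline, i.e. the first of the two lines.
        /^\(.*\)\n\1$/!P
        # D   : delete up to and including the first newline, then restart
        #       the script on what is left without reading a new line.
        #       On the last line there is no newline, so the whole pattern
        #       space is dropped after P has printed it.
        D
    '
}

# Example: prints x, y, x (one line each).
printf 'x\nx\nx\ny\nx\n' | uniq_sed
```

So the script slides a two-line window over the input and prints a line only when the line after it differs, which is why duplicates that are not adjacent (the final x above) are kept.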
All times are GMT -7. The time now is 09:32 PM.

