The UNIX and Linux Forums  


#1 | 06-23-2005 | radhika
fastest way to remove duplicates.

I have searched the FAQ for sort, duplicates, etc., but I didn't find any articles or results on this.

Currently, I am using:
sort -u file1 > file2
to remove duplicates. For a file of approximately 1 gigabyte, removing duplicates takes 1 hr 21 mins.

Is there a faster way to remove duplicates? Our files could grow to 10 to 12 gigabytes.

Appreciate any pointers.
Thanks,
Radhika.
#2 | 06-24-2005 | vino
Just a thought.

Why not use a divide-and-conquer approach?

Vino

Last edited by vino; 06-24-2005 at 04:46 AM..
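
One way the divide-and-conquer idea could look in practice is sketched below: split the input, de-duplicate each piece with sort -u, then merge the sorted pieces with sort -m -u. The file names and the tiny chunk size are hypothetical and chosen only so the example runs quickly. GNU sort already performs an external merge with temporary files on its own, so whether manual chunking helps is something to measure rather than assume.

```shell
#!/bin/sh
# Sketch only: input.txt, chunk_* and output.txt are hypothetical names.
workdir=$(mktemp -d)
printf '%s\n' pear apple pear fig apple kiwi fig > "$workdir/input.txt"

# 1. Split the input into pieces of 3 lines each (use a far larger count in practice).
split -l 3 "$workdir/input.txt" "$workdir/chunk_"

# 2. Sort and de-duplicate each piece independently.
#    The glob is expanded once, before the loop creates any *.sorted files.
for f in "$workdir"/chunk_*; do
    sort -u "$f" > "$f.sorted"
done

# 3. Merge the already-sorted pieces: -m merges without re-sorting,
#    -u drops duplicates that appeared in different pieces.
sort -m -u "$workdir"/chunk_*.sorted > "$workdir/output.txt"

cat "$workdir/output.txt"
```

Step 2 is the part that could be run in parallel or on separate disks, which is where any gain over a single sort -u would have to come from.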
#3 | 06-24-2005 | pixelbeat
That's about 200KB/s. Pretty crap.
I presume you're thrashing swap?

One thing to check: if you don't need multibyte sorting,
then prefix the sort command with LANG=C.

Sounds like you need a database (indexes) to be honest.

If the output is a small % of the input, then
explicitly partitioning the input would be beneficial.
I.e.: run sort -u on each chunk, then pipe the combined results through a final sort -u.
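
A minimal sketch of the locale suggestion, assuming byte-wise ordering is acceptable for the data. It sets LC_ALL rather than LANG because LC_ALL takes precedence when both are set in the environment; the sample data is hypothetical and file1/file2 simply mirror the names in the original question.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' b a B a b > "$tmp/file1"

# Force the C locale for this one command so lines are compared byte by byte.
LC_ALL=C sort -u "$tmp/file1" > "$tmp/file2"

cat "$tmp/file2"   # prints B, a, b: in byte order, uppercase sorts before lowercase
```

Note that the resulting order can differ from what a language-aware locale would produce, so this only fits when the goal is de-duplication and not a human-friendly ordering.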
#4 | 06-24-2005 | amit_sapre

Try out this one...

sed '$!N; /^\(.*\)\n\1$/!P; D'

# Only the first of the duplicate lines is kept; the rest are deleted.

I have tested this with a file of around 1 GB.

It took about 13 min to process that file, much faster than the sort command.


Last edited by amit_sapre; 06-24-2005 at 10:53 AM..
#5 | 06-24-2005 | vino
Quote:
Originally Posted by amit_sapre
> Try out this one...
>
> sed '$!N; /^\(.*\)\n\1$/!P; D'
>
> # The first line of duplicate ones is only kept and rest are deleted.
>
> Hope this will work faster than sort command.
>
> I haven't tried on large files.

Haven't tried your sed. But doesn't it assume that all the entries are already sorted, and then remove the duplicates?

and/or

If the file is unsorted, then only duplicate entries based on the first line are removed, since sed makes just one pass through the file.

Or did I get it wrong?

vino
#6 | 06-24-2005 | amit_sapre
Hi Vino,

This command keeps the first entry as it is and deletes the other entries, irrespective of whether the file is sorted or not.

It makes no prior assumptions about the input.
#7 | 06-24-2005 | radhika
Hi Amit,


>> sed '$!N; /^\(.*\)\n\1$/!P; D'

Could you explain the command bit by bit, if you don't mind?

Thanks!
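
The thread closes without an answer to this, so here is a hedged reading of the command based on standard sed semantics, not on anything the posters confirmed. `$!N` appends the next input line to the pattern space unless the current line is the last one. `/^\(.*\)\n\1$/!P` prints the first line of the pattern space only when the two lines held there are not identical. `D` deletes up to the first newline and restarts the cycle with what remains, without reading new input. Since only two neighbouring lines are ever compared, this behaves like uniq: it removes adjacent duplicates, so an unsorted file can keep repeats. The sample data below is hypothetical and exists only to show that.

```shell
#!/bin/sh
# Hypothetical sample: "a" and "b" each appear in two separate places.
tmp=$(mktemp -d)
printf '%s\n' a a b a b b > "$tmp/in.txt"

# The sed command from the thread, unchanged.
sed_out=$(sed '$!N; /^\(.*\)\n\1$/!P; D' "$tmp/in.txt")

# For comparison: uniq (adjacent duplicates only) and sort -u (all duplicates).
uniq_out=$(uniq "$tmp/in.txt")
sort_out=$(LC_ALL=C sort -u "$tmp/in.txt")

printf 'sed:\n%s\n' "$sed_out"       # a, b, a, b: the later "a" and "b" survive
printf 'sort -u:\n%s\n' "$sort_out"  # a, b
```

On this input the sed output matches uniq and not sort -u, which suggests the 13-minute timing earlier in the thread measures a different operation from the original sort -u run unless the file was already sorted.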