fastest way to remove duplicates.
I have searched the FAQ using sort, duplicates, etc., but I didn't find any articles or results on it.
Currently, I am using: sort -u file1 > file2 to remove duplicates. For a file of approximately 1 gigabyte, it takes 1 hr 21 min to remove the duplicates. Is there a faster way to remove duplicates? Our file sizes could reach 10 to 12 gigabytes. I'd appreciate any pointers. Thanks, Radhika.
That's about 200 KB/s. Pretty crap.
I presume you're thrashing swap? One thing to check: if you don't need multibyte sorting, prefix the sort command with LANG=C. Sounds like you need a database (indexes), to be honest. If the output is a small percentage of the input, then explicitly partitioning the input would be beneficial, i.e. run sort -u on each chunk in a loop and pipe the combined output through a final sort -u.
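For anyone wanting to try the partitioning idea, here is a rough sketch. It assumes a POSIX-ish shell with split, mktemp -d and a sort that supports -m (merge) and -u. The function name dedup_chunked, the default chunk size of 1000000 lines and the temp-file layout are made up for illustration, and LC_ALL=C is used instead of LANG=C because LC_ALL overrides LANG if both are set.

```shell
#!/bin/sh
# Sketch only: de-duplicate a big file by sorting chunks separately,
# then merging the sorted chunks. Names and sizes are assumptions.
dedup_chunked() {
    in=$1; out=$2; lines=${3:-1000000}

    # Nothing to do for an empty input file.
    [ -s "$in" ] || { : > "$out"; return 0; }

    tmp=$(mktemp -d) || return 1

    # Cut the input into pieces of at most $lines lines each.
    split -l "$lines" "$in" "$tmp/chunk."

    # Sort and de-duplicate each piece using byte-wise (C locale) comparison.
    for c in "$tmp"/chunk.*; do
        LC_ALL=C sort -u "$c" > "$c.sorted" && rm -f "$c"
    done

    # Merge the already-sorted pieces, dropping duplicates across pieces.
    LC_ALL=C sort -m -u "$tmp"/chunk.*.sorted > "$out"
    rm -rf "$tmp"
}
```

Usage would be something like dedup_chunked file1 file2. Whether this beats a single LC_ALL=C sort -u depends on your sort implementation, available memory and how many duplicates there are, so time both on a sample first.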
Try this one...
sed '$!N; /^\(.*\)\n\1$/!P; D' # Only the first line of each run of duplicates is kept; the rest are deleted.
Note that this compares adjacent lines only, so duplicate lines must already be next to each other (e.g. in sorted input). I have tested this with a file of around 1 GB; it took about 13 min to process that file. Much, much faster than the sort command.
Last edited by amit_sapre; 06-24-2005 at 10:53 AM.
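A small check of what that sed one-liner does on unsorted versus sorted input might help here. This is just a sketch; the wrapper name dedup_adjacent and the sample data are invented for the demonstration.

```shell
#!/bin/sh
# Wrap the sed one-liner from the post; reads stdin or the named files.
# dedup_adjacent is a hypothetical name used only for this demo.
dedup_adjacent() {
    sed '$!N; /^\(.*\)\n\1$/!P; D' "$@"
}

# Unsorted input: the last "a" is not adjacent to the first two,
# so it survives.
printf 'a\na\nb\na\n' | dedup_adjacent

echo '--'

# Sorted first: all copies of a line are adjacent, so each line
# appears once in the output.
printf 'a\na\nb\na\n' | LC_ALL=C sort | dedup_adjacent
```

So on its own the sed command behaves like uniq rather than like sort -u. It is only a drop-in replacement if the file is already sorted or the duplicates are known to be consecutive.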