fastest way to remove duplicates.
I have searched the FAQ using terms such as sort, duplicates, etc., but I didn't find any articles or results on it.
Currently, I am using:
sort -u file1 > file2
to remove duplicates. For a file of approximately 1 gigabyte, removing duplicates takes 1 hr 21 min. Is there a faster way to remove duplicates? Our files could grow to 10 to 12 gigabytes. I'd appreciate any pointers.
Thanks,
Radhika.
That's about 200KB/s. Pretty crap.
I presume you're thrashing swap? One thing to check: if you don't need multibyte sorting, prefix the sort command with LANG=C.
Sounds like you need a database (indexes), to be honest.
If the output is a small percentage of the input, explicitly partitioning the input would help, i.e. run sort -u on each chunk, then pipe the combined results through a final sort -u.
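The partitioning idea above could look roughly like the sketch below. This is only one possible reading of the suggestion, not the poster's own script: the function name dedupe_chunked, the chunk size, and the temporary directory layout are all assumptions made for illustration. It uses LC_ALL=C rather than LANG=C because LC_ALL overrides the other locale variables, and it merges the sorted chunks with sort -m -u instead of re-sorting them.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread):
#   dedupe_chunked INPUT OUTPUT [LINES_PER_CHUNK]
dedupe_chunked() {
    input=$1
    output=$2
    lines=${3:-1000000}   # assumed chunk size; tune to the available RAM

    # Nothing to split for an empty or missing input: write an empty result.
    [ -s "$input" ] || { : > "$output"; return 0; }

    tmpdir=$(mktemp -d) || return 1

    # Split the input into pieces small enough to sort in memory.
    split -l "$lines" "$input" "$tmpdir/chunk."

    # Sort and de-duplicate each piece in place, comparing bytes (C locale).
    for chunk in "$tmpdir"/chunk.*; do
        LC_ALL=C sort -u -o "$chunk" "$chunk"
    done

    # Merge the already-sorted pieces, dropping duplicates across pieces.
    LC_ALL=C sort -m -u -o "$output" "$tmpdir"/chunk.*

    rm -rf "$tmpdir"
}

# Example call, with the file names from the original question:
#   dedupe_chunked file1 file2
```

Whether this beats a single sort -u depends on how much memory and temporary disk space the machine has, so it is worth timing on a sample of the real data first.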
Try out this one...
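sed '$!N; /^\(.*\)\n\1$/!P; D'
# Only one line of each run of consecutive duplicate lines is kept; the rest are deleted.
I have tested this with a file of around 1 GB. It took about 13 minutes to process that file, much faster than the sort command. Note that it only compares adjacent lines, so duplicates that are not next to each other are left in unless the file is already sorted.
Last edited by amit_sapre; 06-24-2005 at 06:53 AM.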
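A small sketch of what the sed one-liner above does on unsorted input. The wrapper names dedupe_adjacent and dedupe_unsorted are made up for this example. The awk variant is not from this thread; it is a commonly used alternative that remembers every line it has seen, so it catches non-adjacent duplicates without sorting, at the cost of memory that grows with the number of distinct lines.

```shell
#!/bin/sh
# Hypothetical wrappers, named here only for illustration.

# The sed command from the post: drops a line only when it is identical
# to the line immediately after it (same effect as uniq).
dedupe_adjacent() {
    sed '$!N; /^\(.*\)\n\1$/!P; D' "$@"
}

# Assumed alternative, not from the thread: print a line only the first
# time it is seen. Keeps the original order; holds each distinct line
# in memory.
dedupe_unsorted() {
    awk '!seen[$0]++' "$@"
}

# Input where the third "a" is not adjacent to the first two.
printf 'a\na\nb\na\n' | dedupe_adjacent   # prints: a, b, a (one per line)
printf 'a\na\nb\na\n' | dedupe_unsorted   # prints: a, b (one per line)
```

For a 10 to 12 gigabyte file with mostly unique lines, the awk approach may not fit in memory, in which case sorting first is still needed.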