| Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts here. |
#1
fastest way to remove duplicates.
I have searched the FAQ for sort, duplicates, etc., but I didn't find any articles or results on this.
Currently, I am using: sort -u file1 > file2 to remove duplicates. For a file of approximately 1 gigabyte, removing duplicates takes 1 hr 21 min. Is there a faster way to remove duplicates? Our files could grow to 10 to 12 gigabytes. I appreciate any pointers.
Thanks, Radhika.
#2
Just a thought.
Why not use a divide-and-conquer approach?
Vino
Last edited by vino; 06-24-2005 at 12:46 AM.
#3
|
|||
|
|||
|
That's about 200 KB/s. Pretty crap.
I presume you're thrashing swap?
One thing to check: if you don't need multibyte sorting, then prefix the sort command with LANG=C.
Sounds like you need a database (indexes), to be honest.
If the output is a small % of the input, then explicitly partitioning the input would be beneficial, i.e. run sort -u on each chunk and pipe the combined results through a final sort -u.
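The two suggestions above (byte-wise collation and explicit partitioning) could be sketched roughly like this. It is only a sketch, not a benchmark: the function name dedupe_chunked, the default chunk size, and the temp-directory layout are made up for illustration. It uses LC_ALL=C rather than LANG=C because LC_ALL overrides any other locale variables, and it assumes a non-empty input file plus split -l, sort -m and mktemp -d being available (mktemp is common but not required by POSIX).

```shell
#!/bin/sh
# dedupe_chunked INPUT OUTPUT [LINES_PER_CHUNK]
# Hypothetical helper: split INPUT into chunks, sort -u each chunk,
# then merge the already-sorted chunks with a final sort -m -u.
dedupe_chunked() {
    input=$1
    output=$2
    lines=${3:-1000000}          # chunk size is an arbitrary assumption
    tmpdir=$(mktemp -d) || return 1

    # 1. Partition the input into fixed-size pieces.
    split -l "$lines" "$input" "$tmpdir/chunk."

    # 2. Sort and de-duplicate each piece with byte-wise collation.
    #    The glob is expanded once, so the new .sorted files are not revisited.
    for chunk in "$tmpdir"/chunk.*; do
        LC_ALL=C sort -u "$chunk" > "$chunk.sorted" && rm -f "$chunk"
    done

    # 3. Merge the sorted pieces; -m avoids a full re-sort and -u drops
    #    duplicates that ended up in different chunks.
    LC_ALL=C sort -m -u "$tmpdir"/chunk.*.sorted > "$output"
    status=$?
    rm -rf "$tmpdir"
    return $status
}
```

It would be called as dedupe_chunked file1 file2. Whether this beats a plain sort -u depends on memory, disk speed and how many duplicates there are, so it is worth timing both on a sample of the real data.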
#4
|
|||
|
|||
|
Try out this one...
sed '$!N; /^\(.*\)\n\1$/!P; D'
# Only the first line of each set of duplicate lines is kept; the rest are deleted.
I have tested this with a file of around 1 GB. It took about 13 min to process that file, much faster than the sort command.
Last edited by amit_sapre; 06-24-2005 at 06:53 AM.
#5
|
||||
|
||||
|
Quote:
and/or if the file is unsorted, then duplicate entries are removed only relative to the first line, since sed makes just one pass through the file. Or did I get it wrong?
vino
#6
|
|||
|
|||
|
Hi Vino,
This command will keep the first entry as it is and delete the other entries, irrespective of whether the file is sorted or not. No prior assumptions are needed when executing this command.
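This claim is easy to check on a tiny unsorted file. In the sketch below (function names and sample data are made up for illustration), the one-liner appears to collapse only runs of adjacent identical lines, much like uniq, so duplicates that are not next to each other in an unsorted file survive. awk '!seen[$0]++' is included as a different technique, not one proposed in this thread, that keeps the first occurrence regardless of order at the cost of holding every distinct line in memory.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner quoted in this thread.
sed_dedupe() {
    sed '$!N; /^\(.*\)\n\1$/!P; D' "$@"
}

# Different technique (not from the thread): keep the first occurrence of
# each line in any order, remembering every distinct line in memory.
awk_dedupe() {
    awk '!seen[$0]++' "$@"
}

printf 'a\nb\na\na\n' | sed_dedupe    # prints a, b, a: only the adjacent pair collapses
printf 'a\nb\na\na\n' | uniq          # prints a, b, a: same behaviour
printf 'a\nb\na\na\n' | awk_dedupe    # prints a, b
```

So for an unsorted file the sed command would need a sort in front of it anyway, and the awk variant is only practical when the set of distinct lines fits in memory.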
#7
|
|||
|
|||
|
Hi Amit,
>> sed '$!N; /^\(.*\)\n\1$/!P; D'
Could you explain the command bit by bit, if you don't mind?
Thanks!
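For reference, here is one way to read the command piece by piece, written as the same script spread over several lines with comments. The function name is only for illustration, and the comments describe standard sed behaviour rather than anything specific to this thread.

```shell
#!/bin/sh
# Same script as the one-liner, one command per line.
sed_dedupe_explained() {
    sed '
        # $!N : unless this is the last line, append the next input line to
        #       the pattern space, separated by a newline.
        $!N
        # /^\(.*\)\n\1$/!P : if the pattern space is NOT "X newline X",
        #       print its first line (everything up to the newline).
        /^\(.*\)\n\1$/!P
        # D : delete through the first newline and restart the cycle with
        #     what is left, without reading a new line first.
        D
    ' "$@"
}
```

Because each cycle only compares a line with the one immediately after it, the script behaves like uniq: it relies on duplicates being adjacent.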