In a huge file, Delete duplicate lines leaving unique lines
# 8  
Old 08-02-2011
Excuse me but "sort -m". This will require much less memory.

Oops, yes: with -m alone, duplicates can remain. But if the final full sort fails for lack of memory, it is still possible to use "sort -m | uniq" on the pre-sorted pieces.
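A minimal sketch of the split / sort / merge idea, for illustration only. The function name dedupe_chunked, the chunk size and the "chunk." prefix are made up for this example, and it assumes a non-empty input file and a system that provides mktemp -d (common, but not required by POSIX).

```shell
#!/bin/sh
# Sketch: de-duplicate a large file piece by piece.
# dedupe_chunked and its arguments are illustrative names, not a standard tool.
dedupe_chunked() {
    infile=$1
    outfile=$2
    chunk_lines=${3:-1000000}   # lines per piece; tune to the memory you have

    # Scratch directory for the pieces (assumes mktemp -d is available).
    workdir=$(mktemp -d) || return 1

    # 1. Split the input into pieces small enough to sort comfortably.
    split -l "$chunk_lines" "$infile" "$workdir/chunk." || return 1

    # 2. Sort each piece on its own, dropping duplicates inside the piece.
    for f in "$workdir"/chunk.*; do
        sort -u "$f" > "$f.sorted" && rm -f "$f" || return 1
    done

    # 3. Merge the already-sorted pieces. Equal lines from different pieces
    #    end up adjacent, so uniq can drop the remaining duplicates.
    sort -m "$workdir"/chunk.*.sorted | uniq > "$outfile"
    status=$?

    rm -rf "$workdir"
    return $status
}
```

Usage would look like: dedupe_chunked infile outfile 5. The output is sorted, so the original line order is not kept. Passing -u together with -m to sort should give the same result without the extra uniq.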
# 9  
Old 08-02-2011
Quote:
Originally Posted by yazu
Excuse me but "sort -m". This will require much less memory.
Consider the following:

Code:
% cat infile 
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
% split -l 5 infile 
% for f in x*; do sort -u "$f" > "$f"_sorted; done
% sort -m x*_sorted
adsf123
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj
asdlfkjlasdfj
asdlfkjlasdfj343
asdlfkjlasdfj343
asdlfkjlasdfj56
asdlfkjlasdfj56

Or were you suggesting something different?
This User Gave Thanks to radoulov For This Post:
# 10  
Old 08-02-2011
Yes. You are very right. I've corrected my previous post.
# 11  
Old 08-02-2011
I am trying the split and sort; I will let you know once it is done. Meanwhile, I have a question: why can't we implement something like the loop below, so that it does not take much extra space?
for line in `cat infile`
do
#delete all lines in infile matching $line leaving 1 $line#
done
exit 0
# 12  
Old 08-02-2011
Because this is an O(n^2) algorithm: every line has to be compared against the rest of the file. For a 4 GB file it would run for a very long time. (Days, months? Who knows...)
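To make the cost concrete, here is one way the loop idea could be written so that it actually runs. This is only a sketch of the naive approach, not a recommendation; dedupe_naive is a made-up name. For every input line it rescans everything kept so far, which is where the quadratic behaviour comes from.

```shell
#!/bin/sh
# Sketch of the naive approach: for each input line, rescan the output
# written so far and append the line only if it is not there yet.
# dedupe_naive is an illustrative name.
dedupe_naive() {
    infile=$1
    outfile=$2

    : > "$outfile"   # start with an empty output file

    while IFS= read -r line; do
        # -F fixed string, -x whole-line match, -q quiet.
        # This grep reads the whole output file once per input line.
        if ! grep -Fxq -e "$line" "$outfile"; then
            printf '%s\n' "$line" >> "$outfile"
        fi
    done < "$infile"
}
```

It keeps the first occurrence of each line in the original order and needs almost no memory, but with n input lines and mostly unique content it performs on the order of n^2 line comparisons, plus one grep process per line. That is fine for a few thousand lines and hopeless for a 4 GB file.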
# 13  
Old 08-02-2011
The last sort execution failed with:
sort: write failed: /tmp/sortmO2esr: No space left on device
# 14  
Old 08-02-2011
'man sort' yields:
Code:
       -T Directory
            Places all temporary files that are created into the directory specified by the
            Directory parameter.

Specify '-T' with a directory that has enough space for the temporary files.
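As a rough sketch of how that could be used: the wrapper below is hypothetical (the name sort_unique_with_tmpdir and the argument order are made up), and the scratch directory is whatever filesystem you know has room.

```shell
#!/bin/sh
# Sketch: run sort -u with its temporary files redirected away from /tmp.
# sort_unique_with_tmpdir is an illustrative name.
sort_unique_with_tmpdir() {
    tmpdir=$1    # directory on a filesystem with enough free space
    infile=$2
    outfile=$3

    # Refuse to run if the scratch directory is missing or not writable.
    if [ ! -d "$tmpdir" ] || [ ! -w "$tmpdir" ]; then
        echo "scratch directory not usable: $tmpdir" >&2
        return 1
    fi

    # -T tells sort where to put its temporary files.
    sort -u -T "$tmpdir" "$infile" > "$outfile"
}
```

Running df -k on the candidate directory beforehand shows how much space is free there. The temporary files of a large sort can add up to roughly the size of the input, so the scratch area should be at least that big.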