In a huge file, Delete duplicate lines leaving unique lines

UNIX for Advanced & Expert Users


#1  08-02-2011, krishnix (Registered User)

Hi All,

I have a huge file (4 GB) that contains duplicate lines, and I want to delete the duplicates so that only one copy of each line remains. sort, uniq, and awk '!x[$0]++' are not working; they all run out of buffer space.

I don't know if this would work: read each line of the file in a loop and delete all of its matching lines except one, so that no buffer space is needed (rough sketch below).
PS: the idea is not to use a second file.
Suggestions please.
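
Something like this is what I mean (untested, just to illustrate; grep re-reads the whole file for every line, so on a 4 GB file it would be extremely slow):

Code:
# print a line only at its first occurrence; no in-memory table is kept
n=0
while IFS= read -r line; do
  n=$((n + 1))
  # find the line number of the first occurrence of this exact line
  first=$(grep -Fxn -- "$line" result | head -n 1 | cut -d: -f1)
  [ "$n" -eq "$first" ] && printf '%s\n' "$line"
done < result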

input data:

Code:
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

output data:

Code:
adsf123
asdlfkjlasdfj
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

Thanks,
Krish

#2  08-02-2011, radoulov (Forum Adviser)
What's the exact error message returned by the awk command?


Code:
awk '!x[$0]++' infile
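
For reference, here is the same one-liner with the logic written out in comments:

Code:
# x is an associative array keyed by the entire line ($0).
# !x[$0] is true only while the count is still zero, i.e. the
# first time a line appears, so the default action (print) fires;
# the ++ then marks the line as seen and later copies are skipped.
awk '!x[$0]++' infile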

#3  08-02-2011, krishnix (Registered User)

Code:
awk: cmd. line:1: (FILENAME=result FNR=6094197) fatal: assoc_lookup: bucket->ahname_str: can't allocate 423 bytes of memory (Cannot allocate memory)


The command I tried:

Code:
awk '!x[$0]++' result > result_new


#4  08-02-2011, radoulov (Forum Adviser)
Try with Perl:


Code:
perl -ne 'print unless $_{$_}++' infile

#5  08-02-2011, yazu (Registered User)
You can split the file (with the split command), sort -u each chunk separately, and then merge the chunks with sort -m. (Whether you need this at all depends on how much memory your system has.)
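
A minimal sketch (untested; the chunk size and the chunk_ prefix are just examples, and -mu assumes a sort that accepts -m together with -u, as GNU sort does). Note that plain sort -m would keep duplicates that land in different chunks, and that the result comes out sorted rather than in the original line order:

Code:
# split the file into pieces of about a million lines each
split -l 1000000 result chunk_

# sort each piece on its own, dropping duplicates inside the piece
for f in chunk_*; do
  sort -u -o "$f" "$f"
done

# merge the sorted pieces; -u drops duplicates that span pieces
sort -mu chunk_* > result_new
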
#6  08-02-2011, krishnix (Registered User)
Sorry, I forgot to mention: thanks for your prompt replies.

Now I am running:

Code:
perl -ne'print unless $_{$_}++' result > result_new2

I am getting the error message: Segmentation fault.

#7  08-02-2011, radoulov (Forum Adviser)
Try the solution suggested by yazu:


Code:
split -l 1000000 infile

for f in x*; do
  sort -u "$f" > "$f"_sorted
done

sort -u x*_sorted > final.out

I believe the final sort should be with -u: plain -m would merge the chunks but keep duplicates that appear in different chunks.
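
If your sort accepts -m and -u together (GNU sort does), the final step can merge the already-sorted chunks instead of re-sorting everything:

Code:
sort -mu x*_sorted > final.out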