Performance problem with removing duplicates in a huge file (50+ GB) | Unix Linux Forums | UNIX for Advanced & Expert Users

#1  01-04-2013
Kannan K (Registered User)

Performance problem with removing duplicates in a huge file (50+ GB)

I'm trying to remove duplicate records from an unsorted input file that is more than 50 GB in size and write the unique records to a new file.

I have already tried a variety of options posted in similar threads and forums, but no luck so far.

Any suggestions, please?

Thanks!

#2  01-04-2013
jim mcnamara (Forum Staff)

The problem is this:

Using associative arrays in awk can consume a lot of memory for a file that large, so it may not work well. If the keys are small, you might get it to work, and awk is otherwise really good for this kind of job.

Please post a few records from the big file and carefully define the columns or keys you use to detect duplicates.
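For reference, the usual awk idiom is sketched below; the file names are placeholders. It keeps the first occurrence of each whole line, but the seen[] array has to hold every unique line in memory, which is exactly the scaling concern for a 50+ GB file.

Code:
# a minimal sketch, assuming the whole line is the duplicate key;
# memory use grows with the number of unique lines, so it may not fit for very large files
awk '!seen[$0]++' input.csv > unique.csv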
#3  01-04-2013
DGPickett (Forum Advisor)

sort -u has less scaling difficulty than associative arrays in a shell. If the order of the lines is important, number them first, then sort out the duplicates, then sort back to the original order and remove the numbering; the sort will keep the first of any set of duplicates. A sketch of that pipeline follows.
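
A minimal sketch of the order-preserving pipeline, assuming GNU coreutils, a shell that understands $'\t' (bash or ksh93), and placeholder file names:

Code:
# number the lines, keep one line per content key, restore the original order, strip the numbers
cat -n input.csv | sort -t $'\t' -k2 -u | sort -t $'\t' -k1,1n | cut -f2- > unique.csv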

Java and C++ APIs like RogueWave H++ offer the same associative-array functionality, more accurately called a hash map container. The file could be mmap64()-mapped into the program so that the hash map holds only hash keys and offsets. The clean file could then be recreated, or even updated in place and truncated with ftruncate(). The shell has noticeably more overhead; big jobs often need sharper tools.

#4  01-07-2013
Kannan K (Registered User)

Sample Records

Sample records from file:


Code:
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480042302,A42302,A000127,143245800913,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480041999,A41999,A000127,143245801337,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000163,A00163,A000163,143245730774,A00163,4133403 
14480042302,A42302,A000127,143245800913,A00127,

Desired output:

Code:
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480000163,A00163,A000163,143245730774,A00163,4133403 
14480041999,A41999,A000127,143245801337,A00127, 
14480042302,A42302,A000127,143245800913,A00127,

I should also add that this file contains 40-50% duplicate records (20-25 GB).
And unfortunately, all columns need to be considered as part of the key to determine duplicates.

The order of the data (sorted or unsorted) in the resulting file doesn't matter; only the removal of duplicates is essential.
#5  01-07-2013
Corona688 (Forum Staff)

As is, the data is going to be very difficult to manage.

My suggestion would be to transform the data into something more suitable for sort -u, then transform it back afterwards.
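
If, as post #4 says, the whole line is the key and the output order is irrelevant, the data may already be in a shape sort -u can take directly. A minimal sketch, assuming GNU sort; the buffer size and temp directory are illustrative and should point at a disk with room for the intermediate merge files:

Code:
# byte-wise collation (LC_ALL=C) is faster and is all that exact-duplicate removal needs;
# -S sets the in-memory sort buffer, -T the directory for temporary files
LC_ALL=C sort -u -S 4G -T /fast/tmpdir input.csv > unique.csv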
#6  01-07-2013
achenle (Registered User)

Any DBAs around with some spare disk space?

Use a DB server. Create a single-column table with a unique index on that column. Insert each line as a row into the table, ignoring duplicate-entry failures. Export the data.

Might not be super fast, but it'll be faster than any script. And it's easy.
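
A minimal sketch of that approach using SQLite; the table and file names are illustrative, and any database with a unique index and an "insert or ignore" equivalent would work the same way:

Code:
# create a table whose primary key is the whole record
sqlite3 dedup.db 'CREATE TABLE t(line TEXT PRIMARY KEY) WITHOUT ROWID;'

# turn each record into an INSERT OR IGNORE so duplicate-key failures are silently skipped,
# and wrap the load in one transaction so it is not painfully slow
{ echo 'BEGIN;'
  sed "s/'/''/g; s/.*/INSERT OR IGNORE INTO t VALUES('&');/" input.csv
  echo 'COMMIT;'
} | sqlite3 dedup.db

# export the unique records
sqlite3 dedup.db 'SELECT line FROM t;' > unique.csv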
#7  01-07-2013
Corona688 (Forum Staff)

I don't think there is a super-fast way to handle 50 GB of data.

Taking advantage of a database index sounds as good a way as any; a proper DB is designed to duplicate-check data larger than memory.