UNIX for Advanced & Expert Users: Expert-to-Expert. Learn advanced UNIX, UNIX commands, Linux, Operating Systems, System Administration, Programming, Shell, Shell Scripts, Solaris, HP-UX, AIX, OS X, BSD.
#1
Performance problem with removing duplicates in a huge file (50+ GB)
I'm trying to remove duplicate records from an unsorted input file larger than 50 GB and write the unique records to a new file.
I have already tried a variety of options posted in similar threads and forums, with no luck so far. Any suggestions, please? Thanks! Last edited by Kannan K; 01-07-2013 at 05:20 AM.
#2
The problems are:
Associative arrays in awk can use a lot of memory for a file that large, so that approach may not work well. If the keys are small you might get it to work, and awk is otherwise really good for this application. Please post a few records from the big file and carefully define the columns or keys you use to detect duplicates.
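To illustrate the point about key size, here is a minimal Python sketch (not awk, and not something posted in this thread) that remembers only a fixed-size digest of each record instead of the record itself. The function name `unique_lines` and the 16-byte BLAKE2b digest are assumptions made for the example. Two different records could in principle share a digest, and whether the set fits in memory still depends on how many distinct records the file holds.

```python
import hashlib


def unique_lines(lines, digest_size=16):
    """Yield each distinct line once, in first-seen order.

    Only a fixed-size digest per distinct line is kept in memory,
    so memory use depends on the number of distinct lines, not their length.
    """
    seen = set()
    for line in lines:
        key = hashlib.blake2b(line.encode("utf-8"), digest_size=digest_size).digest()
        if key not in seen:
            seen.add(key)
            yield line


if __name__ == "__main__":
    sample = [
        "14480020180,A20180,A020180,143245765381,A00062,17284171796",
        "14480020180,A20180,A020180,143245765381,A00062,17284171796",
        "14480000127,A00127,A000127,143245730649,A00127,",
    ]
    for record in unique_lines(sample):
        print(record)
```

Because the function is a generator, it can be fed a file object directly and the input never has to be loaded whole.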
| The Following User Says Thank You to jim mcnamara For This Useful Post: | ||
Kannan K (01-07-2013) | ||
#3
sort -u scales better than associative arrays in a shell tool. If the order of lines is important, number them first, sort out the duplicates, sort the result back into the original order, and then remove the numbering. That way the first of any set of duplicates is the one that is kept.
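The number, sort, de-duplicate, re-sort, strip sequence can be sketched in Python as below. This is an in-memory illustration only, and `dedupe_keep_order` is a hypothetical name; for a 50 GB file each sorting step would have to be an external (on-disk) sort such as the one the sort utility performs.

```python
def dedupe_keep_order(lines):
    """Remove duplicates while preserving the original order of first occurrences."""
    # Step 1: number the lines.
    numbered = list(enumerate(lines))
    # Step 2: sort by content, then by line number, so duplicates become
    # adjacent and the earliest copy comes first within each group.
    numbered.sort(key=lambda pair: (pair[1], pair[0]))
    # Step 3: keep only the first line of each group of equal lines.
    firsts = []
    previous = object()  # sentinel that never equals a real line
    for number, text in numbered:
        if text != previous:
            firsts.append((number, text))
            previous = text
    # Step 4: sort back into the original order by line number.
    firsts.sort()
    # Step 5: remove the numbering.
    return [text for _, text in firsts]
```

Sorting on the pair (content, number) is what makes the kept copy the earliest one; sorting on content alone would rely on the sort being stable.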
Java and C++ libraries such as Rogue Wave's H++ offer the same associative array functionality, more accurately called a hash map container. The file could be mapped into the program with mmap64() so that the hash map holds only hash keys and offsets. The clean file could be recreated, or even updated in place and then truncated with ftruncate() (or fcntl() with F_FREESP on systems that provide it). The shell has noticeably more overhead. Big jobs often need sharper tools. Last edited by DGPickett; 01-04-2013 at 03:53 PM.
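A rough Python sketch of the "hash keys and offsets" idea follows; the post has C++ or Java in mind, so this only shows the shape of the technique. The function name `dedupe_file`, the use of CRC32 as the hash, and writing a new output file rather than updating in place are all assumptions made for the example. The map stores a hash and the (offset, length) of each kept line, and compares the mapped bytes on a hash hit so that collisions do not drop distinct records.

```python
import mmap
import os
import zlib


def dedupe_file(src_path, dst_path):
    """Copy the unique lines of src_path to dst_path; return how many were kept."""
    index = {}  # CRC32 of a line -> list of (offset, length) of kept lines
    kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if os.fstat(src.fileno()).st_size == 0:
            return 0  # an empty file cannot be memory-mapped
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            pos = 0
            while pos < size:
                end = mm.find(b"\n", pos)
                if end == -1:  # last line has no trailing newline
                    end = size
                line = mm[pos:end]
                bucket = index.setdefault(zlib.crc32(line), [])
                # On a hash hit, compare the actual bytes through the mapping.
                if not any(mm[off:off + length] == line for off, length in bucket):
                    bucket.append((pos, end - pos))
                    dst.write(line + b"\n")
                    kept += 1
                pos = end + 1
    return kept
```

The index still grows with the number of distinct records, so this reduces the per-record cost rather than removing the memory limit.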
| The Following User Says Thank You to DGPickett For This Useful Post: | ||
Kannan K (01-07-2013) | ||
#4
Sample Records
Sample records from file:
Code:
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480042302,A42302,A000127,143245800913,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480041999,A41999,A000127,143245801337,A00127,
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000163,A00163,A000163,143245730774,A00163,4133403
14480042302,A42302,A000127,143245800913,A00127,

Desired output:
Code:
14480020180,A20180,A020180,143245765381,A00062,17284171796
14480000127,A00127,A000127,143245730649,A00127,
14480000163,A00163,A000163,143245730774,A00163,4133403
14480041999,A41999,A000127,143245801337,A00127,
14480042302,A42302,A000127,143245800913,A00127,

I should also add that 40-50% of this file (20-25 GB) consists of duplicate records. Unfortunately, all columns need to be considered part of the key when determining duplicates. The order of the data (sorted or unsorted) in the resulting file doesn't matter; only the removal of duplicates is essential.
#5
As is, the data is going to be very difficult to manage.
My suggestion would be to try transforming the data into something more suitable for sort -u, then transforming it back afterwards.
#6
Any DBAs around with some spare disk space?
Use a DB server. Create a single-column table with a unique index on that column. Insert each line as a row into the table, ignoring duplicate-entry failures. Export the data. It might not be super fast, but it'll be faster than any script. And it's easy.
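The table-with-unique-index recipe can be sketched with Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; the post suggests a DB server, and the table name `records`, the function name `dedupe_with_db`, and the in-memory default are assumptions for the example. `INSERT OR IGNORE` plays the role of "ignoring duplicate-entry failures".

```python
import sqlite3


def dedupe_with_db(lines, db_path=":memory:"):
    """Insert every line into a uniquely indexed column and export the survivors."""
    conn = sqlite3.connect(db_path)
    try:
        # A single-column table; UNIQUE makes SQLite build the index for us.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (line TEXT NOT NULL UNIQUE)"
        )
        # Rows that would violate the unique index are skipped silently.
        conn.executemany(
            "INSERT OR IGNORE INTO records (line) VALUES (?)",
            ((line,) for line in lines),
        )
        conn.commit()
        # Export the data; rowid order matches the order of first insertion.
        rows = conn.execute("SELECT line FROM records ORDER BY rowid")
        return [row[0] for row in rows]
    finally:
        conn.close()
```

For a file far larger than memory, `db_path` would point at a database file on a disk with enough free space, and the export step would stream rows to the output file instead of building a list.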
#7
I don't think there is a super-fast way to handle 50G of data.
Taking advantage of a database index sounds as good a way as any; a proper DB is designed to duplicate-check data larger than memory.