Remove dupes in a large file


# 1  
Old 10-13-2018
Remove dupes in a large file

I have a large file (1.5 GB) and want to sort the file.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the awk script, or a Perl script, would be appreciated.
I work in a Windows environment, so Unix tools don't work.
Many thanks.
P.S. I have checked an earlier solution available in the repository, but it is just as slow, if not slower.
# 2  
Old 10-13-2018
Quote:
Originally Posted by gimley
I have a large file (1.5 GB) and want to sort the file.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Hi,

I presume you mean you want to dedupe the file (because that is what your script does, and that is what the title says), not necessarily sort it.

You can compare the difference between
Code:
awk '!X[$0]++' file > file.dedup

and
Code:
sort -u file > file.deduped_sort

The awk version is typically a lot faster because the file does not have to be sorted; whether the input is already sorted makes no difference to the awk command.
Both commands dedupe, but the second one sorts the output as well.
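To see the behavioral difference on a small sample (a sketch with made-up data; `sample.txt` is a placeholder name, not the OP's file):

```shell
# Small demonstration of the two dedup approaches.
printf 'b\na\nb\nc\na\n' > sample.txt

# awk: keeps the first occurrence of each line, preserving input order
awk '!X[$0]++' sample.txt     # prints: b, a, c

# sort -u: dedupes but also sorts the output
sort -u sample.txt            # prints: a, b, c
```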


------- Edit ---------

I just did a test with a 1.6 GiB file and it took under 3 minutes to dedupe it, so I would examine what exactly you are doing.
  • Are you deduping and then sorting?
  • Are you running out of memory and is your system paging/swapping?
Otherwise can you post the exact script/command that you are using?

# 3  
Old 10-13-2018
In case there is a RAM shortage, the following variant helps (it saves some bytes per line).
Code:
awk '!($0 in X) { print; X[$0] }' file > file.dedup
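A quick check (a sketch with sample data; `sample.txt` is a placeholder) that the bare-index variant produces exactly the same output as the counting version:

```shell
# Both variants keep the first occurrence of each line, in input order.
printf 'b\na\nb\nc\na\n' > sample.txt
awk '!X[$0]++' sample.txt                    > out.count
awk '!($0 in X) { print; X[$0] }' sample.txt > out.bare
cmp out.count out.bare && echo "identical"
```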

# 4  
Old 10-13-2018
Hi MadeInGermany,


Would you mind explaining that approach? Is it because X[$0]++ becomes a number and consumes a number's space, whereas X[$0] just creates the index and points to nowhere?
# 5  
Old 10-14-2018
Exactly: X[$0]++ holds a number value, i.e. each new line consumes a number's space.
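This can be seen in awk itself (a minimal sketch): a bare reference like X[$0] creates the key with an empty-string value, while X[$0]++ stores a numeric value for every key:

```shell
# A bare array reference creates the key but leaves its value empty;
# the ++ form stores a numeric value (here 1) per key.
awk 'BEGIN {
    A["k"]                    # bare reference: key exists, value is ""
    B["k"]++                  # counting: key exists, value becomes 1
    h = ("k" in A); e = (A["k"] == "")
    print h, e, B["k"]
}'
# prints: 1 1 1
```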