Remove dupes in a large file

# 1  
Old 1 Week Ago

I have a large file (1.5 GB) and want to sort it.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any way to speed up the awk script, or a Perl script that does the same, would be appreciated.
I work in a Windows environment, so Unix tools don't work for me.
Many thanks.
P.S. I have checked an earlier solution available in the repository, but it is just as slow, if not slower.
# 2  
Old 1 Week Ago
Quote:
Originally Posted by gimley
I have a large file (1.5 GB) and want to sort it.
I used the following awk script to do the job:
Code:
!x[$0]++

The script works, but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Hi,

I presume you mean you want to dedupe the file (because that is what your script does and that is in the title), not necessarily sort it.

You can compare the run time of
Code:
awk '!X[$0]++' file > file.dedup

and
Code:
sort -u file > file.deduped_sort

The awk version is typically a lot faster because it does not have to sort the file: it makes a single pass and only checks whether each line has been seen before. Whether the file is sorted or not should make no difference to the awk command.
Both commands dedupe, but the second one sorts the output as well.
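To make the difference concrete, here is a minimal sketch that runs both commands on a tiny made-up file. The file name sample.txt, the temporary directory, and the fruit names are assumptions for illustration only; LC_ALL=C is set just to make the sort order predictable.

```shell
# Hypothetical sample data in a throwaway directory (not from the thread).
tmpdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\n' > "$tmpdir/sample.txt"

# awk: prints a line only the first time it is seen, so input order is kept.
awk_out=$(awk '!X[$0]++' "$tmpdir/sample.txt")

# sort -u: also drops duplicates, but the result comes out sorted.
sort_out=$(LC_ALL=C sort -u "$tmpdir/sample.txt")

printf 'awk:\n%s\n\nsort -u:\n%s\n' "$awk_out" "$sort_out"

rm -rf "$tmpdir"
```

With this input, the awk result should be pear, apple, fig (first occurrences in original order), while sort -u gives apple, fig, pear.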


------- Edit ---------

I just did a test with a 1.6 GiB file and it took under 3 minutes to dedupe it, so I would examine what exactly you are doing.
  • Are you deduping and then sorting?
  • Are you running out of memory, and is your system paging/swapping?
Otherwise, can you post the exact script/command that you are using?

Last edited by Scrutinizer; 1 Week Ago at 06:14 AM..
# 3  
Old 1 Week Ago
If RAM is short, the following variant helps (it saves some bytes per distinct line).
Code:
awk '!($0 in X) { print; X[$0] }' file > file.dedup

# 4  
Old 1 Week Ago
Hi MadeInGermany,

Would you mind explaining that approach? Is it because X[$0]++ becomes a number and takes up a "float"'s worth of space, whereas X[$0] only creates the index and points to nowhere?
# 5  
Old 1 Week Ago
Exactly. X[$0]++ holds a number value, i.e. each new line consumes a number's space.
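As a quick sanity check that the two variants behave the same, here is a small sketch that runs both on a made-up file and compares the results. The file name sample.txt and its contents are assumptions for illustration; the sketch only checks the output, it does not measure memory use.

```shell
# Hypothetical sample data in a throwaway directory (not from the thread).
tmpdir=$(mktemp -d)
printf 'b\na\nb\nc\na\n' > "$tmpdir/sample.txt"

# Variant 1: X[$0]++ stores a counter for every distinct line.
counted=$(awk '!X[$0]++' "$tmpdir/sample.txt")

# Variant 2: "$0 in X" tests for the key without creating it;
# the bare reference X[$0] then creates the key with no assigned value.
keyed=$(awk '!($0 in X) { print; X[$0] }' "$tmpdir/sample.txt")

if [ "$counted" = "$keyed" ]; then
    echo "identical output"
else
    echo "outputs differ"
fi

rm -rf "$tmpdir"
```

Both variants should print b, a, c here. How much memory the second one actually saves depends on the awk implementation, so it is worth measuring on the real file.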
Unix & Linux Forums Content Copyright©1993-2018. All Rights Reserved.