01-04-2013
Performance problem with removing duplicates in a huge file (50+ GB)
I'm trying to remove duplicate data from an input file with unsorted data which is of size >50GB and write the unique records to a new file.
I'm trying and already tried out a variety of options posted in similar threads/forums. But no luck so far..
Any suggestions please ?
Thanks !!
Last edited by Kannan K; 01-07-2013 at 06:20 AM..
10 More Discussions You Might Find Interesting
1. UNIX for Dummies Questions & Answers
i have a file with some 1000 entries it will contain entries like
1000,ram
2000,pankaj
1001,rahim
1000,ram
2532,govind
2000,pankaj
3000,venkat
2532,govind
what i want is i want to extract only the distinct rows from this file
so my output should contain only
1000,ram... (2 Replies)
Discussion started by: trichyselva
2 Replies
2. UNIX for Dummies Questions & Answers
hey all,
I need some help.
I have a text file with names in it.
My target is that if a particular pattern exists in that file more than once..then i want to rename all the occurences of that pattern by alternate patterns..
for e.g if i have PATTERN occuring 5 times then i want to... (3 Replies)
Discussion started by: ashisharora
3 Replies
3. Shell Programming and Scripting
I have a log file with posts looking like this:
--
Messages can be delivered by different systems at different times. The id number is used to sort out duplicate messages. What I need is to strip the arrival time from each post, sort posts by id number, and reattach arrival time to respective... (2 Replies)
Discussion started by: Ilja
2 Replies
4. Shell Programming and Scripting
Hi Experts,
Please check the following new requirement. I got data like the following in a file.
FILE_HEADER
01cbbfde7898410| 3477945| home| 1
01cbc275d2c122| 3478234| WORK| 1
01cbbe4362743da| 3496386| Rich Spare| 1
01cbc275d2c122| 3478234| WORK| 1
This is pipe separated file with... (3 Replies)
Discussion started by: tinufarid
3 Replies
5. Shell Programming and Scripting
Hi,
I have a file that I want to change the format of. It is a large file in rows but I want it to be comma separated (comma then a space).
The current file looks like this:
HI, Joe, Bob, Jack, Jack
After I would want to remove any duplicates so it would look like this:
HI, Joe,... (2 Replies)
Discussion started by: kylle345
2 Replies
6. HP-UX
I have 2 files; one file (say, details.txt) contains the details of employees and another file (say, emp.txt) has some selected employee names. I am extracting employee details from details.txt by using emp.txt and the corresponding code is:
while read line
do
emp_name=`echo $line`
grep -e... (7 Replies)
Discussion started by: arb_1984
7 Replies
7. UNIX for Dummies Questions & Answers
Hi All,
I am merging files coming from 2 different systems ,while doing that I am getting duplicates entries in the merged file
I,01,000131,764,2,4.00
I,01,000131,765,2,4.00
I,01,000131,772,2,4.00
I,01,000131,773,2,4.00
I,01,000168,762,2,2.00
I,01,000168,763,2,2.00... (5 Replies)
Discussion started by: Sri3001
5 Replies
8. Shell Programming and Scripting
i hav two files like
i want to remove/delete all the duplicate lines in file2 which are viz unix,unix2,unix3 (2 Replies)
Discussion started by: sagar_1986
2 Replies
9. Shell Programming and Scripting
i hav two files like
i want to remove/delete all the duplicate lines in file2 which are viz unix,unix2,unix3.I have tried previous post also,but in that complete line must be similar.In this case i have to verify first column only regardless what is the content in succeeding columns. (3 Replies)
Discussion started by: sagar_1986
3 Replies
10. Shell Programming and Scripting
I am trying to remove whitespaces from a file containing sample data as:
457 <EOFD> Mar 1 2007 12:00:00:000AM <EOFD> Mar 31 2007 12:00:00:000AM <EOFD> system <EORD> 458 <EOFD> Mar 1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007 9:10:56:036PM <EOFD> prodiws<EORD> . Basically these... (11 Replies)
Discussion started by: amvip
11 Replies
IGAWK(1) Utility Commands IGAWK(1)
NAME
igawk - gawk with include files
SYNOPSIS
igawk [ all gawk options ] -f program-file [ -- ] file ...
igawk [ all gawk options ] [ -- ] program-text file ...
DESCRIPTION
Igawk is a simple shell script that adds the ability to have ``include files'' to gawk(1).
AWK programs for igawk are the same as for gawk, except that, in addition, you may have lines like
@include getopt.awk
in your program to include the file getopt.awk from either the current directory or one of the other directories in the search path.
OPTIONS
See gawk(1) for a full description of the AWK language and the options that gawk supports.
EXAMPLES
cat << EOF > test.awk
@include getopt.awk
BEGIN {
while (getopt(ARGC, ARGV, "am:q") != -1)
...
}
EOF
igawk -f test.awk
SEE ALSO
gawk(1)
Effective AWK Programming, Edition 1.0, published by the Free Software Foundation, 1995.
AUTHOR
Arnold Robbins (arnold@skeeve.com).
Free Software Foundation Nov 3 1999 IGAWK(1)