Huge files manipulation


 
# 1  
Old 11-06-2008

Hi, I need a fast way to delete duplicate entries from very huge files (>2 GB); the files are plain text.

I tried all the usual methods (awk / sort / uniq / sed / grep ...), but it always ended with the same result: a memory core dump.

I'm using large HP-UX servers.

Any advice will be very welcome.


Thx in advance.

PS: I do not want to split the files.
# 2  
Old 11-06-2008
Hammer & Screwdriver Could you break it into lots of files, do your work, then recombine?

I am thinking here...
based on the first character, copy all lines matching "^[aA]" to file_a using grep (for instance),
repeat for bB, cC, and so on,

then do your dup check (sort -u, maybe) on each of the 26 files,

finally, recombine the 26 files
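That partitioning could be sketched roughly like this (a sketch only; "bigfile" and the file_* names are placeholders, and lines that start with a digit or punctuation would need an extra catch-all bucket):

Code:
```shell
# Split by first letter, dedup each piece, recombine.
# "|| :" keeps going when grep finds no match for a letter (exit 1).
for c in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    grep -i "^$c" bigfile > "file_$c" || :
done
for f in file_?; do
    sort -u "$f" > "$f.dedup"   # each piece is small enough to sort alone
done
cat file_?.dedup > bigfile.dedup
```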
# 3  
Old 11-06-2008
If the first character of your key is not highly redundant, you can try this with awk.
The idea is predicated on your original code blowing the size limits for a single hash:
Code:
#example with a numeric value as the first char of key
# this uses a concatenated key

awk '{ key = substr($0,10,3) substr($0,35,10)
       ch = substr(key,1,1)
       # note: "==" (comparison), not "=" (assignment), and "next"
       # (not "continue") to skip to the next input record
       if (ch == "0") { if (!arr0[key]++) print $0; next }
       if (ch == "1") { if (!arr1[key]++) print $0; next }
       if (ch == "2") { if (!arr2[key]++) print $0; next }
       if (ch == "3") { if (!arr3[key]++) print $0; next }
       if (ch == "4") { if (!arr4[key]++) print $0; next }
       if (ch == "5") { if (!arr5[key]++) print $0; next }
       if (ch == "6") { if (!arr6[key]++) print $0; next }
       if (ch == "7") { if (!arr7[key]++) print $0; next }
       if (ch == "8") { if (!arr8[key]++) print $0; next }
       if (ch == "9") { if (!arr9[key]++) print $0 }
     }' inputfile > outputfile

This worked for me with a >2 GB file on an HP 9000 V-Class running HP-UX 11.0.
# 4  
Old 11-06-2008
Hi Klashxx, I wonder, will the duplicated lines always follow each other, or are they spread around in the file?

/Lakris
# 5  
Old 11-06-2008
Hmm, can't you do:

Code:
awk '!L[$0]++' file1 > file2

# 6  
Old 11-07-2008
Many thanks for your ideas..
Quote:
I am thinking here...
based on the first character, copy all lines matching "^[aA]" to file_a using grep (for instance),
repeat for bB, cC, and so on,

then do your dup check (sort -u maybe) on each of 26 files

finally, recombine the 26 files
Nice one.. but I want to use the split approach only as a last option.

Quote:
If your key first character is not highly redundant you can try this with awk.
The idea is predicated on your original code blowing the limits for a hash
Definitely a good trick, but the file content is a little messy...

Quote:
I wonder, will the duplicated lines always follow each other, or are they spread around in the file?
Unfortunately it is unsorted, and the duplicated keys are in a random order.

chatwizrd, that doesn't work.. out of memory.

..To clarify, this is the structure of the file:

Code:
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|1804|00|||+0000000000000,00|||+0000000000000,00

Having a key formed by the first 7 fields, I want to print or delete only the duplicates.
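One hedged sketch for a key made of the first 7 pipe-delimited fields is to lean on sort, which spills to temporary files on disk rather than holding everything in memory the way an awk hash does (file names here are placeholders):

Code:
```shell
# Keep only the first record for each 7-field key:
sort -t'|' -k1,7 -u inputfile > deduped

# Or print only the duplicate records; after the sort, equal keys are
# adjacent, so the awk pass needs only one key's worth of memory:
sort -t'|' -k1,7 inputfile |
awk -F'|' '{ k = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7
             if (k == prev) print      # same key as the previous line
             prev = k }' > dups_only
```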

I'm very new to perl, but I read somewhere that the Tie::File module can handle very large files; I tried it but cannot get the right code...
Any ideas?

Thank you in advance.

Regards
# 7  
Old 11-10-2008
Hi.

Interesting problem -- a few thoughts.

For timings, I used a 1 GB text file, about 15 M lines, with many duplicates (it is a large number of copies of the text of a novel.)

1) I didn't see any requirement that the file be kept in the original order, so one solution is to sort the file. On my system, sort processed the file using 7 keys in under a minute. An option to remove duplicates about halved the time (many duplicates did not need to get written out).

If the original ordering is needed, one could add a field containing the line number, which could then be used as an additional key, so the final output would be in the original order. You might be able to get by with a single sort, but if 2 sorts would be needed, they could be in a pipeline, so that the system would handle the connections, and no large intermediate file need be directly used.
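That decorate-sort-undecorate idea could look like the following sketch (with GNU sort, -u keeps the first occurrence of each key; POSIX leaves which duplicate survives unspecified, so check the behavior of HP-UX sort):

Code:
```shell
tab="$(printf '\t')"
# 1. prefix each line with its line number
# 2. dedup on the content (fields 2 to end of line)
# 3. restore the original order by line number, then strip the prefix
awk -v OFS='\t' '{ print NR, $0 }' inputfile |
sort -t"$tab" -k2 -u |
sort -t"$tab" -k1,1n |
cut -f2- > deduped
```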

2) The running out of memory in awk suggests that awk doesn't go beyond real memory, that your system does not use virtual memory, or that you have no swap space -- or similar reasons along those lines. I used perl to keep an in-memory hash of MD5 checksums of the lines. I did see some paging near the end -- the test system has 3 GB of real memory. I arranged for the file to have an additional field making every line unique, so that I had 15 M entries. I did no more processing except for checking the counts of the hashes -- the entire process took about 2.5 minutes of real time.

The advantage of using a checksum + line number is that if the hash does not fit into memory (for whatever reason), the derived data (checksum + line number) can be written out, and the resulting file can be sorted. The duplicate checksum lines will then be adjacent, and the file can be processed to obtain the line numbers of the originals as well as the subsequent duplicates. These line numbers can then be used with other utilities, say sed, to display or to prune the original file.

3) You mentioned the perl module Tie::File. For small files, this might be a useful choice, depending on what you want to do. Simply opening my test file took about 100 seconds. I tested reading the file and writing to /dev/null. The "normal" perl "<>" operator took about half a minute of wall-clock time. Using Tie::File took about 55 minutes -- 2 orders of magnitude slower -- reading straight through, with no other processing. I don't have a lot of experience with Tie::File, but from what I have seen so far, I would avoid it for applications like this, where you probably need to look at every line in the file.

Good luck ... cheers, drl