Huge files manipulation
Posted by drl on 10 November 2008, 02:00 PM
Hi.

Interesting problem -- a few thoughts.

For timings, I used a 1 GB text file of about 15 M lines containing many duplicates (it is essentially many concatenated copies of the text of a novel).

1) I didn't see any requirement that the file be kept in its original order, so one solution is simply to sort it. On my system, sort processed the file on 7 keys in under a minute. Adding the option to remove duplicates (sort -u) roughly halved that time, since the many duplicate lines never had to be written out.
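
For example, something along these lines (the file name, delimiter, and key positions here are made up for illustration; adjust them to your record layout):

    # sort on a few fields; -t sets the field separator (hypothetical keys)
    sort -t '|' -k1,1 -k3,3 -k5,5n big.txt > big.sorted

    # same, but keep only one line per distinct key combination
    sort -t '|' -u -k1,1 -k3,3 -k5,5n big.txt > big.sorted.uniq

    # GNU sort accepts a bigger buffer and an alternate temp area for huge files
    sort -S 1G -T /bigtmp -t '|' -k1,1 -k3,3 big.txt > big.sorted

Note that with no -k options at all, sort -u removes complete duplicate lines; with keys, it keeps one line per distinct key combination.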

If the original ordering is needed, one could add a field containing the line number, which could then be used as an additional key, so that the final output can be put back into the original order. You might be able to get by with a single sort, but even if 2 sorts are needed they can run in a pipeline, so the system handles the connections and no large intermediate file needs to be written to disk.
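
A rough sketch of that decorate-sort-undecorate idea, again with made-up field positions (the line number becomes a new first field and is stripped off at the end):

    # 1) tag each line with its line number, 2) dedupe on the data keys
    #    (now shifted right by one), 3) restore the original order,
    # 4) drop the line-number tag
    awk -F'|' -v OFS='|' '{ print NR, $0 }' big.txt |
      sort -t '|' -u -k2,2 -k4,4                    |
      sort -t '|' -k1,1n                            |
      cut -d '|' -f2- > big.deduped

Which of several key-duplicates survives step 2 is up to sort; with GNU sort, adding -s (stable) to that sort should keep the first occurrence from the input.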

2) Running out of memory in awk suggests that awk doesn't go beyond real memory, that your system is not using virtual memory, or that you have no swap space -- or something along those lines. I used perl to keep an in-memory hash of MD5 checksums of the lines. I did see some paging near the end -- the test system has 3 GB of real memory. I arranged for the file to have an additional field making every line unique, so the hash held 15 M entries. I did no processing beyond checking the counts in the hash -- the entire run took about 2.5 minutes of real time.
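
The counting itself fits in a one-liner; a minimal sketch (what you then do with the counts depends on the real task):

    # count occurrences of each line, keyed by the MD5 of the line
    # (Digest::MD5 ships with perl)
    perl -MDigest::MD5=md5_hex -ne '
        $count{ md5_hex($_) }++;
        END {
            for my $sum (keys %count) {
                print "$count{$sum}\t$sum\n" if $count{$sum} > 1;
            }
        }
    ' big.txt

Keying on the 32-character digest rather than the full line keeps the hash small even when the lines themselves are long.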

The advantage of using checksum + line number is that if the hash does not fit into memory (for whatever reason), the derived data (checksum + line number) can be written out and the resulting file sorted. The lines with duplicate checksums will then be adjacent, and the file can be processed to obtain the line numbers of the originals as well as of the subsequent duplicates. Those line numbers can then be fed to other utilities, say sed, to display the duplicates or to trim them out of the original file.
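
A sketch of that external variant (file names are made up):

    # 1) write "checksum <tab> line-number" for every line
    perl -MDigest::MD5=md5_hex -ne 'print md5_hex($_), "\t", $., "\n"' big.txt > sums.txt

    # 2) sort by checksum, then by line number, so duplicates become adjacent
    sort -k1,1 -k2,2n sums.txt > sums.srt

    # 3) collect the line numbers of every line after the first in its group
    awk '$1 == prev { print $2 } { prev = $1 }' sums.srt > dups.nums

    # 4) e.g. turn those numbers into a sed script that deletes the duplicates
    sed 's/$/d/' dups.nums > dups.sed
    sed -f dups.sed big.txt > big.deduped

To merely display the duplicates instead, generate "p" commands and run sed -n -f dups.sed big.txt.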

3) You mentioned the perl module Tie::File. For small files this might be a useful choice, depending on what you want to do, but it did not fare well here. Simply opening (tying) my test file took about 100 seconds. I then tested reading the file and writing to /dev/null: the "normal" perl "<>" read loop took about half a minute of wall-clock time, while Tie::File took about 55 minutes -- 2 orders of magnitude slower -- just reading straight through, with no other processing. I don't have a lot of experience with Tie::File, but from what I have seen so far I would avoid it for applications like this, where you probably need to look at every line of the file.
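
The comparison was essentially of this form (a sketch, not the exact test; Tie::File presents the file as an array with one line per element):

    # plain sequential read
    time perl -ne 'print' big.txt > /dev/null

    # the same read through Tie::File (which opens the file read-write by default)
    time perl -MTie::File -e '
        tie my @lines, "Tie::File", "big.txt" or die "tie failed: $!";
        print "$_\n" for @lines;
    ' > /dev/null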

Good luck ... cheers, drl
 
