Hi there, I'm camor and I'm trying to process huge files with bash scripting and awk.
I've got a dataset folder with 10 files (16 million rows each, about 600 MB per file), and I've got a sorted file with all of the keys inside.
For example, I would like to obtain an output that collects, for each key, every value that appears for it across the dataset files.
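To illustrate with made-up records (a key in the first column and one value in the second):

    # dataset/file1:          # dataset/file2:
    key1 valueA               key1 valueC
    key2 valueB               key3 valueD

    # desired output: every value collected under its key
    key1 valueA valueC
    key2 valueB
    key3 valueD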
I created a bash script like this: it produces a sorted file with all of my keys (very large, about 62 million records), slices that file into pieces, and passes each piece to my awk script.
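Simplified, it looks something like this ($HOME/project and the slice size are placeholder values):

    #!/bin/bash
    # Simplified outline; the paths and the slice size are made up for illustration.
    BASEPATH=$HOME/project
    clear
    echo $(date)                                  # timestamp the run
    # merge the keys from every dataset file into one sorted list (~62M records)
    cut -d' ' -f1 "$BASEPATH"/dataset/* | sort -u > "$BASEPATH/keys.txt"
    # slice the key list and run the awk script once per slice
    split -l 3000000 "$BASEPATH/keys.txt" "$BASEPATH/slice."
    for piece in "$BASEPATH"/slice.*; do
        awk -f process.awk "$piece" dataset/*     # dataset/* rather than "$BASEPATH"/dataset/*
    done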
Here is my AWK script:
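Again in simplified form (the slice of keys comes first on the command line, then all of the dataset files):

    # process.awk -- run as: awk -f process.awk slice dataset/*
    # First input file: the current slice of keys.
    FNR == NR { slice[$1]; next }
    # Every other file: accumulate the values seen for EVERY key, even though
    # only the keys in the current slice are printed at the END.
    { data[$1] = data[$1] " " $2 }
    END {
        for (key in slice)
            print key data[key]
    }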
I've figured out that my bottleneck comes from iterating over the dataset folder as awk input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data my RAM (30 GB) gets congested. I think the problem is "dataset/*" as the awk input.
Does anyone have any suggestions or advice? Thank you.
You are reading all of your large files about 20 times. If you could read those files once (instead of twenty times), you would probably reduce running time to less than 5% of what it is now.
With a 30 GB system and data totaling ~6 GB, is there a reason why you can't schedule a time to run an awk script that will need to be allocated roughly 7 GB of RAM while it runs, letting it read all of your data into memory and process it in one pass?
Do you really want 16 output files or do you just want one output file?
Do you care if the output is sorted, or do you just sort the keys to create distinct smaller lists of keys to be processed individually? (I note that the data written into your output files by awk aren't sorted.)
And a trivial performance note... There is no need to fire up a subshell to gather arguments for echo.
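With a made-up command of the same shape as the line in the sketch above, it is easier to just use:

    date

instead of:

    echo $(date)

since the second form starts a subshell, captures its output, and then hands that text to echo just to print it again.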
And, for consistency with the printf statements elsewhere in your awk script, the last print statement in your awk script should be a printf as well.
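In terms of the sketch above, that means:

    printf "%s%s\n", key, data[key]

instead of:

    print key data[key]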
Looking at your code more closely, it seems that it is even worse than I thought. You are loading all of the data from all of the files into awk on each run, but printing just a fraction of the results. And you aren't being consistent in your pathnames (sometimes using pathnames relative to the current directory and sometimes using the directory specified by $BASEPATH). The following should run MUCH faster for you:
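A sketch of that one-pass approach, reusing the placeholder names from above ($BASEPATH, keys.txt, and a made-up $FILENAME): the key list is read first, then every dataset file exactly once, and everything accumulates in memory (roughly the ~7 GB discussed earlier).

    #!/bin/bash
    BASEPATH=$HOME/project            # placeholder location, as above
    FILENAME=all_keys.out             # placeholder output name
    awk '
        # First input file: remember every key we care about.
        FNR == NR { keys[$1]; next }
        # All dataset files, each read exactly once: accumulate values per key.
        $1 in keys { data[$1] = data[$1] " " $2 }
        END {
            for (key in keys)
                printf "%s%s\n", key, data[key]
        }
    ' "$BASEPATH/keys.txt" "$BASEPATH"/dataset/* > "$BASEPATH/processed/$FILENAME"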
or, if you need the output sorted, change the last line of the script to:
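In terms of the sketch above, that means piping the awk output through sort before redirecting it:

    ' "$BASEPATH/keys.txt" "$BASEPATH"/dataset/* | sort > "$BASEPATH/processed/$FILENAME"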
which, with your 3 sample input files located in the directory $BASEPATH/dataset, stores the merged data in the file named $BASEPATH/processed/$FILENAME (although the output order may vary with different versions of awk unless you pipe the output through sort).
I didn't see any need to clear the screen before outputting two lines of data, but you can add the clear back into the top of the script if you want to.
As always, if you are using this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
---------- Post updated at 13:29 ---------- Previous update was at 11:57 ----------
Not sure if this will perform better (or even worse?), but it leaves the memory consumption to sort, which has options to handle that. We need the files' names up front, thus the ls trick. Try something along these lines:
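(A sketch of the idea; the -S buffer-size and -T temp-dir options are GNU sort spellings, and the key/value field layout is the made-up one from above.)

    # Gather the file names up front (the ls trick), then let sort do the
    # heavy lifting -- it spills to temporary files instead of filling RAM --
    # and collapse adjacent lines per key with a small awk pass.
    FILES=$(ls dataset/*)
    sort -k1,1 -S 2G -T /tmp $FILES |
    awk '
        $1 != prev { if (prev != "") print out; out = $1; prev = $1 }
                   { out = out " " $2 }
        END        { if (prev != "") print out }
    '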
Thank you for catching my typo. (I made the mistake of trying to make the array name more descriptive after I tested the script, and missed the last needed change.)
Note that ls -x file* can produce more than one line of output depending on the length of the list of file names. We know that the input files are selected by $BASEPATH/dataset/*, but we haven't been given any indication of the lengths of the names matched by * in that pattern, nor of the real length of the expansion of BASEPATH.
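For example, with made-up file names, ls -x wraps once the list gets long:

    $ ls -x dataset/*
    dataset/datafile_201501  dataset/datafile_201502  dataset/datafile_201503
    dataset/datafile_201504  dataset/datafile_201505

One way around it is to let the shell expand the pattern itself, which avoids the line-wrapping issue entirely:

    set -- "$BASEPATH"/dataset/*      # positional parameters hold the names safely
    sort -k1,1 -T /tmp "$@"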