Process multiple large files with awk


 
# 1  
Old 12-05-2015
Process multiple large files with awk

Hi there, I'm camor and I'm trying to process huge files with bash scripting and awk.
I've got a dataset folder with 10 files (16 million rows each, about 600MB per file), and I've also got a sorted file with all the keys inside.
For example, here are three of the dataset files:
Code:
a sample_1 200
a.b sample_2 10
a sample_3 10

Code:
a sample_1 10
a sample_3 67

Code:
a sample_1 1
a.b sample_2 10
a sample_4 20

I would like to obtain an output like this (one column per dataset file, plus a final column holding the row sum):
Code:
a sample_1 200 10 1 211
a.b sample_2 10 0 10 20
a sample_3 10 67 0 77
a sample_4 0 0 20 20

I created a bash script like this. It produces a sorted file with all of my keys; that file is very large, about 62 million records. I slice it into pieces and pass each piece to my awk script.
Code:
#! /bin/bash
clear
FILENAME=<result>
BASEPATH=<base_path>
mkdir processed/slice
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 > $BASEPATH/processed/aggr
sort -u -k2 $BASEPATH/processed/aggr > $BASEPATH/processed/sorted
split -d -l 1000000 processed/sorted processed/slice/slice-
echo $(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
for filename in processed/slice/*; do
  awk -v filename="$filename" -f algorithm.awk dataset/* >> processed/$FILENAME
done
echo $(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
rm $BASEPATH/processed/aggr
rm $BASEPATH/processed/sorted
rm -rf $BASEPATH/processed/slice

Here is my AWK script:
Code:
BEGIN{
 # preload every key of the current slice and initialise one "0"
 # counter per dataset file (10 files)
 while((getline < filename) > 0){
  key=$1" "$2;
  sources[key];
  for(i=1;i<11;i++){
    keys[key"-"i] = "0";
  }
 }
 close(filename);
}
{
 # a new input file has started: advance the file counter
 if(FNR==1){
  ARGIND++;
 }
 key=$1" "$2;
 keys[key"-"ARGIND] = $3
}
END{
 # one output line per key in the slice: key, one value per file, total
 for (s in sources) {
  sum = 0
  printf "%s", s
  for (j=1;j<11;j++) {
    printf "%s%s", OFS, keys[s"-"j]
    sum += keys[s"-"j]
  }
 print " "sum
 }
}

I've figured out that my bottleneck comes from iterating over the dataset folder as the awk input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data my RAM (30GB) gets congested. I think the problem is "dataset/*" as the awk input.
Does anyone have any suggestions or advice? Thank you.
# 2  
Old 12-06-2015
You are reading all of your large files about 20 times. If you could read those files once (instead of twenty times), you would probably reduce running time to less than 5% of what it is now.

With a 60Gb system and data totaling ~6Gb, is there a reason why you can't schedule a time to run an awk script that needs to be allocated a little over ~7Gb of RAM while it is running, and let it read all of your data into memory and process it in one pass?

Do you really want 16 output files or do you just want one output file?

Do you care if the output is sorted, or do you just sort the keys to create distinct smaller lists of keys to be processed individually? (I note that the data written into your output files by awk aren't sorted.)

And a trivial performance note... There is no need to fire up a subshell to gather arguments for echo. It is easier to just use:
Code:
date '+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'
...
date '+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'

instead of:
Code:
echo $(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")
...
echo $(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")

And, for consistency with the printf statements in your awk script, the last print statement in your awk script should be:
Code:
print OFS sum

instead of:
Code:
print " "sum
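
The two forms are identical as long as OFS is the default single space; they only diverge if OFS is ever changed, at which point the literal " " no longer matches the separators the printf calls produce. A quick illustration (the value 211 is just the sample_1 total from the sample data):
Code:
awk 'BEGIN {
	OFS = "\t"	# pretend the output were tab separated
	sum = 211
	print "a sample_1" " " sum	# literal space:  a sample_1 211
	print "a sample_1" OFS sum	# OFS:            a sample_1<TAB>211
}'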

# 3  
Old 12-06-2015
Looking at your code more closely, it seems that it is even worse than I thought. You are loading all of the data from all of the files into awk on each run, but only printing a fraction of the results. And you aren't being consistent in your pathnames (sometimes using pathnames relative to the current directory and sometimes using the directory specified by $BASEPATH). The following should run MUCH faster for you:
Code:
#!/bin/bash
BASEPATH="/path/to/your/data"	# Must be an absolute pathname
		# Source files must be in directory $BASEPATH/dataset
FILENAME="result"	# Store results in $BASEPATH/processed/$FILENAME

cd "$BASEPATH"
date '+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'
awk '
FNR == 1 {	# a new input file has started: advance the file counter
	fn++
}
{	keys[key = $1 OFS $2]	# remember every key seen ...
	data[key, fn] = $3	# ... and its value in the current file
}
END {	for(key in keys) {	# one output line per key: key, per-file values, total
		sum = 0
		printf("%s", key)
		for(i = 1; i <= fn; i++) {
			printf("%s%d", OFS, data[key, i])	# %d prints missing ("") entries as 0
			sum += data[key, i]
		}
		print OFS sum
	}
}' "dataset/"* > "processed/$FILENAME"
date '+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'

or, if you need the output sorted, change the next-to-last line of the script to:
Code:
}' "dataset/"* | sort -k2,2 -o "processed/$FILENAME"

which, with your 3 sample input files located in the directory $BASEPATH/dataset, stores the following data in the file named $BASEPATH/processed/$FILENAME:
Code:
a sample_1 200 10 1 211
a sample_3 10 67 0 77
a sample_4 0 0 20 20
a.b sample_2 10 0 10 20

(although the output order may vary with different versions of awk) or, if you pipe the output through sort:
Code:
a sample_1 200 10 1 211
a.b sample_2 10 0 10 20
a sample_3 10 67 0 77
a sample_4 0 0 20 20
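
As an aside, and not something the script above relies on: if you know the script will always run under gawk, gawk 4.0 and later can make the for (key in keys) loop itself deterministic through PROCINFO["sorted_in"], which avoids the external sort. Note that it orders by the whole key string, so a.b sample_2 would land after all of the plain a keys rather than being ordered by the second field the way the sort -k2,2 output above is. A minimal demonstration:
Code:
gawk 'BEGIN {
	keys["a sample_3"]; keys["a.b sample_2"]; keys["a sample_1"]
	PROCINFO["sorted_in"] = "@ind_str_asc"	# gawk 4.0+: visit indices in ascending string order
	for (key in keys) print key
}'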

I didn't see any need to clear the screen before outputting two lines of data, but you can add the clear back into the top of the script if you want to.

As always, if you are using this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Last edited by Don Cragun; 12-06-2015 at 04:14 PM.. Reason: Fixed typo noted by RudiC: s/d\[/data[/
# 4  
Old 12-06-2015
You need to replace
Code:
sum += d[key, i]

with
Code:
sum += data[key, i]

to have the sum printed.

---------- Post updated at 13:29 ---------- Previous update was at 11:57 ----------

Not sure if this will perform better (or even worse?), but it leaves the memory consumption to sort, which has options to handle that. We need the files' names up front, hence the ls trick. Try:
Code:
{ ls -x file[1-3]; grep '' file[1-3] | sort -k2,2 -k1,1; } | awk  '
NR == 1         {for (FC=n=split ($0, FN); n>0; n--) FILES[FN[n]] = n
                 next
                }

function ADDZERO(ST, EN)        {for (i=ST; i<EN; i++) LINE = LINE FS "0"
                                 LF = ++i
                                }


                {split ($1, T, ":")
                 FNO = FILES[T[1]]
                }

$2 != LAST      {ADDZERO(LF, FC+1)
                 if (NR > 2) print LINE, SUM
                 SUM  = 0
                 LINE =  T[2] FS $2
                 LF = 1
                }
                {LAST = $2
                 ADDZERO(LF, FNO)
                 LINE = LINE FS $3
                 SUM += $3
                }
END             {print LINE, SUM
                }
'
a sample_1 200 10 1 211
a.b sample_2 10 0 10 20
a sample_3 10 67 0 77
a sample_4 0 0 20 20
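
To make the pipeline easier to follow: when grep is given more than one file name it prefixes every matching line with "filename:", and the sort then groups all occurrences of a key together, ordered by file within each key, so awk can build each output line in a single streaming pass instead of holding everything in memory. With the three sample files from post #1 (assumed here to be named file1, file2 and file3 so they match the file[1-3] glob), the stream fed to awk would look roughly like this:
Code:
file1  file2  file3
file1:a sample_1 200
file2:a sample_1 10
file3:a sample_1 1
file1:a.b sample_2 10
file3:a.b sample_2 10
file1:a sample_3 10
file2:a sample_3 67
file3:a sample_4 20

If sort's own memory use becomes a concern on the real 160-million-line stream, GNU sort (this assumes GNU coreutils) can cap its in-memory buffer and be pointed at a roomy temporary directory, e.g. sort -S 2G -T /some/big/tmp -k2,2 -k1,1 (the 2G buffer size and /some/big/tmp are placeholder values).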


Last edited by RudiC; 12-06-2015 at 08:24 AM..
# 5  
Old 12-06-2015
Quote:
Originally Posted by RudiC
You need to replace
Code:
sum += d[key, i]

with
Code:
sum += data[key, i]

to have the sum printed.

---------- Post updated at 13:29 ---------- Previous update was at 11:57 ----------

Not sure if this will perform better (or even worse?) but it leaves the memory consumption to sort which has options to handle that. We need the files' names upfront, thus the ls trick. Try
Code:
{ ls -x file[1-3]; grep '' file[1-3] | sort -k2,2 -k1,1; } | awk  '
NR == 1         {for (FC=n=split ($0, FN); n>0; n--) FILES[FN[n]] = n
                 next
                }
 ... ... ...

Thank you for catching my typo. (I made the mistake of trying to make the array name more descriptive after I tested the script, and missed the last needed change.)

Note that ls -x file* can produce more than one line of output depending on the length of the list of file names. We know that the input files are selected by $BASEPATH/dataset/*, but we haven't been given any indication of the lengths of the names matched by * in that pattern, nor of what the real length of the expansion of BASEPATH will be.
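
One possible workaround, offered only as a sketch and assuming none of the file names contain spaces or colons (the filename:key split in the awk program already requires that): let the shell's builtin echo print the expanded glob itself, which always lands on a single line no matter how many names there are, instead of relying on ls -x's column layout. With the awk program from post #4 saved in a hypothetical file rudi.awk, that would look something like:
Code:
{ echo dataset/*; grep '' dataset/* | sort -k2,2 -k1,1; } | awk -f rudi.awk

The NR == 1 block then still sees every file name in its single header record, and the FILES lookups keep working because grep prefixes each data line with exactly the same dataset/... names that echo printed.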