Top Forums Shell Programming and Scripting Process multiple large files with awk Post 302961899 by Don Cragun on Sunday 6th of December 2015 04:33:34 AM
Looking at your code more closely, it seems that it is even worse than I thought. You are loading all of the data from all of the files into awk on each run, but only printing a fraction of the results. You are also being inconsistent in your pathnames (sometimes using pathnames relative to the current directory and sometimes using the directory specified by $BASEPATH). The following should run MUCH faster for you:
Code:
#!/bin/bash
BASEPATH="/path/to/your/data"	# Must be an absolute pathname
		# Source files must be in directory $BASEPATH/dataset
FILENAME="result"	# Store results in $BASEPATH/processed/$FILENAME

cd "$BASEPATH" || exit 1	# Don't write anywhere else if the cd fails
date '+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'
awk '
FNR == 1 {	# first record of a new input file; fn numbers the files
	fn++
}
{	keys[key = $1 OFS $2]	# remember each key (fields 1 and 2)
	data[key, fn] = $3	# this file's value for the key
}
END {	for(key in keys) {
		sum = 0
		printf("%s", key)
		for(i = 1; i <= fn; i++) {
			printf("%s%d", OFS, data[key, i])
			sum += data[key, i]
		}
		print OFS sum
	}
}' "dataset/"* > "processed/$FILENAME"
date '+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S'
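To see the per-file counter idiom in isolation: FNR restarts at 1 in every input file, so incrementing fn on FNR == 1 gives each file a 1-based column number, and data[key, fn] holds one value per key per file (a missing value prints as 0 through %d). Here is a minimal sketch of the same technique with two tiny made-up input files (the /tmp/awkdemo paths are just for the demo, not part of your setup):

```shell
# Minimal sketch of the multi-file summing technique (demo paths are hypothetical).
mkdir -p /tmp/awkdemo/dataset
printf 'a x 1\nb y 2\n' > /tmp/awkdemo/dataset/f1
printf 'a x 3\nb y 4\n' > /tmp/awkdemo/dataset/f2

out=$(awk '
FNR == 1 { fn++ }              # FNR resets to 1 in each new file, so fn numbers the files
{ keys[key = $1 OFS $2]        # collect every key (fields 1 and 2)
  data[key, fn] = $3 }         # one stored value per key per file
END { for (key in keys) {
        sum = 0
        printf("%s", key)
        for (i = 1; i <= fn; i++) {
                printf("%s%d", OFS, data[key, i])
                sum += data[key, i]
        }
        print OFS sum
      } }' /tmp/awkdemo/dataset/* | sort)
printf '%s\n' "$out"
# -> a x 1 3 4
#    b y 2 4 6
```

Each output row is the key, one column per input file, and the sum as the final column, exactly as in the full script above.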

or, if you need the output sorted, change the next-to-last line of the script to:
Code:
}' "dataset/"* | sort -k2,2 -o "processed/$FILENAME"
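For reference, -k2,2 restricts the sort key to exactly the second whitespace-separated field (the sample name here), and -o names the output file. A small illustration with made-up rows:

```shell
# sort -k2,2 orders rows by the second field only (illustrative data).
out=$(printf 'a sample_3 1\na.b sample_2 2\na sample_1 3\n' | sort -k2,2)
printf '%s\n' "$out"
# -> a sample_1 3
#    a.b sample_2 2
#    a sample_3 1
```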

With your 3 sample input files located in the directory $BASEPATH/dataset, the first version stores the following data in the file named $BASEPATH/processed/$FILENAME:
Code:
a sample_1 200 10 1 211
a sample_3 10 67 0 77
a sample_4 0 0 20 20
a.b sample_2 10 0 10 20

(although the output order may vary with different versions of awk, since for(key in keys) visits array elements in an unspecified order) or, if you pipe the output through sort:
Code:
a sample_1 200 10 1 211
a.b sample_2 10 0 10 20
a sample_3 10 67 0 77
a sample_4 0 0 20 20

I didn't see any need to clear the screen before outputting two lines of data, but you can add the clear back into the top of the script if you want to.

As always, if you are using this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Last edited by Don Cragun; 12-06-2015 at 04:14 PM.. Reason: Fixed typo noted by RudiC: s/d\[/data[/
 
