Text string parsing in awk


 
# 1  
Old 03-13-2014

I have an awk script that parses many millions of lines, so performance is critical. At one point I am extracting some variables from a space-delimited string.

Code:
alarm = $11; len = split(alarm,a," "); ent = a[3]; chem = a[4]; for (i = 5; i<= len; i++) {chem = chem " " a[i]}

It works but is slow. Adding the array slowed things down, and adding the for loop made it even worse. Is there a faster way to do this with string functions? In Bash, I'd do substitutions. Is there something built into awk that puts the text between the 2nd and 3rd spaces of a string into one variable and everything after the 3rd space into another, without arrays or loops?

Code:
alarm = "padding1 padding2 ent_name chem_name can have spaces but goes to end of string"

I want:
Code:
ent = "ent_name"
chem = "chem_name can have spaces but goes to end of string"

Mike

Last edited by Michael Stora; 03-13-2014 at 09:33 PM..
# 2  
Old 03-13-2014
If you can provide the entire script, I will see whether there is anything to tune.
# 3  
Old 03-13-2014
You haven't shown us what a single line in your input file looks like, and you haven't even shown us one complete action section of your awk script. How did you determine that this section of code is your only performance problem?

With the following complete awk script (which incorporates code very similar to the code you showed us):
Code:
awk '
{	alarm = $0
	len = split(alarm,a," ")
	ent = a[3]
	chem = a[4]
	for (i = 5; i<= len; i++)
		chem = chem " " a[i]
	print "ent=" ent " chem=" chem
}' file

and with file containing 10,000 lines starting with the following:
Code:
junk1-1 junk2-1 ent1 chem1-1 chem2-1 chem3-1 chem4-1 chem5-1 chem6-1 chem7-1 chem7-1 chem8-1
junk1-2 junk2-2 ent2 chem1-2 chem2-2 chem3-2 chem4-2 chem5-2 chem6-2 chem7-2 chem7-2 chem8-2
junk1-3 junk2-3 ent3 chem1-3 chem2-3 chem3-3 chem4-3 chem5-3 chem6-3 chem7-3 chem7-3 chem8-3
junk1-4 junk2-4 ent4 chem1-4 chem2-4 chem3-4 chem4-4 chem5-4 chem6-4 chem7-4 chem7-4 chem8-4
junk1-5 junk2-5 ent5 chem1-5 chem2-5 chem3-5 chem4-5 chem5-5 chem6-5 chem7-5 chem7-5 chem8-5
junk1-6 junk2-6 ent6 chem1-6 chem2-6 chem3-6 chem4-6 chem5-6 chem6-6 chem7-6 chem7-6 chem8-6
junk1-7 junk2-7 ent7 chem1-7 chem2-7 chem3-7 chem4-7 chem5-7 chem6-7 chem7-7 chem7-7 chem8-7
junk1-8 junk2-8 ent8 chem1-8 chem2-8 chem3-8 chem4-8 chem5-8 chem6-8 chem7-8 chem7-8 chem8-8
junk1-9 junk2-9 ent9 chem1-9 chem2-9 chem3-9 chem4-9 chem5-9 chem6-9 chem7-9 chem7-9 chem8-9
junk1-10 junk2-10 ent10 chem1-10 chem2-10 chem3-10 chem4-10 chem5-10 chem6-10 chem7-10 chem7-10 chem8-10

I get 10,000 lines of output starting with:
Code:
ent=ent1 chem=chem1-1 chem2-1 chem3-1 chem4-1 chem5-1 chem6-1 chem7-1 chem7-1 chem8-1
ent=ent2 chem=chem1-2 chem2-2 chem3-2 chem4-2 chem5-2 chem6-2 chem7-2 chem7-2 chem8-2
ent=ent3 chem=chem1-3 chem2-3 chem3-3 chem4-3 chem5-3 chem6-3 chem7-3 chem7-3 chem8-3
ent=ent4 chem=chem1-4 chem2-4 chem3-4 chem4-4 chem5-4 chem6-4 chem7-4 chem7-4 chem8-4
ent=ent5 chem=chem1-5 chem2-5 chem3-5 chem4-5 chem5-5 chem6-5 chem7-5 chem7-5 chem8-5
ent=ent6 chem=chem1-6 chem2-6 chem3-6 chem4-6 chem5-6 chem6-6 chem7-6 chem7-6 chem8-6
ent=ent7 chem=chem1-7 chem2-7 chem3-7 chem4-7 chem5-7 chem6-7 chem7-7 chem7-7 chem8-7
ent=ent8 chem=chem1-8 chem2-8 chem3-8 chem4-8 chem5-8 chem6-8 chem7-8 chem7-8 chem8-8
ent=ent9 chem=chem1-9 chem2-9 chem3-9 chem4-9 chem5-9 chem6-9 chem7-9 chem7-9 chem8-9
ent=ent10 chem=chem1-10 chem2-10 chem3-10 chem4-10 chem5-10 chem6-10 chem7-10 chem7-10 chem8-10

in 0.26 to 0.27 seconds.

The following script:
Code:
awk '
{	alarm = $0
	match(alarm, /^[^ ]* [^ ]* [^ ]* /)
	chem = substr(alarm, RLENGTH + 1)
	s = RLENGTH
	match(alarm, /^[^ ]* [^ ]* /)
	ent = substr(alarm, RLENGTH + 1, s - RLENGTH - 1)
	print "ent=" ent " chem=" chem
}' file

with the same input file produces exactly the same output in 0.06 seconds.

These tests were run using awk on Mac OS X Version 10.7.5 on a MacBook Pro laptop. There is no guarantee that you will see this type of speed up on your system with your data, but it should give you an idea to examine.
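To make the arithmetic in the match() version easier to follow, here is a minimal sketch that prints what each match() call reports for a single sample line. The function name trace_match is made up for illustration; the awk logic is the same as in the script above. If a line has fewer than three spaces, match() fails and RLENGTH is -1, so real input may need a guard.

```shell
# Sketch: show the prefix lengths reported by the two match() calls.
# trace_match is a hypothetical helper name, not part of the thread.
trace_match() {
    printf '%s\n' "$1" | awk '
    {   alarm = $0
        # First match covers fields 1-3 plus the space after each.
        match(alarm, /^[^ ]* [^ ]* [^ ]* /)
        s = RLENGTH
        # Second match covers fields 1-2 plus the space after each.
        match(alarm, /^[^ ]* [^ ]* /)
        print "three-field prefix length=" s
        print "two-field prefix length=" RLENGTH
        # ent sits between the two prefixes, minus its trailing space.
        print "ent=" substr(alarm, RLENGTH + 1, s - RLENGTH - 1)
        # chem is everything after the three-field prefix.
        print "chem=" substr(alarm, s + 1)
    }'
}

trace_match "junk1-1 junk2-1 ent1 chem1-1 chem2-1"
```

For this sample line the two prefix lengths are 21 and 16, which leaves the 4 characters of ent1 in between.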
This User Gave Thanks to Don Cragun For This Post:
# 4  
Old 03-14-2014
On my system those two versions performed about the same. Whichever one is run first is affected by caching, but after that they were indistinguishable on performance.

I did get about a 40% reduction in runtime using index and substr:

Code:
awk '
{   alarm = $0
    p=index(alarm, " ")
    q=index(substr(alarm,p), " ")+p
    r=index(substr(alarm,q), " ")+q
    s=index(substr(alarm,r), " ")+r
    chem = substr(alarm, s)
    ent = substr(alarm,r,s-r-1)
    print "ent=" ent " chem=" chem
}' file

With 1 million records, average times were: original 10.580s; Don's 10.430s; this one 6.190s.
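One detail in the index() version: substr(alarm,p) starts at the first space itself, so the second index() call always returns 1 and q is simply p+1. A sketch of a variant that skips that step and uses three index() calls is below. The name index_parse is made up for illustration, and this variant has not been timed, so whether it is any faster is an open question.

```shell
# Sketch: locate the first three spaces with three index() calls.
# index_parse is a hypothetical helper name; this variant is untimed.
index_parse() {
    printf '%s\n' "$1" | awk '
    {   alarm = $0
        p = index(alarm, " ")                      # position of 1st space
        r = p + index(substr(alarm, p + 1), " ")   # position of 2nd space
        s = r + index(substr(alarm, r + 1), " ")   # position of 3rd space
        ent  = substr(alarm, r + 1, s - r - 1)     # text between 2nd and 3rd space
        chem = substr(alarm, s + 1)                # everything after 3rd space
        print "ent=" ent " chem=" chem
    }'
}

index_parse "junk1-1 junk2-1 ent1 chem1-1 chem2-1"
```

Like the original, this assumes every line contains at least three spaces.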
This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 03-14-2014
Clearly the results vary considerably from system to system. I upped my input test file to 1,000,000 lines and ran each set of code 11 times. I threw out the 1st run for each awk script. The remaining results on my system are:
Code:
    Original    	  Don Cragun   		  Chubler_XL   
================	===============		===============
real	0m26.82s	real	0m5.91s		real	0m6.59s
user	0m25.79s	user	0m5.19s		user	0m6.07s
sys	0m0.68s		sys	0m0.48s		sys	0m0.47s
		
real	0m26.62s	real	0m5.72s		real	0m6.63s
user	0m25.69s	user	0m5.17s		user	0m6.06s
sys	0m0.66s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.95s	real	0m5.85s		real	0m6.63s
user	0m25.79s	user	0m5.18s		user	0m6.07s
sys	0m0.67s		sys	0m0.47s		sys	0m0.46s
		
real	0m26.80s	real	0m5.73s		real	0m6.73s
user	0m25.81s	user	0m5.16s		user	0m6.11s
sys	0m0.68s		sys	0m0.48s		sys	0m0.47s
		
real	0m26.79s	real	0m5.82s		real	0m6.65s
user	0m25.89s	user	0m5.18s		user	0m6.08s
sys	0m0.68s		sys	0m0.47s		sys	0m0.47s
		
real	0m27.20s	real	0m5.76s		real	0m6.64s
user	0m26.01s	user	0m5.18s		user	0m6.05s
sys	0m0.69s		sys	0m0.47s		sys	0m0.47s
		
real	0m27.12s	real	0m5.74s		real	0m6.61s
user	0m26.04s	user	0m5.17s		user	0m6.05s
sys	0m0.68s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.99s	real	0m5.78s		real	0m6.68s
user	0m26.11s	user	0m5.19s		user	0m6.07s
sys	0m0.67s		sys	0m0.48s		sys	0m0.47s
		
real	0m27.00s	real	0m5.78s		real	0m6.65s
user	0m25.98s	user	0m5.17s		user	0m6.09s
sys	0m0.66s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.71s	real	0m5.84s		real	0m6.64s
user	0m25.85s	user	0m5.21s		user	0m6.07s
sys	0m0.67s		sys	0m0.48s		sys	0m0.47s

On OS X, awk using index() 4 times and substr() 5 times seems to be a little slower than using match() 2 times and substr() 2 times. Both are considerably faster than using split() and a for loop to append fields to create the concatenation of the 4th through the NFth fields.
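For anyone who wants to repeat this comparison on their own system, here is a rough sketch of a test setup. The helper names make_input and bench are made up for illustration, and the generated lines only approximate the test file shown earlier in the thread. The bench function relies on the shell providing time (a keyword in bash and ksh), which is an assumption about your shell.

```shell
# Sketch: generate N test lines similar to the sample input in this thread.
# make_input and bench are hypothetical helper names.
make_input() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) {
            line = "junk1-" i " junk2-" i " ent" i
            for (j = 1; j <= 8; j++)
                line = line " chem" j "-" i
            print line
        }
    }'
}

# bench FILE AWK_PROGRAM: run the program 11 times and report each timing.
# The first timing is meant to be discarded by hand, as described above.
bench() {
    i=0
    while [ "$i" -lt 11 ]; do
        time awk "$2" "$1" > /dev/null
        i=$((i + 1))
    done
}

make_input 2
```

A typical use would be make_input 1000000 > file, followed by one bench call per candidate awk program.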
This User Gave Thanks to Don Cragun For This Post:
# 6  
Old 03-14-2014
How about the below?

Code:
awk '{print gensub(/^[^ ]+[ ][^ ]+[ ]([^ ]+)[ ](.*)$/, "ent=\\1" FS "chem=\\2", "g", $0)}'

We can use $11 instead of $0 while parsing the file.

Don, could you please test my code and give me the times?
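Note that gensub() is a GNU awk extension, so the one-liner above will not run in other awks such as the one shipped with OS X. A rough portable sketch of the same idea using only sub() is below. The name sub_parse is made up for illustration, and this version has not been timed; with three sub() calls per line it may well be slower than the match() or index() versions.

```shell
# Sketch: portable alternative to the gensub() one-liner using sub().
# sub_parse is a hypothetical helper name; this version is untimed.
sub_parse() {
    printf '%s\n' "$1" | awk '
    {   chem = $0
        # Delete the first three fields and the space after each.
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", chem)
        ent = $0
        sub(/^[^ ]+ [^ ]+ /, "", ent)   # drop the first two fields
        sub(/ .*$/, "", ent)            # drop everything from the next space on
        print "ent=" ent FS "chem=" chem
    }'
}

sub_parse "junk1-1 junk2-1 ent1 chem1-1 chem2-1"
```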

Last edited by SriniShoo; 03-14-2014 at 06:11 AM.. Reason: request
This User Gave Thanks to SriniShoo For This Post:
# 7  
Old 03-14-2014
Thanks a lot, guys. As the script has grown and gained a lot of function calls for date and time processing and other things, it has become very slow. As far as parsing this one variable goes, I found out that the first field is always 4 characters and the second is always 6 characters (the rest vary). Because of this I did something very similar to Chubler_XL's approach, but knowing that the second space is always at position 12:
Code:
alarm = $11; alarmEnd = substr(alarm,13); i=index(alarmEnd," ")
ent = substr(alarmEnd,1,i-1); chem = substr(alarmEnd,i+1)
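As a quick check of the fixed-width approach, here is a small sketch that runs the same logic on one made-up line whose first field is 4 characters and second field is 6 characters. The function name fixed_parse and the sample string are assumptions for illustration; the sketch reads $0 where the real script reads $11.

```shell
# Sketch: fixed-width variant, assuming field 1 is 4 chars and field 2 is 6 chars.
# fixed_parse and the sample line are hypothetical.
fixed_parse() {
    printf '%s\n' "$1" | awk '
    {   alarm = $0                     # the script in this thread uses $11 here
        alarmEnd = substr(alarm, 13)   # skip 4 chars + space + 6 chars + space
        i = index(alarmEnd, " ")       # first space after ent
        ent  = substr(alarmEnd, 1, i - 1)
        chem = substr(alarmEnd, i + 1)
        print "ent=" ent " chem=" chem
    }'
}

fixed_parse "ABCD 123456 ent_name chem name here"
```

This only holds while the 4-character and 6-character widths stay fixed; a line that breaks that assumption would be split in the wrong place without any error.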

I need to study the other solutions to learn something. I have learned a lot of awk in the last couple of days.

Mike