Text string parsing in awk

03-13-2014

I have an awk script that parses many millions of lines, so performance is critical. At one point I am extracting some variables from a space-delimited string.

Code:

alarm = $11; len = split(alarm,a," "); ent = a[3]; chem = a[4]; for (i = 5; i<= len; i++) {chem = chem " " a[i]}

It works but is slow. Adding the array slowed things down. Adding a for loop made it even worse. Is there a faster way to do what I am trying to do with string functions? In BASH, I'd do substitutions. Is there something built into awk to take the text between the 2nd and 3rd space in a string into one variable and everything after the 3rd space into another, without arrays or loops?

Code:

alarm = "padding1 padding2 ent_name chem_name can have spaces but goes to end of string"

I want:

Code:

ent = "ent_name"
chem = "chem_name can have spaces but goes to end of string"

Mike

Last edited by Michael Stora; 03-13-2014 at 09:33 PM..

Michael Stora

03-13-2014

If you can provide the entire script, I will see if there is an option to tune it.

SriniShoo

03-13-2014

You haven't shown us what a single line in your input file looks like, and you haven't even shown us one complete action section of your awk script. How did you determine that this section of code is your only performance problem?

With the following complete awk script (which incorporates code very similar to the code you showed us):

Code:

awk '
{	alarm = $0
	len = split(alarm,a," ")
	ent = a[3]
	chem = a[4]
	for (i = 5; i<= len; i++)
		chem = chem " " a[i]
	print "ent=" ent " chem=" chem
}' file

and with file containing 10,000 lines starting with the following:

Code:

junk1-1 junk2-1 ent1 chem1-1 chem2-1 chem3-1 chem4-1 chem5-1 chem6-1 chem7-1 chem7-1 chem8-1
junk1-2 junk2-2 ent2 chem1-2 chem2-2 chem3-2 chem4-2 chem5-2 chem6-2 chem7-2 chem7-2 chem8-2
junk1-3 junk2-3 ent3 chem1-3 chem2-3 chem3-3 chem4-3 chem5-3 chem6-3 chem7-3 chem7-3 chem8-3
junk1-4 junk2-4 ent4 chem1-4 chem2-4 chem3-4 chem4-4 chem5-4 chem6-4 chem7-4 chem7-4 chem8-4
junk1-5 junk2-5 ent5 chem1-5 chem2-5 chem3-5 chem4-5 chem5-5 chem6-5 chem7-5 chem7-5 chem8-5
junk1-6 junk2-6 ent6 chem1-6 chem2-6 chem3-6 chem4-6 chem5-6 chem6-6 chem7-6 chem7-6 chem8-6
junk1-7 junk2-7 ent7 chem1-7 chem2-7 chem3-7 chem4-7 chem5-7 chem6-7 chem7-7 chem7-7 chem8-7
junk1-8 junk2-8 ent8 chem1-8 chem2-8 chem3-8 chem4-8 chem5-8 chem6-8 chem7-8 chem7-8 chem8-8
junk1-9 junk2-9 ent9 chem1-9 chem2-9 chem3-9 chem4-9 chem5-9 chem6-9 chem7-9 chem7-9 chem8-9
junk1-10 junk2-10 ent10 chem1-10 chem2-10 chem3-10 chem4-10 chem5-10 chem6-10 chem7-10 chem7-10 chem8-10

I get 10,000 lines of output starting with:

Code:

ent=ent1 chem=chem1-1 chem2-1 chem3-1 chem4-1 chem5-1 chem6-1 chem7-1 chem7-1 chem8-1
ent=ent2 chem=chem1-2 chem2-2 chem3-2 chem4-2 chem5-2 chem6-2 chem7-2 chem7-2 chem8-2
ent=ent3 chem=chem1-3 chem2-3 chem3-3 chem4-3 chem5-3 chem6-3 chem7-3 chem7-3 chem8-3
ent=ent4 chem=chem1-4 chem2-4 chem3-4 chem4-4 chem5-4 chem6-4 chem7-4 chem7-4 chem8-4
ent=ent5 chem=chem1-5 chem2-5 chem3-5 chem4-5 chem5-5 chem6-5 chem7-5 chem7-5 chem8-5
ent=ent6 chem=chem1-6 chem2-6 chem3-6 chem4-6 chem5-6 chem6-6 chem7-6 chem7-6 chem8-6
ent=ent7 chem=chem1-7 chem2-7 chem3-7 chem4-7 chem5-7 chem6-7 chem7-7 chem7-7 chem8-7
ent=ent8 chem=chem1-8 chem2-8 chem3-8 chem4-8 chem5-8 chem6-8 chem7-8 chem7-8 chem8-8
ent=ent9 chem=chem1-9 chem2-9 chem3-9 chem4-9 chem5-9 chem6-9 chem7-9 chem7-9 chem8-9
ent=ent10 chem=chem1-10 chem2-10 chem3-10 chem4-10 chem5-10 chem6-10 chem7-10 chem7-10 chem8-10

in 0.26 to 0.27 seconds.

The following script:

Code:

awk '
{	alarm = $0
	match(alarm, /^[^ ]* [^ ]* [^ ]* /)
	chem = substr(alarm, RLENGTH + 1)
	s = RLENGTH
	match(alarm, /^[^ ]* [^ ]* /)
	ent = substr(alarm, RLENGTH + 1, s - RLENGTH - 1)
	print "ent=" ent " chem=" chem
}' file

with the same input file produces exactly the same output in 0.06 seconds.

These tests were run using awk on Mac OS X Version 10.7.5 on a MacBook Pro laptop. There is no guarantee that you will see this kind of speedup on your system with your data, but it should give you an idea to examine.
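To make the arithmetic in the match() version easier to follow, here is a small trace on the sample string from the original question. The variable name t and the extra debug print are additions for illustration; only match(), RLENGTH and substr() from the script above are used.

```shell
# Illustrative trace only; the sample line comes from the original question.
line='padding1 padding2 ent_name chem_name can have spaces but goes to end of string'
out=$(printf '%s\n' "$line" | awk '
{   alarm = $0
    match(alarm, /^[^ ]* [^ ]* [^ ]* /)   # first three fields plus trailing space
    s = RLENGTH                            # chem starts at position s + 1
    match(alarm, /^[^ ]* [^ ]* /)          # first two fields plus trailing space
    t = RLENGTH                            # ent starts at position t + 1
    ent  = substr(alarm, t + 1, s - t - 1) # drop the space that ends ent
    chem = substr(alarm, s + 1)
    print "s=" s " t=" t
    print "ent=" ent
    print "chem=" chem
}')
printf '%s\n' "$out"
```

For this line the two match lengths are 27 and 18, so ent is the 8 characters starting at position 19 and chem is everything from position 28 onward.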

Don Cragun

03-14-2014

On my system those two versions performed about the same. Note that run order matters because of caching and similar effects, but after the first run they were indistinguishable in performance.

I did get roughly a 40% reduction in runtime using index and substr:

Code:

awk '
{   alarm = $0
    p=index(alarm, " ")
    q=index(substr(alarm,p), " ")+p
    r=index(substr(alarm,q), " ")+q
    s=index(substr(alarm,r), " ")+r
    chem = substr(alarm, s)
    ent = substr(alarm,r,s-r-1)
    print "ent=" ent " chem=" chem
}' file

With 1 million records, the average times were: original 10.580s; Don's 10.430s; this one 6.190s.
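As a sketch of what p, q, r and s hold in the index/substr version, here is a trace on a short made-up line (the sample data and the debug print are assumptions, not from the thread). Because substr(alarm,p) begins with the space itself, the first nested index() call returns 1, so q is simply p + 1, the start of the second field.

```shell
# Hypothetical short input, chosen so the positions are easy to count by hand.
line='a1 b2 ent3 chem one two'
out=$(printf '%s\n' "$line" | awk '
{   alarm = $0
    p = index(alarm, " ")                  # position of the first space
    q = index(substr(alarm, p), " ") + p   # start of field 2 (always p + 1)
    r = index(substr(alarm, q), " ") + q   # start of field 3
    s = index(substr(alarm, r), " ") + r   # start of field 4
    chem = substr(alarm, s)
    ent  = substr(alarm, r, s - r - 1)
    print "p=" p " q=" q " r=" r " s=" s
    print "ent=" ent " chem=" chem
}')
printf '%s\n' "$out"
```

Whether replacing the first nested call with q = p + 1 makes a measurable difference would need to be timed on the real data.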

Chubler_XL

03-14-2014

Clearly the results vary considerably from system to system. I upped my input test file to 1,000,000 lines and ran each set of code 11 times. I threw out the 1st run for each awk script. The remaining results on my system are:

Code:

    Original    	  Don Cragun   		  Chubler_XL   
================	===============		===============
real	0m26.82s	real	0m5.91s		real	0m6.59s
user	0m25.79s	user	0m5.19s		user	0m6.07s
sys	0m0.68s		sys	0m0.48s		sys	0m0.47s
		
real	0m26.62s	real	0m5.72s		real	0m6.63s
user	0m25.69s	user	0m5.17s		user	0m6.06s
sys	0m0.66s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.95s	real	0m5.85s		real	0m6.63s
user	0m25.79s	user	0m5.18s		user	0m6.07s
sys	0m0.67s		sys	0m0.47s		sys	0m0.46s
		
real	0m26.80s	real	0m5.73s		real	0m6.73s
user	0m25.81s	user	0m5.16s		user	0m6.11s
sys	0m0.68s		sys	0m0.48s		sys	0m0.47s
		
real	0m26.79s	real	0m5.82s		real	0m6.65s
user	0m25.89s	user	0m5.18s		user	0m6.08s
sys	0m0.68s		sys	0m0.47s		sys	0m0.47s
		
real	0m27.20s	real	0m5.76s		real	0m6.64s
user	0m26.01s	user	0m5.18s		user	0m6.05s
sys	0m0.69s		sys	0m0.47s		sys	0m0.47s
		
real	0m27.12s	real	0m5.74s		real	0m6.61s
user	0m26.04s	user	0m5.17s		user	0m6.05s
sys	0m0.68s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.99s	real	0m5.78s		real	0m6.68s
user	0m26.11s	user	0m5.19s		user	0m6.07s
sys	0m0.67s		sys	0m0.48s		sys	0m0.47s
		
real	0m27.00s	real	0m5.78s		real	0m6.65s
user	0m25.98s	user	0m5.17s		user	0m6.09s
sys	0m0.66s		sys	0m0.47s		sys	0m0.47s
		
real	0m26.71s	real	0m5.84s		real	0m6.64s
user	0m25.85s	user	0m5.21s		user	0m6.07s
sys	0m0.67s		sys	0m0.48s		sys	0m0.47s

On OS X, awk using index() 4 times and substr() 5 times seems to be a little slower than using match() 2 times and substr() 2 times. Both are considerably faster than using split() and a for loop to append fields to create the concatenation of the 4th through the NFth fields.
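Before comparing times, it is worth confirming that two variants produce identical output on the same input. The following is a minimal sketch of such a check. The line count N, the temporary file names and the generated line layout are assumptions that only loosely mimic the test file shown earlier.

```shell
# Assumptions: N and the temporary file names are hypothetical.
N=1000
tmp=$(mktemp -d)

# Generate N lines shaped like: junk1-i junk2-i enti chem1-i ... chem8-i
awk -v n="$N" 'BEGIN {
    for (i = 1; i <= n; i++) {
        line = "junk1-" i " junk2-" i " ent" i
        for (j = 1; j <= 8; j++) line = line " chem" j "-" i
        print line
    }
}' > "$tmp/file"

# split() and loop version
awk '{ len = split($0, a, " "); ent = a[3]; chem = a[4]
       for (i = 5; i <= len; i++) chem = chem " " a[i]
       print "ent=" ent " chem=" chem }' "$tmp/file" > "$tmp/out.split"

# match() and substr() version
awk '{ match($0, /^[^ ]* [^ ]* [^ ]* /); s = RLENGTH
       chem = substr($0, s + 1)
       match($0, /^[^ ]* [^ ]* /)
       ent = substr($0, RLENGTH + 1, s - RLENGTH - 1)
       print "ent=" ent " chem=" chem }' "$tmp/file" > "$tmp/out.match"

# Compare the two result files byte for byte.
if cmp -s "$tmp/out.split" "$tmp/out.match"; then same=yes; else same=no; fi
lines=$(wc -l < "$tmp/file" | tr -d ' ')
echo "lines=$lines identical=$same"
rm -rf "$tmp"
```

To reproduce a timing comparison, raise N and run each awk command under the shell's time facility several times, discarding the first run as described above.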

Don Cragun

03-14-2014

How about the below?

Code:

awk '{print gensub(/^[^ ]+[ ][^ ]+[ ]([^ ]+)[ ](.*)$/, "ent=\\1" FS "chem=\\2", "g", $0)}'

We can use $11 instead of $0 while parsing the file.

Don, could you please test my code and give me the times?
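Note that gensub() is a GNU awk extension, so the one-liner above will not run on awks that lack it. As a rough portable sketch of the same idea using only sub(), the following strips the first two fields and then splits the remainder at its first space. The variable name rest is made up for illustration, and no claim is made about its speed relative to the other versions.

```shell
# Sample line from the original question; uses sub() only, no gensub().
line='padding1 padding2 ent_name chem_name can have spaces but goes to end of string'
out=$(printf '%s\n' "$line" | awk '
{   rest = $0
    sub(/^[^ ]+ [^ ]+ /, "", rest)   # drop the first two fields
    ent = chem = rest
    sub(/ .*$/, "", ent)             # keep the text before the first remaining space
    sub(/^[^ ]+ /, "", chem)         # keep the text after that space
    print "ent=" ent " chem=" chem
}')
printf '%s\n' "$out"
```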

Last edited by SriniShoo; 03-14-2014 at 06:11 AM.. Reason: request

SriniShoo

03-14-2014

Thanks a lot, guys. As the script has grown and gained a lot of function calls for date and time processing and other things, it has become very slow. As for parsing this one variable, I found out that the first field is always 4 characters and the second is always 6 characters (the rest vary). Because of this, I did something very similar to Chubler_XL's approach, but relying on the second space always being at position 12:

Code:

alarm = $11; alarmEnd = substr(alarm,13); i=index(alarmEnd," ")
ent = substr(alarmEnd,1,i-1); chem = substr(alarmEnd,i+1)
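A worked trace of the fixed-offset approach, on a made-up line whose first two fields are 4 and 6 characters wide, looks like this. The sample data and the guard for a missing chem part are assumptions added for illustration. Without the guard, index() returning 0 would leave ent empty and put the whole remainder in chem.

```shell
# Hypothetical sample: a 4-character field, a 6-character field, then ent and chem.
line='ABCD 123456 ent_name chem_name with spaces'
out=$(printf '%s\n' "$line" | awk '
{   alarm = $0                       # the thread uses $11 here
    alarmEnd = substr(alarm, 13)     # skip 4 + 1 + 6 + 1 = 12 characters
    i = index(alarmEnd, " ")
    if (i == 0) {                    # assumed guard: no chem part present
        ent = alarmEnd; chem = ""
    } else {
        ent  = substr(alarmEnd, 1, i - 1)
        chem = substr(alarmEnd, i + 1)
    }
    print "ent=" ent " chem=" chem
}')
printf '%s\n' "$out"
```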

I need to study the other solutions to learn something. I have learned a lot of AWK in the last couple days.

Mike

Michael Stora
