cut, sed, awk too slow to retrieve line

12-30-2010

cut, sed, awk too slow to retrieve line - other options?

Hi,
I have a script that, basically, has two input files of this type:

file1
key1=value1_1_1
key2=value1_2_1
key4=value1_4_1
...

file2
key2=value2_2_1
key2=value2_2_2
key3=value2_3_1
key4=value2_4_1
...

My files are 10k lines big each (approx).
The keys are strings that contain no whitespace; the values are ordinary text strings that never contain the "=" symbol.

The purpose of the script is to get, from file2, the value of each key that appears in both file1 and file2.

The first part of the script sorts file1 and file2, so that the comparison pass is O(n) rather than O(n^2). [One could argue about this sort, but that's not the point right now, since it's not the bottleneck.]

Then, basically, I read each line of the (sorted) files, check whether they have the same keys, and if they do, save the value to my output. Otherwise, get the next line of the file which has the smallest key.
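The sort-then-merge pass described above is roughly what the standard join utility does. A minimal sketch, assuming the keys are unique in file1, the values contain no "=", and stray whitespace sits only around the key; the temporary directory, the normalize helper and the sample lines with extra whitespace are made up for illustration:

```shell
#!/bin/sh
# Sketch only: sort-merge join using join(1). Sample data adapted from the post.
workdir=$(mktemp -d)
printf 'key1=value1_1_1\n key2 =value1_2_1\nkey4=value1_4_1\n' > "$workdir/file1"
printf 'key2=value2_2_1\nkey2 =value2_2_2\nkey3=value2_3_1\n\tkey4 =value2_4_1\n' > "$workdir/file2"

# normalize FILE (hypothetical helper): strip whitespace around the key,
# then sort on the key field in the C locale so that join agrees with sort.
normalize() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=/=/' "$1" | LC_ALL=C sort -t= -k1,1
}

normalize "$workdir/file1" > "$workdir/file1_sorted"
normalize "$workdir/file2" > "$workdir/file2_sorted"

# -t= splits fields on "="; -o 2.1,2.2 prints key and value from file2 only.
result=$(LC_ALL=C join -t= -o 2.1,2.2 "$workdir/file1_sorted" "$workdir/file2_sorted")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With this sample data the script prints the two key2 lines and the key4 line from file2. Each file is read once by sed, once by sort and once by join, instead of once per line.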

The problem here is to get the keys. After running the script once, I noticed the files were generated with random whitespace around the key (before it, or between it and the "=" symbol). I can't change the generator, so I had to change the script.

I tried three variations of it:

A - sed on the line:

Code:

lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | sed -e 's/\s*\(\S*\)\s*=.*/\1/g'`

This sed extracts the run of non-whitespace characters to the left of the equals sign.

As you might imagine, that took an awful lot of time.

Code:

real    0m1.030s
user    0m0.996s
sys     0m0.028s

This is clearly not acceptable, since I have to do the operation over 20k lines.

So I tried option B:
B - using cut on each line

Code:

lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | cut -d '=' -f 1 | sed -e 's/\s//g'`

That wasn't that much better...

Code:

real    0m0.659s
user    0m0.632s
sys     0m0.028s

Still not acceptable.

So I browsed this forum a bit, read somewhere that "cat foo | bar" wasn't recommended, and changed the code a little.

I didn't need that lineFile1 there, so there was no point in retrieving it.
I added

Code:

cut -d '=' -f 1 file1_sorted > file1_keys_sorted

before my calls, and I'm now using

Code:

keyFile1=`awk "NR==${currLine1}" file1_keys_sorted`

to get the key.

This is way better:

Code:

real    0m0.043s
user    0m0.032s
sys     0m0.008s

The problem is that it still takes too much time. From my logs, I'm processing approximately 20 lines per second, which means one loop iteration takes ~0.050 sec (this includes the awk I'm running on the file1_sorted file to get the output). That works out to roughly 15 min for a 20k-line input.

Is there some way of speeding up that process? (Clearly, the bottleneck is retrieving each line.)

Thanks!

PS: For some reason, the process is only taking 8% of my CPU at max. Are there some commands that are slow? (echo, perhaps?)
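On the per-line cost and the low CPU usage: each iteration above starts several short-lived processes, and awk "NR==N" reads the whole file every time, so much of the time probably goes to process start-up and rescanning rather than to real work. A sketch of reading a file once and extracting the key with shell parameter expansion only; extract_key, ws and the sample lines are hypothetical names and data, not part of the original script:

```shell
#!/bin/sh
ws=$(printf ' \t')   # the whitespace to trim: space and tab

# extract_key LINE: sets $key to the text before the first "=",
# with leading and trailing whitespace removed. No external process is started.
extract_key() {
    key=${1%%=*}                     # drop the first "=" and everything after it
    key=${key#"${key%%[!$ws]*}"}     # trim leading whitespace
    key=${key%"${key##*[!$ws]}"}     # trim trailing whitespace
}

sample=$(mktemp)
printf 'key1=value1_1_1\n\tkey2 =value1_2_1\n  key4  =value1_4_1\n' > "$sample"

# One pass over the file: the shell keeps its read position between lines,
# so nothing is rescanned from the top.
keys=
while IFS= read -r line; do
    extract_key "$line"
    keys="$keys$key "
done < "$sample"
rm -f "$sample"

printf '%s\n' "$keys"
```

This only removes the per-line process overhead; a shell loop over 20k lines is still likely to be slower than a single awk pass over the same data.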

fzd

12-30-2010

Code:

awk -F= 'NR==FNR{a[$1]=$2;next}{if(a[$1]) print $2;}' file1 file2

Output for above file1 and file2 (values of key2 and key4):

Code:

value2_2_1
value2_2_2
value2_4_1

Is this the expected output? If not, pls post the expected one.

anurag.singh

12-30-2010

awk supports associative arrays. Also, when you run tests, you should use a reasonable dataset, not just 100 lines. The reason is: what you get with the time command does not reflect your algorithm as much as it reflects creating a process, opening files, etc.

For keys common to file1 and file2 (try this on a big file):

Code:

awk -F'='  'FILENAME=="file1" { arr[$1]=$2; next }
               FILENAME=="file2" { if($1 in arr) {print $1, arr[$1], $2} } ' file1 file2 | sort

jim mcnamara

12-30-2010

Quote:

Originally Posted by anurag.singh

Code:

awk -F= 'NR==FNR{a[$1]=$2;next}{if(a[$1]) print $2;}' file1 file2

Output for above file1 and file2 (values of key2 and key4):

Code:

value2_2_1
value2_2_2
value2_4_1

Is this the expected output? If not, pls post the expected one.

expected output would be

Code:

key2=value2_2_1
key2=value2_2_2
key4=value2_4_1

I'm not an awk expert (clearly not :-) ), but this is close.
The only problem is that one of the input files can contain lines like "key1 =value2_1_2" (with that whitespace) or "\tkey2 =value2_2_2", which don't match the pattern here.

@Jim McNamara
I'm running my tests on my 10k files

However, in order to get the results of the time queries, I run time on the exact query, not on the whole process.

fzd

12-30-2010

Code:

awk 'NR==FNR{idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);a[$1]++;next;}{b=$0;idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);if(a[$1]) print b;}' file1 file2

This should be able to handle all cases like

Code:

key=value
key =value
key= value
key = value
     key = value
.
.
.
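For readability, here is the same idea spread over several lines. It is a sketch rather than a drop-in replacement: key_of and seen are names picked for illustration, the sample files are invented, and membership is tested with "in" instead of the counter used in the one-liner.

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf 'key1=value1_1_1\nkey2 =value1_2_1\n\tkey4= value1_4_1\n' > "$workdir/file1"
printf 'key2=value2_2_1\n  key2 = value2_2_2\nkey3=value2_3_1\nkey4 =value2_4_1\n' > "$workdir/file2"

result=$(awk '
    # Default field splitting already drops whitespace around the first field,
    # so the key is that field, cut at "=" when the "=" is glued to it.
    function key_of(field,    pos) {
        pos = index(field, "=")
        return pos ? substr(field, 1, pos - 1) : field
    }
    NR == FNR { seen[key_of($1)] = 1; next }   # first file: remember every key
    key_of($1) in seen                         # second file: print matching lines unchanged
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Matching lines from file2 are printed exactly as they appear in the input, including any stray whitespace.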

anurag.singh

12-30-2010

Hi, try this.

I modified Anurag's code:

Code:

awk -F"[ =\t]" 'NR==FNR{a[$1]=$2;next}a[$1] || a[$2] { print}'  file1 file2
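One caveat with this variant, as far as I can tell: the test a[$1] || a[$2] relies on the stored value being non-empty, and a file1 line such as "key2 =value1_2_1" stores an empty string for key2 (the field between the space and the "=" is empty), so that key would be skipped. A sketch that avoids depending on the stored value, assuming the first "=" always ends the key; seen and key are illustrative names and the sample files are invented:

```shell
#!/bin/sh
workdir=$(mktemp -d)
printf 'key1=value1_1_1\nkey2 =value1_2_1\nkey4=value1_4_1\n' > "$workdir/file1"
printf 'key2=value2_2_1\n\tkey2 =value2_2_2\nkey3=value2_3_1\nkey4=value2_4_1\n' > "$workdir/file2"

# The field separator is "optional blanks followed by =", so $1 is the key
# without trailing whitespace; leading whitespace is trimmed with sub().
result=$(awk -F'[ \t]*=' '
    { key = $1; sub(/^[ \t]+/, "", key) }
    NR == FNR { seen[key]; next }   # first file: record the key, value ignored
    key in seen                     # second file: print lines whose key was recorded
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Using "in" also avoids creating empty array entries for keys that are merely looked up.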

pravin27

12-30-2010

Quote:

Originally Posted by anurag.singh

Code:

awk 'NR==FNR{idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);a[$1]++;next;}{b=$0;idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);if(a[$1]) print b;}' file1 file2

This should be able to handle all cases like

Code:

key=value
key =value
key= value
key = value
     key = value
.
.
.

I just ran it and ... wow ... That was mind-blowing!

Now, the sort is the bottleneck

But that's OK

Thanks a lot!

fzd
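Since the array lookup does not depend on input order, the initial sort of both files may no longer be needed at all; if ordered output is wanted, sorting just the matches is probably cheaper. A sketch assuming plain key=value lines without stray whitespace; the file names and data are illustrative:

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Deliberately unsorted inputs.
printf 'key4=value1_4_1\nkey1=value1_1_1\nkey2=value1_2_1\n' > "$workdir/file1"
printf 'key3=value2_3_1\nkey4=value2_4_1\nkey2=value2_2_2\nkey2=value2_2_1\n' > "$workdir/file2"

# Look up keys in a hash built from file1, then sort only the matching lines.
result=$(awk -F= 'NR == FNR { seen[$1]; next } $1 in seen' \
    "$workdir/file1" "$workdir/file2" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$workdir"
```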
