cut, sed, awk too slow to retrieve line - other options?


 
# 1  
Old 12-30-2010

Hi,
I have a script that, basically, has two input files of this type:

file1
key1=value1_1_1
key2=value1_2_1
key4=value1_4_1
...

file2
key2=value2_2_1
key2=value2_2_2
key3=value2_3_1
key4=value2_4_1
...

My files are approximately 10k lines each.
The keys are strings that don't contain whitespace; the values are ordinary text strings that don't contain the "=" symbol.

The purpose of the script is to get from file2 the value of each key that appears in both file1 and file2.

The first part of the script sorts file1 and file2, so that the comparison pass is O(n) rather than O(n^2). [The sort itself could be debated, but that's not the point right now, since it's not the bottleneck.]

Then, basically, I read each line of the (sorted) files and check whether they have the same key; if they do, I save the value to my output. Otherwise, I advance to the next line of whichever file has the smaller key.
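For reference, this kind of sorted merge is what the standard join utility performs. A minimal sketch, assuming the keys carry no stray whitespace and the values contain no "="; common_by_join is a hypothetical helper name, not part of the original script:

```shell
# Sketch only: common_by_join is a hypothetical helper name.
# Assumes keys have no surrounding whitespace and values contain no "=".
common_by_join() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || return 1
    # join needs both inputs sorted on the join field, using the same collation
    LC_ALL=C sort -t= -k1,1 "$1" > "$tmp1"
    LC_ALL=C sort -t= -k1,1 "$2" > "$tmp2"
    # -t=        fields are separated by "="
    # -o 2.1,2.2 print the key and the value taken from the second file
    LC_ALL=C join -t= -o 2.1,2.2 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}
```

Usage would be something like: common_by_join file1 file2 > output. Sorting on the key field only (-k1,1) matters, because a plain sort of whole lines can order "key10=..." before "key1=...".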

The problem here is getting the keys. After running the script once, I noticed the files were generated with random whitespace around the key (before or after it, ahead of the "=" symbol). I can't change the generator, so I had to change the script.

I tried three variations of it:

A - sed on the line:
Code:
lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | sed -e 's/\s*\(\S*\)\s*=.*/\1/g'`

This sed captures all the non-whitespace characters to the left of the equals sign.

As you might imagine, that took an awful lot of time.
Code:
real    0m1.030s
user    0m0.996s
sys     0m0.028s

This is clearly not acceptable, since I have to do the operation over 20k lines.

So I tried option B:
B - using cut on each line
Code:
lineFile1=`awk "NR==${currLine1}" file1_sorted`
keyFile1=`echo $lineFile1 | cut -d '=' -f 1 | sed -e 's/\s//g'`

That wasn't that much better...
Code:
real    0m0.659s
user    0m0.632s
sys     0m0.028s

Still not acceptable.

So I browsed this forum a bit, read somewhere that "cat foo | bar" wasn't recommended, and changed the code a bit.

I didn't need that lineFile1 there, so there was no point in retrieving it.
I added
Code:
cut -d '=' -f 1 file1_sorted > file1_keys_sorted

before my calls, and I'm now using
Code:
keyFile1=`awk "NR==${currLine1}" file1_keys_sorted`

to get the key.
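Note that cut alone leaves the stray whitespace in the keys. A minimal sketch of a one-pass extraction that also strips it, assuming keys never contain whitespace themselves; extract_keys is a hypothetical name:

```shell
# Sketch only: extract_keys is a hypothetical helper name.
extract_keys() {
    # 1st expression: drop "=" and everything after it.
    # 2nd expression: remove any whitespace left around the key.
    # [[:space:]] is the POSIX spelling; \s is a GNU sed extension.
    sed -e 's/=.*//' -e 's/[[:space:]]//g' "$1" | LC_ALL=C sort
}
```

Usage would be something like: extract_keys file1 > file1_keys_sorted. The sort comes after the cleanup so that a leading tab or space cannot change the order.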

This is way better:
Code:
real    0m0.043s
user    0m0.032s
sys     0m0.008s

The problem is ... it's still taking too much time. From my logs, I'm processing approximately 20 lines per second, which means one loop takes ~0.050 sec (this includes the awk I'm running on the file1_sorted file to get the output). This also means roughly 17 min (about 1,000 s) for a 20k-line input.

Is there some way of speeding up that process? (Clearly, the bottleneck is retrieving the line.)

Thanks!


PS: For some reason, the process is only taking 8% of my CPU at most. Are some of these commands inherently slow? (echo, perhaps?)
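A likely factor is the access pattern rather than any single command: awk "NR==$n" starts a new process for every line and, with no exit, reads the file through to the end each time. A sketch contrasting the two patterns (both function names are hypothetical):

```shell
# Sketch only: both function names are hypothetical.

# Pattern from the script: one awk process per line,
# and each call scans the whole file again.
lines_per_call_awk() {
    total=$(( $(wc -l < "$1") ))
    n=1
    while [ "$n" -le "$total" ]; do
        awk "NR==$n" "$1"
        n=$((n + 1))
    done
}

# Single pass: the shell reads each line once and starts no extra process.
lines_single_pass() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done < "$1"
}
```

Both print the same lines; the first does work proportional to the square of the line count plus one process start per line, while the second is a single linear read.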
# 2  
Old 12-30-2010
Code:
awk -F= 'NR==FNR{a[$1]=$2;next}{if(a[$1]) print $2;}' file1 file2

Output for above file1 and file2 (values of key2 and key4):
Code:
value2_2_1
value2_2_2
value2_4_1

Is this the expected output? If not, pls post the expected one.
# 3  
Old 12-30-2010
awk supports associative arrays. Also, when you run tests, you should use a reasonable dataset, not just 100 lines. The reason is: what you get with the time command does not reflect your algorithm as much as it reflects creating a process, opening files, etc.

for keys common to file1 and file2 (try this on a big file)
Code:
awk -F'='  'FILENAME=="file1" { arr[$1]=$2; next }
               FILENAME=="file2" { if($1 in arr) {print $1, arr[$1], $2} } ' file1 file2 | sort

# 4  
Old 12-30-2010
Quote:
Originally Posted by anurag.singh
Code:
awk -F= 'NR==FNR{a[$1]=$2;next}{if(a[$1]) print $2;}' file1 file2

Output for above file1 and file2 (values of key2 and key4):
Code:
value2_2_1
value2_2_2
value2_4_1

Is this the expected output? If not, pls post the expected one.
expected output would be
Code:
key2=value2_2_1
key2=value2_2_2
key4=value2_4_1

I'm not an awk expert (clearly not :-) ), but this is close.
The remaining issue is that lines in the input files can look like "key1 =value2_1_2" (with that whitespace) or "\tkey2 =value2_2_2", which doesn't match the pattern here.

@Jim McNamara
I'm running my tests on my 10k files :)
However, to get the timings, I run time on the individual command, not on the whole process.
# 5  
Old 12-30-2010
Code:
awk 'NR==FNR{idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);a[$1]++;next;}{b=$0;idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);if(a[$1]) print b;}' file1 file2

This should be able to handle all cases like
Code:
key=value
key =value
key= value
key = value
     key = value
.
.
.
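A more spelled-out variant of the same idea, offered only as a sketch: split on "=", strip whitespace from the key, and test membership with "in". It assumes values contain no "=" and keys contain no whitespace; common_trimmed is a hypothetical name, and unlike the one-liner above it prints the cleaned key rather than the original line:

```shell
# Sketch only: common_trimmed is a hypothetical helper name.
# Assumes values contain no "=" and keys contain no whitespace.
common_trimmed() {
    awk -F= '
        {
            key = $1
            gsub(/[ \t]/, "", key)        # strip spaces and tabs around the key
        }
        NR == FNR { seen[key]; next }     # first file: remember the key
        key in seen { print key "=" $2 }  # second file: print if the key was seen
    ' "$1" "$2"
}
```

Usage would be something like: common_trimmed file1 file2. The value is printed untouched, so a line such as "key = value" keeps the space after "=".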

# 6  
Old 12-30-2010
Hi, try this. I modified Anurag's code:

Code:
awk -F"[ =\t]" 'NR==FNR{a[$1]=$2;next}a[$1] || a[$2] { print}'  file1 file2
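To see why this version checks both $1 and $2, it helps to look at how that field separator splits different line shapes. A small sketch (show_fields is a hypothetical name):

```shell
# Sketch only: show_fields is a hypothetical helper name.
# Every single space, tab or "=" ends a field, so runs of them yield empty fields.
show_fields() {
    awk -F"[ =\t]" '{ printf "NF=%d 1=<%s> 2=<%s>\n", NF, $1, $2 }'
}
```

For example, printf 'key=value\nkey =value\n\tkey =value\n' | show_fields shows the key in $1 for the first two lines and in $2 for the third. It also shows $2 empty when a space precedes "=", so a[$1]=$2 stores an empty string for such a line in file1, and the a[$1] test on it is then false.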

# 7  
Old 12-30-2010
Quote:
Originally Posted by anurag.singh
Code:
awk 'NR==FNR{idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);a[$1]++;next;}{b=$0;idx=index($1,"=");if(idx) $1=substr($1,1,idx-1);if(a[$1]) print b;}' file1 file2

This should be able to handle all cases like
Code:
key=value
key =value
key= value
key = value
     key = value
.
.
.


I just ran it and ... wow ... That was mind-blowing!

Now the sort is the bottleneck :) But that's OK.

Thanks a lot !