My files are about 10k lines each.
The keys are strings that contain no whitespace; the values are plain text strings that contain no "=" symbol.
The purpose of the script is to get, from file2, the value of each key that appears in both file1 and file2.
The first part of the script sorts file1 and file2, so that the comparison pass is O(n) rather than O(n^2) [the sort itself could be argued over, but that's not the point right now, since it's not the bottleneck].
Then, basically, I read the (sorted) files line by line and check whether the current lines have the same key. If they do, I save the value to my output. Otherwise, I get the next line from the file whose current key is the smaller one.
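The walk over the two sorted files can be sketched roughly as follows. This is only an illustration, not the actual script: the function name merge_common is made up, both inputs are assumed to be clean key=value lines with unique keys, newline-terminated, and sorted on the key with LC_ALL=C sort -t= -k1,1.

```shell
#!/usr/bin/env bash
# Sketch only: merge_common is a hypothetical helper name.
# Prints "key=value" from file2 for each key found in both sorted files.
# Assumes unique keys and inputs sorted with: LC_ALL=C sort -t= -k1,1
merge_common() (
    LC_ALL=C                      # keep comparisons consistent with the sort
    exec 3<"$1" 4<"$2"            # one file descriptor per input file
    IFS='=' read -r k1 v1 <&3 || exit 0
    IFS='=' read -r k2 v2 <&4 || exit 0
    while :; do
        if [ "$k1" = "$k2" ]; then
            # same key in both files: keep the value from file2
            printf '%s=%s\n' "$k2" "$v2"
            IFS='=' read -r k1 v1 <&3 || break
            IFS='=' read -r k2 v2 <&4 || break
        elif [ "$k1" \< "$k2" ]; then
            IFS='=' read -r k1 v1 <&3 || break   # file1 is behind, advance it
        else
            IFS='=' read -r k2 v2 <&4 || break   # file2 is behind, advance it
        fi
    done
)
```

Called as merge_common file1_sorted file2_sorted, each file is read exactly once and no external command is started inside the loop.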
The problem here is getting the keys. After running the script once, I noticed the files were generated with random whitespace around the key (before or after it, ahead of the "=" symbol). I can't change the generator, so I had to change the script.
I tried three variations of it:
A - sed on the line:
This sed gets all the non-whitespace characters to the left of the equals sign.
As you might imagine, that took an awful lot of time.
This is clearly not acceptable, since I have to do the operation over 20k lines.
So I tried option B:
B - using cut on each line
That wasn't that much better...
Still not acceptable.
So I browsed this forum a bit, read somewhere that "cat foo | bar" isn't recommended, and changed the code a bit.
I didn't need that lineFile1 there, so there was no point in retrieving it.
I added
before my calls, and I'm now using
to get the key.
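The exact commands are not preserved here. As a rough sketch of the idea, assuming IFS is set to space, tab and "=" for the read, the key can be obtained with the shell's built-in read alone, so no sed, cut or awk process is started per line. The function name print_keys is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: print_keys is a hypothetical helper name.
# Prints the key of every line of a file using only shell built-ins.
print_keys() {
    # IFS = space, tab and "=": read drops the blanks around the key and
    # splits at the "=" in one step; the remainder of the line goes to $rest.
    ifs_chars=$(printf ' \t=')
    while IFS=$ifs_chars read -r key rest; do
        if [ -n "$key" ]; then
            printf '%s\n' "$key"
        fi
    done < "$1"
}
```

This relies on the keys containing no whitespace, as stated earlier; a key with an embedded blank would be cut at that blank.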
This is way better:
The problem is ... it's still taking too much time. From my logs, I'm processing approximately 20 lines per second, which means one loop takes ~0.050 sec (this includes the awk I'm running on the file1_sorted file to get the output). This also means about 1000 sec (roughly 17 min) for a 20k-line input.
Is there some way of speeding up that process? (clearly, the bottleneck is getting the key out of each line)
Thanks!
PS: For some reason, the process is only taking 8% of my CPU at max. Are there some commands that are slow? (echo, perhaps?)
awk supports associative arrays. Also, when you run tests, you should use a reasonable dataset, not just 100 lines. The reason is: what you get with the time command does not reflect your algorithm as much as it reflects creating a process, opening files, etc.
for keys common to file1 and file2 (try this on a big file)
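The command itself is not preserved here. A minimal sketch of the associative-array idea, assuming clean key=value lines with no blanks around the key and a non-empty file1: the first pass stores the keys of file1 in an array, the second pass prints the lines of file2 whose key was stored. The names common_values and seen are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch only: common_values is a hypothetical helper name.
# Prints key=value from file2 for keys that also appear in file1.
# No sorting is needed; awk keeps file1's keys in an associative array.
# Assumes file1 is not empty (the NR == FNR test relies on that).
common_values() {
    awk -F'=' '
        NR == FNR { seen[$1] = 1; next }   # first file: remember each key
        $1 in seen                         # second file: print matching lines
    ' "$1" "$2"
}
```

Both files are read once in a single awk process, so the per-line cost of starting sed or cut disappears entirely.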
Output for above file1 and file2 (values of key2 and key4):
Is this the expected output? If not, please post the expected one.
expected output would be
I'm not an awk-expert (clearly not :-) ), but this is close.
The only problem is that a line in one of the input files can look like "key1 =value2_1_2" (with that whitespace) or "\tkey2 =value2_2_2", which doesn't match the pattern here.
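One way to cope with those stray blanks, sketched under the assumption that spaces or tabs may appear before the key and between the key and the "=": split each line at its first "=" and trim the key before using it as the array index. The function name common_values_ws is made up, and file1 is assumed to be non-empty.

```shell
#!/usr/bin/env bash
# Sketch only: common_values_ws is a hypothetical helper name.
# Like a plain key lookup, but tolerates blanks around the key.
common_values_ws() {
    awk '
        {
            pos = index($0, "=")               # position of the first "="
            if (pos == 0) next                 # skip lines without "="
            key = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim blanks around the key
            val = substr($0, pos + 1)          # value is kept untouched
        }
        NR == FNR { seen[key] = 1; next }      # first file: remember the key
        key in seen { print key "=" val }      # second file: print matches
    ' "$1" "$2"
}
```

The value is printed exactly as it follows the "=", so any blanks inside it are preserved.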
@Jim McNamara
I'm running my tests on my 10k files
However, in order to get the results of the time queries, I run time on the exact query, not on the whole process.