sed performance


 
Top Forums UNIX for Advanced & Expert Users sed performance
# 1  
Old 03-11-2008
sed performance

hello experts,

i am trying to replace a line in a 100+ MB text file. the structure is similar to the passwd file: id:value1:value2 and so on. using the sed command

Code:
sed -i 's/\(123\):\([^:]\{1,\}\):/\1:bar:/' data.txt

works nicely: the line "123:foo:" is replaced by "123:bar:". however, it takes about 3 seconds.

using the grep command with environment variable LC_ALL set to "C" brings me the result instantly:

Code:
export LC_ALL=C
grep -Eh -- '(123):([^:]{1,}):' data.txt

now, the occurrences i'm looking for are all near the end of the file, so the i/o loss could be avoided: the whole file would not need to be rewritten. does anyone have an idea how sed could be accelerated? several ideas pop into my mind:
  • make sed somehow seek to the byte offset grep can deliver rapidly
  • only pipe the tail starting at the occurrence to sed and somehow write the tail back to the file
  • find out why sed is so slow, maybe it is matching in unicode (without LC_ALL=C, grep is slow too)
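the first two ideas above can be sketched together in shell. this is only a sketch under assumptions: GNU userland (`grep -b -m1`, `truncate` from coreutils), and the small `data.txt` created here stands in for the real file. the idea is to let grep report the byte offset of the match, edit only the tail from that offset, then truncate the original and append the edited tail.

```shell
#!/bin/sh
export LC_ALL=C

# small stand-in for the real 100+ MB file, in the thread's format
printf 'a:one:\n123:foo:\nz:two:\n' > data.txt

# byte offset of the first matching line; grep -b prints "offset:line"
off=$(grep -b -m1 '^123:' data.txt | cut -d: -f1)

if [ -n "$off" ]; then
    # edit only the tail starting at the match (tail -c +N is 1-based)
    tail -c +$((off + 1)) data.txt \
        | sed 's/^\(123\):[^:]\{1,\}:/\1:bar:/' > tail.tmp
    # truncate the original at the offset, then append the edited tail
    truncate -s "$off" data.txt
    cat tail.tmp >> data.txt
    rm -f tail.tmp
fi
```

on a file where the match really sits near the end, this only rewrites the tail instead of the whole file; the trade-off is that it is no longer atomic the way sed's temp-file rename is.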

any hints or leads would be greatly appreciated.

cheers,

-f3k.
# 2  
Old 03-11-2008
take note that your sed syntax with -i makes an in-place edit, while grep doesn't rewrite the file at all. you can consider that part of the reason why it's slow.
if you roughly know where the pattern near the end is, you can give address ranges
Code:
# e.g. apply the substitution only from line 2000 to end of file
sed -i '2000,$ s/old/new/' file
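
a sketch combining the two commands from this thread: let grep (fast with LC_ALL=C) find the line number, then hand sed an address range so it skips the regex work on everything before it. note that -i still rewrites the whole file, so only the matching effort is saved; the `data.txt` here is a small stand-in.

```shell
#!/bin/sh
export LC_ALL=C
printf 'a:one:\n123:foo:\nz:two:\n' > data.txt   # stand-in file

# first matching line number; grep -n prints "lineno:line"
n=$(grep -n -m1 '^123:' data.txt | cut -d: -f1)

# substitute only from that line to end of file
[ -n "$n" ] && sed -i "${n},\$ s/^\(123\):[^:]\{1,\}:/\1:bar:/" data.txt
```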

# 3  
Old 03-11-2008
thanks for the rapid answer!

i'm not entirely sure why sed is slow; it's either because the search takes longer, or because it writes a lot of data, as you stated. i already tried the address parameter with the line number; it doesn't help much, so i guess it is indeed the fact that it writes the whole file back to disk.

i just did a quick test where i read the byte offset from grep, fseek to that position, read the rest from there, do the replace, and write the result back to the same offset. the whole process took 25 ms. however, it's not bash.
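for what it's worth, the fseek-and-write-back test described above can be approximated in shell with `dd conv=notrunc`, which overwrites bytes in place without truncating the file. this sketch assumes the replacement has the same length as the match ("foo" → "bar" here), otherwise the bytes after it would be corrupted; `data.txt` is again a small stand-in.

```shell
#!/bin/sh
export LC_ALL=C
printf 'a:one:\n123:foo:\nz:two:\n' > data.txt   # stand-in file

# byte offset of the matching line
off=$(grep -b -m1 '^123:' data.txt | cut -d: -f1)

# extract just that line, apply the same-length replacement
line=$(tail -c +$((off + 1)) data.txt | head -n 1)
patched=$(printf '%s\n' "$line" | sed 's/^\(123\):[^:]\{1,\}:/\1:bar:/')

# overwrite the line's bytes in place; conv=notrunc keeps the rest intact
printf '%s' "$patched" | dd of=data.txt bs=1 seek="$off" conv=notrunc 2>/dev/null
```

this only touches a handful of bytes instead of rewriting 100 MB, which matches the 25 ms result reported above.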
# 4  
Old 03-11-2008
seriously, a 100 MB+ file is not big, so i think you shouldn't really worry about it... unless it's really very time critical or something.
# 5  
Old 03-11-2008
still, there's a difference between 3 s and 25 ms.
i don't consider it very large either, but well... it actually could be time critical, yes. maybe there is a way to tell sed to write only from the point where it replaced, and not the whole file?
# 6  
Old 03-11-2008
The reason is how sed works: it never changes the file it is working on but, by default, puts its results on <stdout>. The "-i" option, as ghostdog74 has pointed out, is a non-standard extension to sed, and it probably works by producing an intermediary file and then replacing the original.

So the difference between what sed has to do and grep has to do is:

Code:
grep                          sed
----                          ---
read the file                 read the file
parse it                      parse it / change it
output to <stdout>            output to temp file
-                             replace original file with temp file

You could probably "even the score" by having sed put its output on <stdout> too and comparing the times then, or, even better (as it eliminates the output delay completely), directing both grep's and sed's output to /dev/null and comparing the times.
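Concretely, the fair comparison would look something like this; `data.txt` is a small stand-in for the real file, and the patterns are the ones from this thread:

```shell
#!/bin/sh
export LC_ALL=C
printf 'a:one:\n123:foo:\nz:two:\n' > data.txt   # stand-in file

# both commands pay only for reading and matching, not for writing a file
time sed 's/^\(123\):[^:]\{1,\}:/\1:bar:/' data.txt > /dev/null
time grep -E '^(123):([^:]{1,}):' data.txt > /dev/null
```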

I hope this helps.
# 7  
Old 03-11-2008
A totally different perspective

Could it be that the first time you tried to edit the file you read it from disk, and the second time it was in cache, so there was no I/O? It will take a couple of seconds to read a 100 MB file off disk.
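A quick way to check the cache theory is to time the same read twice: the first pass may hit disk, the second should be served from the page cache and be much faster. (Forcing a cold cache on Linux needs root: `sync; echo 3 > /proc/sys/vm/drop_caches`.) The tiny file here is a stand-in; on a real 100 MB file the difference is obvious.

```shell
#!/bin/sh
printf 'hello\n' > data.txt   # stand-in for the real 100 MB file

# first read: possibly from disk; second read: from the page cache
time wc -c data.txt
time wc -c data.txt
```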

If this is something you've never paid any attention to before, you should run a tool like collectl (on SourceForge), which can show you what's going on even at the sub-second level on your system. It's amazing how often people just look at how long an operation takes to perform vs what the system is doing. With collectl you'll also be able to watch the CPU and memory during your tests...

-mark