A faster equivalent for this sed command

06-12-2012


Hello guys,

I'm cleaning out big XML files (we're talking about 1GB at least), most of them contain words written in a non-latin alphabet.

The command I'm using is so slow it's not even funny:

Code:

cat $1 | sed -e :a -e 's/&lt;[^&gt;]*&gt;//g;/&lt;/N;//ba;s/</ /g;s/>/ /g;s/_//g;s/-//g;s/–//g;s/(//g;s/)//g;s/,//g' | tr " " "\n" | sort | uniq >

I've tried to use tr -d but it breaks my files for some reason... some of my non-latin characters are completely messed up.

Do you guys know how to optimize this command to make it a bit faster? Could I use awk to get the exact same result I get with the sed command above?

Thank you very much !

Last edited by Scrutinizer; 06-12-2012 at 12:50 AM..

bobylapointe


06-12-2012


I don't get why you would need a unique sort, and the sed substitutions are not removing non-Latin characters. Can you give a sample of input and desired output?

Scrutinizer


06-12-2012


Quote:

Originally Posted by bobylapointe

I'm cleaning out big XML files (we're talking about 1GB at least), most of them contain words written in a non-latin alphabet.

The command I'm using is so slow it's not even funny:

For clarity, i'll decompose your sed script first:

Code:

cat $1 |\
sed -e :a -e 's/&lt;[^&gt;]*&gt;//g
                 /&lt;/N
                 //ba
                 s/</ /g
                 s/>/ /g
                 s/_//g
                 s/-//g
                 s/–//g
                 s/(//g
                 s/)//g
                 s/,//g' |\
tr " " "\n" |\
sort |\
uniq >

The first is: i can't understand what lines 2 and 3 in your sed-statement do. They look like syntax errors to me. Please clarify.

Then: every call in a pipeline will have to shovel through the whole data anew, so it might help to minimize the number of program calls in your pipeline.

Code:

cat <file> | sed ...

is a useless use of cat. Use

Code:

sed '<commands>' <file>

instead.

Then the "tr"-command. This could be done inside sed too, yes? You want to change all space chars to newlines:

Code:

sed 's/<spc>/<ENTER>/g' infile > outfile

Replace "<spc>" with a literal space char and "<ENTER>" with a literal newline. To enter special characters like the newline char, press "<CTRL>-V" in vi first. The result will look like "^M" (one character), but is in fact a quoted newline.

Code:

<stream> | sort | uniq

might as well be shortened to

Code:

<stream> | sort -u
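
As a rough illustration of that shortening, the sketch below runs both forms on a made-up word list and keeps the results for comparison. The sample words and the `LC_ALL=C` setting are assumptions added here, not part of the original pipeline; byte-wise collation usually makes sort faster and is enough when only uniqueness matters, but it changes the output order for non-ASCII text.

```shell
# Hypothetical sample: one word per line, with duplicates.
words=$(printf '%s\n' beta alpha beta gamma alpha)

# Two-process form, as in the original pipeline.
long_form=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq)

# Single-process form: sort removes the duplicates itself.
short_form=$(printf '%s\n' "$words" | LC_ALL=C sort -u)

printf '%s\n' "$short_form"
```

Both variables should hold the same three lines, so `sort -u` can replace `sort | uniq` here without changing the result.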

Let's get to the sed command itself. You want some characters to be replaced by space chars and some to be deleted (line 4 to the end of your sed script). This could be stated more simply:

Code:

sed 's/[<>]/ /g; s/[_(),–-]//g'

I am not sure if this speeds things up, but i presume it does: sed reads in one line after the other, applying one command after the other to each line. It follows that with fewer commands it doesn't have to go through the pattern space as often and should therefore be faster. Still, i haven't tested this, so this is just a presumption. Find out for yourself; you might use a somewhat smaller file for testing.

Still, i would be interested in your findings, so please report back when you carried out these tests and tell us what you found.
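
One possible way to set up such a test is sketched below: it first checks that the many-command and the bracket-expression variants give the same output on a small sample, which is worth confirming before timing anything. The sample file, its contents, and the function names `many` and `few` are made up for illustration, and only ASCII characters are used so the sketch does not depend on the locale.

```shell
# Hypothetical sample; in practice take a slice of the real data,
# e.g.  head -n 100000 big.xml > sample.xml
sample=$(mktemp)
printf '%s\n' '<a>foo_bar</a>' '<b>(baz),qux-quux</b>' > "$sample"

# Variant 1: one s command per character (ASCII subset of the original script).
many() { sed 's/</ /g;s/>/ /g;s/_//g;s/-//g;s/(//g;s/)//g;s/,//g' "$1"; }

# Variant 2: two bracket expressions; the hyphen goes last so it is literal.
few() { sed 's/[<>]/ /g; s/[_(),-]//g' "$1"; }

out_many=$(many "$sample")
out_few=$(few "$sample")

# Report whether the two variants agree on the sample.
if [ "$out_many" = "$out_few" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi

rm -f "$sample"
```

Once the outputs match, each sed call can be run against the larger test file under the shell's `time` to see whether the shorter script is actually faster.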

I hope this helps.

bakunin


06-12-2012


Quote:

Originally Posted by bakunin

Then the "tr"-command. This could be done inside sed too, yes? You want to change all space chars to newlines:

Code:

sed 's/<spc>/<ENTER>/g' infile > outfile

To insert a newline with sed's s command, the newline needs to be preceded by a backslash.

In this case, a simple, global substitution of space with newline, it's easier to use the y command, which supports the \n sequence:
y/ /\n/

A small nit: ^M is a carriage return character. ^J is newline/linefeed
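
A minimal sketch of the points above, using a made-up input string: the s command with a backslash followed by a real newline, the y command with `\n`, and tr for comparison. The variable names are only for illustration.

```shell
input='one two three'

# s command: the replacement is a backslash followed by a literal newline.
with_s=$(printf '%s\n' "$input" | sed 's/ /\
/g')

# y command: \n stands for a newline.
with_y=$(printf '%s\n' "$input" | sed 'y/ /\n/')

# tr, as in the original pipeline, for comparison.
with_tr=$(printf '%s\n' "$input" | tr ' ' '\n')

printf '%s\n' "$with_y"
```

All three variables should end up with one word per line.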

Regards,
Alister

alister


06-12-2012


Thanks bakunin, it definitely helped. The syntax errors you pointed out were errors indeed. That's pretty much what made the whole thing not work as expected.

It's still a bit slow, but I can live with that.

Thanks for your explanations.

Btw: I did not manage to remove / with sed, even when escaping the character... I had to use tr instead.
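
On the slash: inside `s/.../.../` the slash is the delimiter, so it has to be escaped as `\/`, or a different delimiter has to be chosen. The thread does not show the command that failed, so the following is only a sketch of two forms that should work, with a made-up input string:

```shell
input='a/b/c'

# Escape the slash so it is not taken as the s delimiter.
escaped=$(printf '%s\n' "$input" | sed 's/\///g')

# Or pick another delimiter, so the slash needs no escaping.
alt_delim=$(printf '%s\n' "$input" | sed 's|/||g')

printf '%s\n' "$escaped" "$alt_delim"
```

The second form tends to be easier to read when the pattern itself contains slashes.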

bobylapointe
