

UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !!

    #1  
Old 06-11-2012
Registered User
 
Join Date: Jun 2012
Posts: 16
Thanks: 9
Thanked 0 Times in 0 Posts
A faster equivalent for this sed command

Hello guys,

I'm cleaning out big XML files (we're talking about 1GB at least), most of them contain words written in a non-latin alphabet.

The command I'm using is so slow it's not even funny:


Code:
cat $1 | sed -e :a -e 's/&lt;[^&gt;]*&gt;//g;/&lt;/N;//ba;s/</ /g;s/>/ /g;s/_//g;s/-//g;s/–//g;s/(//g;s/)//g;s/,//g' | tr " " "\n" | sort | uniq >


I've tried to use tr -d but it breaks my files for some reason... some of my non-latin characters are completely messed up.

Do you guys know how to optimize this command to make it a bit faster? Could I use awk to get the exact same result I get with the sed command above?

Thank you very much !

Last edited by Scrutinizer; 06-11-2012 at 11:50 PM..
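On the tr -d problem mentioned in the post: one plausible cause, assuming the files are UTF-8 and the tr in use works byte by byte (GNU tr does), is that a multi-byte character such as the en dash reaches tr as three separate bytes. Deleting those bytes also removes them from every other character that contains them. The sketch below only inspects the bytes; the helper name byte_count is made up for illustration.

```shell
# Hypothetical helper: count the bytes a string occupies.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Octal escapes keep this independent of the script's own encoding.
en_dash=$(printf '\342\200\223')   # U+2013 in UTF-8: e2 80 93
em_dash=$(printf '\342\200\224')   # U+2014 in UTF-8: e2 80 94

byte_count "$en_dash"                  # three bytes, not one
printf '%s' "$en_dash" | od -An -tx1
printf '%s' "$em_dash" | od -An -tx1   # shares e2 and 80 with the en dash
```

A byte-oriented tr -d '–' would therefore strip e2, 80 and 93 wherever they occur, which could explain the damaged non-latin characters.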
    #2  
Old 06-11-2012
Scrutinizer
Moderator
 
Join Date: Nov 2008
Location: Amsterdam
Posts: 7,350
Thanks: 144
Thanked 1,755 Times in 1,592 Posts
I don't get why you would need a unique sort and the sed substitutions are not removing non-latin characters. Can you give a sample of input and desired output?
    #3  
Old 06-12-2012
bakunin (Forum Staff)
Bughunter Extraordinaire
 
Join Date: May 2005
Location: In the leftmost byte of /dev/kmem
Posts: 3,297
Thanks: 27
Thanked 454 Times in 353 Posts
Quote:
Originally Posted by bobylapointe
I'm cleaning out big XML files (we're talking about 1GB at least), most of them contain words written in a non-latin alphabet.

The command I'm using is so slow it's not even funny:
For clarity i decompose your sed-script first:


Code:
cat $1 |\
sed -e :a -e 's/&lt;[^&gt;]*&gt;//g
                 /&lt;/N
                 //ba
                 s/</ /g
                 s/>/ /g
                 s/_//g
                 s/-//g
                 s/–//g
                 s/(//g
                 s/)//g
                 s/,//g' |\
tr " " "\n" |\
sort |\
uniq >

The first is: i can't understand what lines 2 and 3 in your sed-statement do. They look like syntax errors to me. Please clarify.
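For reference, a sketch of what those two commands do when the delimiters are literal angle brackets (an assumption; the script in this thread shows them as entities). In that form they belong to a common sed idiom for tags that span several lines: /</N appends the next input line while an unclosed "<" is left, and //ba reuses the last regular expression ("<") and branches back to label a. The function name strip_tags is made up for illustration.

```shell
# Sketch: strip tags that may span several lines (literal < and > assumed).
strip_tags() {
    sed -e :a -e 's/<[^>]*>//g
/</N
//ba'
}

# The opening tag is split across two input lines.
printf 'one <b\nattr="x">two</b> three\n' | strip_tags
```

Whether this is what the original script intended cannot be told from the thread alone.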

Then: every call in a pipeline will have to shovel through the whole data anew, so it might help to minimize the number of program calls in your pipeline.


Code:
cat <file> | sed ...

is a useless use of cat. Use


Code:
sed '<commands>' <file>

instead.
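A small sketch of the two forms side by side, using a throwaway file created with mktemp (the sample content is made up). Both produce the same output; the second simply lets sed open the file itself.

```shell
# Throwaway sample file; its name is generated by mktemp.
sample=$(mktemp)
printf 'a_b\nc_d\n' > "$sample"

with_cat=$(cat "$sample" | sed 's/_//g')   # extra process, extra copy through a pipe
direct=$(sed 's/_//g' "$sample")           # sed reads the file directly

[ "$with_cat" = "$direct" ] && echo "same output"
rm -f "$sample"
```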

Then the "tr"-command. This could be done inside sed too, yes? You want to change all space chars to newlines:


Code:
sed 's/<spc>/<ENTER>/g' infile > outfile

Replace "<spc>" with a literal space char and "^M" with <ENTER>. To enter special characters like the newline char press "<CTRL>-V" in vi before. The result will look like "^M" (one character), but in fact be a quoted newline.


Code:
<stream> | sort | uniq

might as well be shortened to


Code:
<stream> | sort -u
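A quick check of that equivalence on a made-up word list (the function name words is only for illustration):

```shell
# Made-up input with duplicates.
words() { printf 'pear\napple\npear\nfig\napple\n'; }

two_steps=$(words | sort | uniq)   # sort, then drop adjacent duplicates
one_step=$(words | sort -u)        # sort and deduplicate in one process

[ "$two_steps" = "$one_step" ] && printf '%s\n' "$one_step"
```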


Let's get to the sed-command itself. You want some characters to be replaced by space chars, some to be deleted (line 4-end of your sed-script). This could be stated more simply:


Code:
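sed 's/[<>]/ /g; s/[_–(),-]//g'

Note that the plain hyphen has to come last inside the brackets. Written as "_-–" it would be read as a range from "_" to "–", which would also match the lowercase letters and much more.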

I am not sure if this speeds things up, but i presume it does: sed reads in one line after the other, applying one command after the other to each line. It follows that with fewer commands it doesn't have to go through the pattern space as often and should therefore be faster. Still, i haven't tested this, so this is just a presumption. Find out yourself; you might use a somewhat smaller file for testing.

Still, i would be interested in your findings, so please report back when you carried out these tests and tell us what you found.
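One way such a test could be set up, as a sketch only: the sample data is made up, the function names are hypothetical, and the en dash is left out of both variants so the sketch does not depend on the locale. It first checks that both variants give the same output, since a speed comparison between scripts that disagree would be meaningless.

```shell
# Variant A: one s command per character, as in the original script (ASCII subset).
many_commands() {
    sed 's/</ /g;s/>/ /g;s/_//g;s/-//g;s/(//g;s/)//g;s/,//g'
}

# Variant B: two bracket expressions; the hyphen is last so it stays literal.
few_commands() {
    sed 's/[<>]/ /g;s/[_(),-]//g'
}

# Small made-up sample; for real timing, cut a slice from the real file instead.
sample=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
    printf '<w id="%s">foo_bar-baz (qux), end</w>\n' "$i"
    i=$((i + 1))
done > "$sample"

# Check that both variants agree before comparing their speed.
a=$(many_commands < "$sample" | cksum)
b=$(few_commands < "$sample" | cksum)
[ "$a" = "$b" ] && echo "outputs match"

# Then, in an interactive shell:
#   time many_commands < "$sample" > /dev/null
#   time few_commands  < "$sample" > /dev/null
rm -f "$sample"
```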

I hope this helps.

bakunin
    #4  
Old 06-12-2012
alister (Forum Advisor)
Registered User
 
Join Date: Dec 2009
Posts: 2,600
Thanks: 123
Thanked 717 Times in 600 Posts
Quote:
Originally Posted by bakunin
Then the "tr"-command. This could be done inside sed too, yes? You want to change all space chars to newlines:


Code:
sed 's/<spc>/<ENTER>/g' infile > outfile

Replace "<spc>" with a literal space char and "^M" with <ENTER>. To enter special characters like the newline char press "<CTRL>-V" in vi before. The result will look like "^M" (one character), but in fact be a quoted newline.
To insert a newline with sed's s command, the newline needs to be preceded by a backslash.

In this case, a simple, global substitution of space with newline, it's easier to use the y command, which supports the \n sequence:
y/ /\n/
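Both forms side by side, as a minimal sketch (the function names are made up for illustration). In the s variant the line after the backslash must start in the first column, because anything before the closing slash would become part of the replacement.

```shell
# s command: the newline in the replacement is escaped with a backslash.
split_with_s() {
    sed 's/ /\
/g'
}

# y command: transliterates each space to a newline; \n is understood here.
split_with_y() {
    sed 'y/ /\n/'
}

printf 'alpha beta gamma\n' | split_with_s
printf 'alpha beta gamma\n' | split_with_y
```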

A small nit: ^M is a carriage return character. ^J is newline/linefeed

Regards,
Alister
    #5  
Old 06-12-2012
Registered User
 
Join Date: Jun 2012
Posts: 16
Thanks: 9
Thanked 0 Times in 0 Posts
Thanks bakunin, it definitely helped. The syntax errors you pointed out were errors indeed. That's pretty much what made the whole thing not work as expected.


It's still a bit slow, but I can live with that

Thanks for your explanations.

Btw: I did not manage to remove / with sed, even when escaping the character... I had to use tr instead.
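The thread does not show why the escaped slash failed; shell quoting around the script is one guess. For what it's worth, here are three ways that should remove "/" with sed when the script is single-quoted. This is a sketch, and the function names are made up for illustration.

```shell
# Escaped delimiter: works, but hard to read.
strip_slash_escaped() { sed 's/\///g'; }

# Different delimiter: any character may follow s, here a comma.
strip_slash_delim() { sed 's,/,,g'; }

# Bracket expression with | as delimiter, handy when several characters go at once.
strip_slash_bracket() { sed 's|[/_]||g'; }

printf 'a/b/c_d\n' | strip_slash_escaped
printf 'a/b/c_d\n' | strip_slash_delim
printf 'a/b/c_d\n' | strip_slash_bracket
```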