UNIX for Dummies Questions & Answers
#1
A faster equivalent for this sed command
Hello guys, I'm cleaning out big XML files (we're talking about 1GB at least); most of them contain words written in a non-Latin alphabet. The command I'm using is so slow it's not even funny:
Code:
cat $1 | sed -e :a -e 's/<[^>]*>//g;/</N;//ba;s/</ /g;s/>/ /g;s/_//g;s/-//g;s/–//g;s/(//g;s/)//g;s/,//g' | tr " " "\n" | sort | uniq >
I've tried to use tr -d but it breaks my files for some reason... some of my non-Latin characters are completely messed up. Do you guys know how to optimize this command to make it a bit faster? Could I use awk to get the exact same result I get with the sed command above? Thank you very much!
Last edited by Scrutinizer; 06-11-2012 at 11:50 PM.
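A possible explanation for the tr -d damage, offered as an assumption since the failing command isn't shown: some tr implementations (GNU tr among them) work on single bytes, so asking tr to delete a multibyte character such as – removes each of its bytes wherever they occur, including inside other characters. Deleting only ASCII characters with tr is safe for UTF-8 text, and the en dash can be left to sed, which matches it as a whole byte sequence. A minimal sketch, assuming UTF-8 input; the function name strip_punct is made up for illustration:

```shell
# Hypothetical helper (name not from the thread); assumes UTF-8 input on stdin.
strip_punct() {
    # tr only deletes ASCII characters here. Bytes of multibyte UTF-8
    # characters are all >= 0x80, so they cannot be hit by accident.
    # The trailing hyphen in the set is a literal hyphen, not a range.
    # The en dash is removed by sed as one complete sequence.
    tr -d '_(),-' | sed 's/–//g'
}

# Example call with Cyrillic text
printf 'при_вет (мир)–да\n' | strip_punct
```

The space between the two words survives, so a later step can still split on it.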
#2
I don't get why you would need a unique sort and the sed substitutions are not removing non-latin characters. Can you give a sample of input and desired output?
#3
Quote:
Code:
cat $1 |\
sed -e :a -e 's/<[^>]*>//g
/</N
//ba
s/</ /g
s/>/ /g
s/_//g
s/-//g
s/–//g
s/(//g
s/)//g
s/,//g' |\
tr " " "\n" |\
sort |\
uniq >
The first is: I can't understand what lines 2 and 3 in your sed statement do. They look like syntax errors to me. Please clarify.
Then: every program in a pipeline has to shovel through the whole data anew, so it might help to minimize the number of program calls in your pipeline.
Code:
cat <file> | sed ...
is a useless use of cat. Use
Code:
sed '<commands>' <file>
instead.
Then the "tr" command. This could be done inside sed too, yes? You want to change all space chars to newlines:
Code:
sed 's/<spc>/<ENTER>/g' infile > outfile
Replace "<spc>" with a literal space char and "<ENTER>" with a literal newline. To enter special characters like the newline char, press "<CTRL>-V" in vi first. The result will look like "^M" (one character), but in fact be a quoted newline.
Code:
<stream> | sort | uniq
might as well be shortened to
Code:
<stream> | sort -u
Let's get to the sed command itself. You want some characters to be replaced by space chars and some to be deleted (line 4 to the end of your sed script). This could be stated more simply:
Code:
sed 's/[<>]/ /g; s/[_–(),-]//g'
I am not sure if this speeds things up, but I presume it does: sed reads in one line after the other, applying one command after the other to each line. It follows that with fewer commands it doesn't have to go through the pattern space as often and should therefore be faster. Still, I haven't tested this, so this is just a presumption. Find out yourself; you might use a somewhat smaller file for testing. I would be interested in your findings, so please report back when you have carried out these tests and tell us what you found.
I hope this helps.
bakunin
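Putting the suggestions above together (no cat, fewer substitutions, sort -u) might look like the sketch below. It assumes UTF-8 input and tags that do not span lines, so the multi-line loop from the original command is left out; clean_words is a made-up name. One detail worth noting: inside a bracket expression a hyphen between two characters denotes a range, so the literal hyphen goes last, and the en dash gets its own substitution so that the result does not depend on the locale.

```shell
# Hypothetical helper (not from the thread).
# Assumption: every tag opens and closes on the same line.
clean_words() {
    # $1 is the input file; sed reads it directly, so no cat is needed.
    sed -e 's/<[^>]*>//g' \
        -e 's/[<>]/ /g' \
        -e 's/–//g' \
        -e 's/[_(),-]//g' "$1" |
    tr ' ' '\n' |   # one word per line
    sort -u         # same result as sort | uniq
}

# Example call on a small throwaway file
sample=$(mktemp)
printf '<a>foo_bar</a> baz, (qux) foo-bar\n' > "$sample"
clean_words "$sample"
rm -f "$sample"
```

Timing both variants on a smaller file, as suggested, is the only way to know whether the merged substitutions help on a given system.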
#4
Quote:
In this case, a simple, global substitution of space with newline, it's easier to use the y command, which supports the \n sequence:
y/ /\n/
A small nit: ^M is a carriage return character. ^J is newline/linefeed.
Regards, Alister
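The y command can be tried on its own like this. It is only a small sketch, and split_words is a hypothetical name. As far as I know POSIX defines \n inside y, so this should not depend on GNU extensions:

```shell
# Hypothetical helper: turn every space into a newline using sed's y command.
split_words() {
    sed 'y/ /\n/'
}

# Example call
printf 'one two three\n' | split_words
```

Used as the last command of the existing sed script, this would make the separate tr step unnecessary.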
#5
Thanks bakunin, it definitely helped. The syntax errors you pointed out were errors indeed. That's pretty much what made the whole thing not work as expected.
It's still a bit slow, but I can live with that. Thanks for your explanations.
Btw: I did not manage to remove / with sed, even when escaping the character... I had to use tr instead.
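On the slash problem: the cause isn't shown in the thread, but two forms that normally work are escaping the slash inside single quotes, or picking a different delimiter for the s command so the slash needs no escaping at all. A sketch with made-up function names:

```shell
# Hypothetical helpers; both delete every "/" from stdin.

# Variant 1: keep "/" as the delimiter and escape the slash to be matched.
strip_slashes_escaped() {
    sed 's/\///g'
}

# Variant 2: use "|" as the delimiter, so "/" is an ordinary character.
strip_slashes_altdelim() {
    sed 's|/||g'
}

# Example call
printf 'a/b/c\n' | strip_slashes_altdelim
```

The second form is usually easier to read when the pattern contains slashes.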