UNIX for Dummies Questions & Answers: post 302654429 by bobylapointe on Monday 11th of June 2012 11:19:05 PM
A faster equivalent for this sed command

Hello guys,

I'm cleaning up big XML files (at least 1 GB each), most of which contain words written in a non-Latin alphabet.

The command I'm using is so slow it's not even funny:

Code:
cat $1 | sed -e :a -e 's/&lt;[^&gt;]*&gt;//g;/&lt;/N;//ba;s/</ /g;s/>/ /g;s/_//g;s/-//g;s/–//g;s/(//g;s/)//g;s/,//g' | tr " " "\n" | sort | uniq >


I've tried to use tr -d, but it breaks my files for some reason: some of my non-Latin characters get completely messed up.
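
A likely explanation for the tr breakage, assuming the files are UTF-8 and tr is a byte-oriented implementation such as GNU tr: the en dash (–) is stored as three bytes (E2 80 93), so tr -d deletes each of those bytes wherever it occurs, including inside other characters. The sketch below shows the effect on a euro sign (bytes E2 82 AC). The function names are made up for this demonstration, and LC_ALL=C is set only so that the demo behaves the same way on every system.

```shell
# Hypothetical helpers, assuming UTF-8 data and a byte-oriented tr (e.g. GNU tr).

# Byte-wise deletion: removes every E2, 80 and 93 byte, not just whole en dashes.
strip_dash_tr() { LC_ALL=C tr -d '–'; }

# Sequence-wise deletion: sed removes the three bytes only when they appear together.
strip_dash_sed() { LC_ALL=C sed 's/–//g'; }

# Feed a euro sign (E2 82 AC) through both and dump the surviving bytes in hex.
printf '\342\202\254' | strip_dash_tr  | od -An -tx1
printf '\342\202\254' | strip_dash_sed | od -An -tx1
```

The first dump should be missing the leading E2 byte, which leaves an invalid character behind, while the sed version passes the euro sign through untouched.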

Do you guys know how to optimize this command to make it a bit faster? Could I use awk to get exactly the same result as the sed command above?

Thank you very much!
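
One possible direction for a faster pipeline, offered as a sketch and not a drop-in replacement: read the file directly without cat, merge the single-character deletions into one bracket expression, replace sort | uniq with sort -u, and run everything under LC_ALL=C, which often speeds up GNU sed and sort considerably on large inputs. It assumes UTF-8 input and tags delimited by literal < and >. The function name clean_words is hypothetical, and the output is sorted in byte order, not locale order.

```shell
# Sketch only: clean_words is a made-up name; assumes UTF-8 input and literal < > tags.
clean_words() {
    # 1. Strip tags (joining lines while a tag is still open), then drop _ ( ) , - and the en dash.
    # 2. Put each space-separated word on its own line.
    # 3. Sort and de-duplicate in one step; byte-wise comparison under LC_ALL=C.
    LC_ALL=C sed -e :a -e 's/<[^>]*>//g;/</N;//ba' -e 's/[_(),-]//g;s/–//g' "$1" |
        LC_ALL=C tr ' ' '\n' |
        LC_ALL=C sort -u
}

# Usage: clean_words big.xml > words.txt
```

The en dash is deleted as a whole byte sequence by sed, so multi-byte characters that share some of its bytes are left intact.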

Last edited by Scrutinizer; 06-12-2012 at 12:50 AM..
 
