I don't get why you would need a unique sort, and the sed substitutions are not removing non-Latin characters. Can you give a sample of the input and your desired output?
I'm cleaning out big XML files (we're talking 1GB at least); most of them contain words written in a non-Latin alphabet.
The command I'm using is so slow it's not even funny:
For clarity, I'll decompose your sed script first:
First: I can't understand what lines 2 and 3 in your sed statement do. They look like syntax errors to me. Please clarify.
Then: every call in a pipeline has to shovel through the whole data anew, so it might help to minimize the number of program calls in your pipeline.
is a useless use of cat. Use
instead.
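Since the original snippet isn't quoted in this thread, here is the generic pattern being discussed, with `file.xml` and the `s/foo/bar/g` expression standing in as placeholders:

```shell
# Useless use of cat: an extra process and an extra pipe
cat file.xml | sed 's/foo/bar/g'

# Better: let sed open the file itself
sed 's/foo/bar/g' file.xml
```

Both produce identical output; the second simply avoids one process and one copy of the data through a pipe.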
Then the "tr" command. This could be done inside sed too, yes? You want to change all space characters to newlines:
Replace "<spc>" with a literal space character and "^M" with <ENTER>. To enter a special character like the newline character, press "<CTRL>-V" in vi first. The result will look like "^M" (one character), but will in fact be a quoted newline.
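A minimal sketch of the two approaches, using a throwaway input string (note that `\n` in the `s` replacement is a GNU sed extension; portable sed needs the quoted-newline form described above):

```shell
# tr maps one character to another and is usually the fastest option:
printf 'one two three\n' | tr ' ' '\n'

# The same substitution done inside GNU sed:
printf 'one two three\n' | sed 's/ /\n/g'
```

Both print `one`, `two`, `three` on separate lines.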
might as well be shortened to
Let's get to the sed command itself. You want some characters to be replaced by space characters and some to be deleted (line 4 to the end of your sed script). This could be stated more simply:
I am not sure if this speeds things up, but I presume it does: sed reads one line after another, applying one command after another to each line. It follows that with fewer commands it doesn't have to go through the pattern space as often and should therefore be faster. Still, I haven't tested this, so it is just a presumption. Find out yourself; you might use a somewhat smaller file for testing.
Still, I would be interested in your findings, so please report back once you have carried out these tests and tell us what you found.
To insert a newline with sed's s command, the newline needs to be preceded by a backslash.
In this case, a simple global substitution of space with newline, it's easier to use the y command, which supports the \n sequence: y/ /\n/
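For example, both of the following turn spaces into newlines (the backslash-followed-by-literal-newline form is the portable one):

```shell
# s with a backslash and a literal newline in the replacement (portable):
printf 'a b c\n' | sed 's/ /\
/g'

# y transliterates character for character and accepts \n directly:
printf 'a b c\n' | sed 'y/ /\n/'
```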
A small nit: ^M is a carriage return character; ^J is a newline/linefeed.
Thanks bakunin, it definitely helped. The syntax errors you pointed out were errors indeed. That's pretty much what made the whole thing not work as expected.
It's still a bit slow, but I can live with that. Thanks for your explanations.
Btw: I did not manage to remove / with sed, even when escaping the character... I had to use tr instead.
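For what it's worth, an escaped slash does work in standard sed, so the failure may have been a shell-quoting issue; either way, you can sidestep escaping entirely by picking a different delimiter for the s command. A small sketch:

```shell
printf 'a/b/c\n' | tr -d '/'       # tr, as you ended up using
printf 'a/b/c\n' | sed 's/\///g'   # escaped slash
printf 'a/b/c\n' | sed 's|/||g'    # alternate delimiter, nothing to escape
```

All three print `abc`.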