We need to scramble data in a number of ASCII files. Some of these files are extremely large (1.2 GB). By scrambling, I mean that we need to substitute certain strings, around 400 of them, with scrambled strings. For example:
If "London" occurs in the file, then it needs to be substituted by "X1"
If "Frankfurt" occurs in the file, then it needs to be substituted by "X2".
We have written a Korn shell script, but it has huge performance problems because we need to check for 400 different strings. What is the best way of doing this?
The machine is HP-UX B.11.00 E 9000/800.
The solution suggested by Perderabo works... like LIGHTNING.
Thanks a lot for the help.
Last edited by SanjivNagraj; 07-04-2002 at 07:52 AM..
The exact best approach would depend on the details of your particular system. It always amazes me when folks ask questions without revealing what version of unix, what computer, etc. Well, I'll give this a shot anyway.
The fastest way to do anything is to write a carefully designed assembly language program that will fully exploit the features available on your system. Following close behind would be writing the program in C.
As far as scripts go, the fastest way to perform the two transformations that you mentioned is this:
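(The script body is not preserved in this copy of the post. What follows is only a minimal sketch of what such a filter might look like, assuming a small shell wrapper around sed with one `-e` expression per substitution. The file name `scramble` comes from the post; the sample input line is made up.)

```shell
# Create an executable filter named "scramble" (a sketch, not the original script)
cat > scramble <<'EOF'
#!/bin/sh
# One -e expression per substitution; the g flag replaces every occurrence on a line
exec sed -e 's/London/X1/g' -e 's/Frankfurt/X2/g'
EOF
chmod +x scramble

# Quick check on a made-up sample line
printf 'London to Frankfurt\n' | ./scramble
```

Because sed reads the input line by line and applies both expressions in a single pass, the size of the file matters much less than it does with a shell loop.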
You might call it "scramble" and run it like this:
./scramble < inputfile > outputfile
But you want to do 400 substitutions. sed will have some limit on the number of commands that it can handle. It is not likely that you can get all 400 in one script. You can probably get 100, but the exact limit depends on your version of unix. You could have 4 of these, like this:
./scramble1 < input | ./scramble2 | ./scramble3 | ./scramble4 > output
If your computer has at least 4 CPUs, this might still be unbeatable by any other scripted solution.
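Writing 400 sed commands by hand is tedious, so the scripts could be generated. The following is only a sketch under some assumptions: the mapping lives in a hypothetical file `map.txt` with one "original replacement" pair per line, none of the strings contain `/` or regular-expression metacharacters, and the chunk size is set to 2 purely so that the small demo produces more than one chunk (use something like 100 for the real list).

```shell
# Hypothetical mapping file: one "original replacement" pair per line
printf 'London X1\nFrankfurt X2\nParis X3\nMadrid X4\n' > map.txt

# Turn each pair into a sed substitution command
awk '{ printf "s/%s/%s/g\n", $1, $2 }' map.txt > all.sed

# Split the commands into chunks (2 per chunk here; try 100 for real use)
rm -f chunk.*
split -l 2 all.sed chunk.

# Build a pipeline with one sed process per chunk
cmd="cat"
for f in chunk.*; do
    cmd="$cmd | sed -f $f"
done

# Run the pipeline on a made-up sample line
printf 'London to Frankfurt via Paris\n' | eval "$cmd"
```

Each sed in the pipeline runs as its own process, which is what lets a multi-CPU machine work on several chunks at once. Note that a later chunk sees the output of the earlier ones, so a replacement string should not itself match one of the later search strings.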
The latest version of ksh, ksh93, has much of sed built-in. A carefully written ksh93 script that relies only on built-ins could probably beat the pipeline of sed scripts. But most folks only have ksh88 available.
Try the sed solution and see where that leaves you.
Last edited by Perderabo; 07-03-2002 at 10:22 AM..
I'm using SunOS 5.6, and for me the limit for a sed script file is 199 substitutions. Not 200, but 199. I had a similar exercise once, replacing a certain field with its encrypted value, but I had around 10,000 substitutions to complete.
I'm not sure of the limitations on the -e flag, i.e. I have no idea how many -e's you can have. It may be high, although I doubt it.
If you knew perl, you could do something similar in one pass of the file, although it takes somewhat more effort to set up.
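For what it's worth, the one-pass idea can also be sketched without perl. The example below swaps in awk instead (this is my substitution, not what the post above proposes): it loads the whole mapping into memory and then reads the data file once. The file names `map.txt`, `input.txt` and `output.txt` are hypothetical, and it assumes the search strings contain no regular-expression metacharacters, the replacements contain no `&`, and no search string is a substring of another (awk does not guarantee the order of `for (k in map)`).

```shell
# Hypothetical mapping and sample input
printf 'London X1\nFrankfurt X2\n' > map.txt
printf 'London to Frankfurt\nFrankfurt to London\n' > input.txt

awk '
    # First file: load the mapping table into memory
    NR == FNR { map[$1] = $2; next }
    # Second file: apply every substitution to the current line, then print it
    {
        for (k in map) gsub(k, map[k])
        print
    }
' map.txt input.txt > output.txt

cat output.txt
```

There is no per-script command limit to worry about here, but every line is still tested against every string, so it is worth timing against the sed pipeline before committing to it.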