Streamline script to search for numbers in a certain range
Hello all,
I need help making a script run faster since I have a huge file to sift through. The file I am running my script on looks like this:
The empty lines occur at random throughout the file. The script I wrote adds up the squares of the 5th number on each line whose 2nd, 3rd, and 4th numbers are within certain ranges, and writes the total to a file. Here is the script:
The problem is how slowly this script runs. It processes 100 lines in 22 seconds, but there are 1,008,163 lines (which works out to 2.57 days), and I am going to need to do this on 160 different files of this size (which works out to 1.13 years!!).
I am using a supercomputer, so I have access to plenty of processors, but I don't know how to write scripts that run in parallel (if that is even possible). Can you help me out? Please let me know if I've left out any pertinent information.
Just to be sure we aren't wasting our time here, when you invoke your script, the 1st argument passed to it is the name of your input file. What values do you pass in for the other two arguments?
Please show us the exact command line you use to invoke your script and show us the output that should be produced for your 4 line input file example when given those command line arguments.
PS. Does your script currently work correctly if the lines you tell it to process include any empty lines?
PPS. And, can you give us a few lines of input where the x, y, and z values are all in range (so we don't get 0 results for all of your input)?
Looks like the whole thing can be done with an awk 1-liner
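A minimal sketch of that approach (the range bounds here are placeholders, and the variable names are my guesses at what the original script tested):

Code:
awk -v xmin=-2 -v xmax=2 -v ymin=-2 -v ymax=2 -v zmin=-2 -v zmax=2 '
NF >= 5 &&
$2 >= xmin && $2 <= xmax &&
$3 >= ymin && $3 <= ymax &&
$4 >= zmin && $4 <= zmax { sum += $5 * $5 }
END { printf("%.20f\n", sum) }' density.mesh_index

The NF >= 5 test makes empty lines fall through harmlessly, and values like 3.32E+001 are numbers awk understands natively, so no sed or bc calls are needed per line.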
Hi MadeInGermany,
I added a couple of newlines to your code above to reduce the window width.
I think that's where OP's headed too (except I think the <= in red above should be >= and the ymin initializer should be ymin=-2 instead of ymin=02). But, I'd still like to have a sample input line given where the x, y, and z values are all in range so we can be sure that we get the results the OP's trying to get. (I'm guessing the OP also wants to set OFMT="%.20f" since the OP's using scale=20 in calls to bc. But, I'm not sure awk's internal arithmetic will be sufficient for 20 decimal places of accuracy. The OP might have to have awk print out commands to be executed by bc to get that much precision.)
Assuming that this script with one awk command (and maybe one bc command) per file will run so much faster than the original script with up to 12 calls to sed, 4 calls to awk, and 5 calls to bc per line, you can probably also remove the NRmin and NRmax tests to make it slightly faster yet.
Yes, Don, I apologize. The script is called "integrate" and the file I am searching through is called "density.mesh_index". This is how the command looks:
I realize it is easier to just put those numbers in the script, but I was trying to divide up the work over many processors by having each one do a section of the file.
The script hasn't encountered anything in the input where they are all in range yet, so I can't say that I know it works (that is a good point; I will make a dummy input, check that it works, and get back to you).
I have come across some empty lines, and I get a standard-input error from bc. I assume this is not fatal, since that makes "xbool" an empty string and the if statement can handle that. But will it slow things down?
Let me get back to you after I have run the scenarios you suggested.
As for the awk one-liner, what kind of precision can it handle? I don't know yet that I will need 20 decimal places. I am running these at several different time steps, and they will all be relative to the first. I just kept all the decimal places in case the change is small.
---------- Post updated at 11:56 PM ---------- Previous update was at 11:09 PM ----------
It works even with empty lines. I used the following input:
Here, three lines are out of range: line 2 because of $x, line 3 because of $z, and line 5 because of $y.
The empty lines gave the following error, but it still worked.
I did find a typo in my script in the "totden" variable: I should have used "$dens" instead of "$den".
However, the question is about making it faster. Although the typo was fatal, the question still remains.
What kind of accuracy does the awk function give? I guess I could just try it out and see how it goes. I'll let you all know how that comes out. Again, thanks for the help.
---------- Post updated 12-03-13 at 12:04 AM ---------- Previous update was 12-02-13 at 11:56 PM ----------
Thank you MadeInGermany!!!! That one liner works fast!! Thank you Don for the corrections!! You guys saved me a headache!!
You said you had a large number of files. My hope is that we'll be able to speed it up enough that you will be able to process each file in a reasonable amount of time and you can do your parallelization by processing different files on different CPUs instead of different parts of single files on different CPUs.
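For example, a sketch of that (the glob pattern and the awk script name are placeholders):

Code:
for f in density.mesh_index.*           # one job per input file
do
    awk -f sumsq.awk "$f" > "$f.sum" &  # run each file in the background
done
wait                                    # block until every job has finished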
A simple:
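(a sketch of such a test; the variable name line is an assumption about how your read loop is written)

Code:
[ -z "$line" ] && continue    # skip blank records before any bc call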
shouldn't take long in your script (and won't be needed in an awk script) and will avoid the diagnostic messages from the failed comparisons in bc. Starting up bc even once when you know the comparison can't succeed will easily cost more than running the above test in the shell several dozen (or more likely, several hundred or several thousand) times.
The awk utility (and 1993 or later versions of the Korn shell) uses double-precision floating-point arithmetic, which gives you 15 to 17 decimal digits of precision. If you sum up the squares of 1,000,001 numbers in the range 3.32E+001 to 6.37E+001 (those were the smallest and biggest numbers in field 5 of your sample input in the 1st message in this thread; their squares are 1102.24 and 4057.69), the total could be as large as 4,057,694,057.69, and keeping 20 digits after the decimal point on a number that size would take 30 significant digits, roughly twice what a double can hold. If you have a number with a larger exponent (e.g., 9.87E+010) and a number with a smaller exponent (e.g., 1.23E-008), you will easily exceed double-precision limitations with just those two numbers. Having seen only three lines of your input data, I can't begin to guess whether awk can do what you need or whether you'll need to use awk to feed a stream of values to be added together by bc with as many digits preserved after the decimal point as you desire. Either way, you can do it with a single invocation of awk and no more than one invocation of bc to process an entire input file. And awk should be able to do the conversion of exponential notation to a digit string rather than needing to ask bc to multiply by powers of 10.
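A sketch of that shape (my illustration, not the OP's script): awk does the selection and turns each value into a plain digit string, and a single bc process does the summing. Note that the %.30f conversion still passes each individual value through a double, so each term is limited to double precision even though the running sum is not:

Code:
awk -v xmin=-2 -v xmax=2 -v ymin=-2 -v ymax=2 -v zmin=-2 -v zmax=2 '
BEGIN { print "scale = 40" }    # bc will keep 40 fractional digits
NF >= 5 &&
$2 >= xmin && $2 <= xmax &&
$3 >= ymin && $3 <= ymax &&
$4 >= zmin && $4 <= zmax {
    # expand 3.32E+001-style input to a plain digit string for bc
    printf("s = s + (%.30f) ^ 2\n", $5)
}
END { print "s" }' density.mesh_index | bc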
I hate having to correct myself; but... I forgot about the translation errors in converting some decimal numbers to binary. Some trivial decimal numbers don't have exact binary values, for example:
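For instance (an illustration of the point, assuming IEEE-754 doubles):

Code:
awk 'BEGIN { printf("%.20f\n", 0.1) }'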
which produces:
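Code:
0.10000000000000000555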
This quickly demonstrates what 15 to 17 digits of precision means for double-precision floating-point values. If you need more than 15 digits of precision in decimal values, you need to have awk split the mantissa from the exponent and have bc do the decimal arithmetic to get arbitrary-precision results.