Streamline script to search for numbers in a certain range


 
# 1  
Old 12-02-2013

Hello all,

I need help making a script run faster since I have a huge file to sift through. The file I am running my script on looks like this:
Code:
1   -1.9E+001  -1.8E-001  1.5E+001  3.32E+001
2   -1.7E+001  -1.0E-002  1.2E+001  6.37E+001

3   -1.5E+001  -3.8E-006  6.7E+001  4.81E+001

The empty lines occur at random throughout the file. The script I wrote adds up the squares of the 5th number on each line whose 2nd, 3rd, and 4th numbers are within certain ranges, and writes the total to a file. Here is the script:
Code:
#!/bin/bash
# bounding box for the values in fields 2, 3, and 4
xmin=-18
ymin=-2
zmin=-6

xmax=-8
ymax=8
zmax=4

totden=0

# $1 = input file, $2 = first line to process, $3 = last line to process
o=$2
s=$3

for (( e = $o; e <= $s; e++ ))
do
        # rewrite e.g. 1.5E+001 as 1.5*10^001 so bc can evaluate it
        x=$( sed -n ${e}p $1 | awk '{print $2}' | sed 's/E/\*10\^/' | sed 's/\+//' )
        xbool=$( echo "$x <= $xmax && $x >= $xmin" | bc )
        if [ "$xbool" == 1 ]
        then
                y=$( sed -n ${e}p $1 | awk '{print $3}' | sed 's/E/\*10\^/' | sed 's/\+//' )
                ybool=$( echo "$y <= $ymax && $y >= $ymin" | bc )
                if [ "$ybool" == 1 ]
                then
                        z=$( sed -n ${e}p $1 | awk '{print $4}' | sed 's/E/\*10\^/' | sed 's/\+//' )
                        zbool=$( echo "$z <= $zmax && $z >= $zmin" | bc )
                        if [ "$zbool" == 1 ]
                        then
                                psi=$( sed -n ${e}p $1 | awk '{print $5}' | sed 's/E/\*10\^/' | sed 's/\+//' )
                                dens=$( echo "scale=20; ($psi)^2" | bc )
                                totden=$( echo "scale=20;$totden + $den" | bc )

                        fi
                fi
        fi
done
echo $totden >> /home/butson/phys4010/research/octopus/CPF/RMMDIIS/dens/densities

The problem is how slowly this script runs. It does 100 lines in 22 seconds, but there are 1,008,163 lines (which adds up to 2.57 days), and I am going to need to do this on 160 different files of this size (which adds up to 1.13 years!!).
I am using a supercomputer so I have access to plenty of processors, but I don't know how to write scripts to run in parallel (if that is even possible). Can you help me out? Please let me know if I've left out any pertinent information.
# 2  
Old 12-02-2013
Just to be sure we aren't wasting our time here, when you invoke your script, the 1st argument passed to it is the name of your input file. What values do you pass in for the other two arguments?

Please show us the exact command line you use to invoke your script, and show us the output that should be produced for your 4-line sample input when given those command-line arguments.

PS. Does your script currently work correctly if the lines you tell it to process include any empty lines?

PPS. And, can you give us a few lines of input where the x, y, and z values are all in range (so we don't get 0 results for all of your input)?

Last edited by Don Cragun; 12-02-2013 at 07:05 PM.. Reason: Add postscripts.
# 3  
Old 12-02-2013
Looks like the whole thing can be done with an awk 1-liner
Code:
awk 'NR <= NRmax && NR >= NRmin && $2 <= xmax && $2 >= xmin && $3 <= ymax && $3 <= ymin && $4 <= zmax && $4 >= zmin { totden += $5 ^ 2 } END {print totden+0}' xmin=-18 ymin=02 zmin=-6 xmax=-8 ymax=8 zmax=4 NRmin=1 NRmax=3 file

# 4  
Old 12-02-2013
Quote:
Originally Posted by MadeInGermany
Looks like the whole thing can be done with an awk 1-liner
Code:
awk '
NR <= NRmax && NR >= NRmin && $2 <= xmax && $2 >= xmin && $3 <= ymax && $3 <= ymin && $4 <= zmax && $4 >= zmin {
        totden += $5 ^ 2 }
END {print totden+0}
' xmin=-18 ymin=02 zmin=-6 xmax=-8 ymax=8 zmax=4 NRmin=1 NRmax=3 file

Hi MadeInGermany,
I added a couple of newlines to your code above to reduce the window width.

I think that's where the OP's headed too (except I think the <= in the $3 <= ymin test above should be >=, and the ymin initializer should be ymin=-2 instead of ymin=02). But I'd still like to have a sample input line given where the x, y, and z values are all in range so we can be sure that we get the results the OP is trying to get. (I'm guessing the OP also wants to set OFMT="%.20f" since the OP is using scale=20 in calls to bc. But I'm not sure awk's internal arithmetic will be sufficient for 20 decimal places of accuracy. The OP might have to have awk print out commands to be executed by bc to get that much precision.)

Assuming that this script with one awk command (and maybe one bc command) per file will run so much faster than the original script, which makes up to 12 calls to sed, 4 calls to awk, and 5 calls to bc per line, you can probably also remove the NRmin and NRmax tests to make it slightly faster yet.
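
Just to illustrate, the trimmed-down command might end up looking something like this (an untested sketch with the corrections above folded in; the printf format is only a guess at the precision wanted):
Code:
awk '
$2 >= xmin && $2 <= xmax && $3 >= ymin && $3 <= ymax && $4 >= zmin && $4 <= zmax {
        totden += $5 ^ 2 }
END {printf("%.20f\n", totden)}
' xmin=-18 ymin=-2 zmin=-6 xmax=-8 ymax=8 zmax=4 file

Note that empty lines fall out on their own here: on an empty line $2 evaluates as 0, which fails the $2 <= xmax test.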

Last edited by Don Cragun; 12-02-2013 at 08:02 PM.. Reason: Fix typo.
# 5  
Old 12-03-2013
Yes, Don, I apologize. The script is called "integrate" and the file I am searching through is called "density.mesh_index". This is how the command looks:
Code:
./integrate density.mesh_index 3 1008163

I realize it would be easier to just put those numbers in the script, but I was trying to divide up the work over many processors by having each one do a section of the file.
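The idea was to kick off one chunk per processor, something like this (the line splits here are just for illustration):
Code:
./integrate density.mesh_index 1 252040 &
./integrate density.mesh_index 252041 504081 &
./integrate density.mesh_index 504082 756122 &
./integrate density.mesh_index 756123 1008163 &
wait
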
The script hasn't encountered any input lines where all three values are in range yet, so I can't say I know it works (that is a good point; I will make a dummy input, check that it works, and get back to you).
I have come across some empty lines, and I get a standard input error because of bc. I assume this is not fatal, since that makes "xbool" an empty string and the if statement can handle that. But will it slow things down?
Let me get back to you after I have run the scenarios you suggested.
As far as the awk one-liner goes, what kind of precision can it handle? I don't know yet that I will need 20 decimal places. I am running these at several different time steps and they will all be relative to the first; I just kept all the decimal places in case the change is small.

---------- Post updated at 11:56 PM ---------- Previous update was at 11:09 PM ----------

It works even with empty lines. I used the following input:
Code:
1 -1.0E+001 0.0E+000 -3.0E+000 2.5E+001
2 -2.0E-001 1.0E+000 -2.0E+000 2.5E+001
3 -8.4E+000 2.0E+000 -1.0E+002 2.5E+001
4 -1.40E+001 3.0E+000 0.0E+000 2.5E+001
5 -8.3E+000 4.0E+002 1.0E+000 2.5E+001

6 -1.60E+001 5.0E+000 -6.0E+000 2.5E+001
7 -1.70E+001 6.0E+000 -5.0E+000 2.5E+001
8 -8.0E+000 7.0E+000 -4.0E+000 2.5E+001

Here there are 3 lines out of range: line 2 because of $x, line 3 because of $z, and line 5 because of $y.
The empty lines gave the following error, but it still worked:
Code:
(standard_in) 1: syntax error
(standard_in) 1: syntax error

I did find a typo in my script in the "totden" line: I should have used "$dens" instead of "$den".
However, the question is about making it faster. Although the typo was fatal, the question still remains.
What kind of accuracy does the awk version give? I guess I could just try it out and see how it goes. I'll let you all know how that comes out. Again, thanks for the help.

---------- Post updated 12-03-13 at 12:04 AM ---------- Previous update was 12-02-13 at 11:56 PM ----------

Thank you MadeInGermany!!!! That one liner works fast!! Thank you Don for the corrections!! You guys saved me a headache!!
# 6  
Old 12-03-2013
Quote:
Originally Posted by butson
Yes, Don, I apologize. The script is called "integrate" and the file I am searching through is called "density.mesh_index". This is how the command looks:
Code:
./integrate density.mesh_index 3 1008163

I realize it would be easier to just put those numbers in the script, but I was trying to divide up the work over many processors by having each one do a section of the file.
The script hasn't encountered any input lines where all three values are in range yet, so I can't say I know it works (that is a good point; I will make a dummy input, check that it works, and get back to you).
I have come across some empty lines, and I get a standard input error because of bc. I assume this is not fatal, since that makes "xbool" an empty string and the if statement can handle that. But will it slow things down?
Let me get back to you after I have run the scenarios you suggested.
As far as the awk one-liner goes, what kind of precision can it handle? I don't know yet that I will need 20 decimal places. I am running these at several different time steps and they will all be relative to the first; I just kept all the decimal places in case the change is small.
You said you had a large number of files. My hope is that we'll be able to speed this up enough that you can process each file in a reasonable amount of time and do your parallelization by processing different files on different CPUs instead of different parts of a single file on different CPUs.
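
For example, a per-file run might look like this minimal sketch (the file names and the integrate_awk wrapper are made up; submitting one batch job per file to your scheduler would work just as well):
Code:
# one background job per input file; collect each file's total separately
for f in density.mesh_index.*
do
        ./integrate_awk "$f" > "$f.totden" &
done
wait    # let all background jobs finish before using the results
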

A simple:
Code:
        if [ -z "$x" ]
        then    continue
        fi
            or
        [ -z "$x" ] && continue

shouldn't take long in your script (and won't be needed in an awk script) and will avoid the diagnostic messages from the failed comparisons in bc. The cost of starting up bc once when you know it can't succeed will easily exceed the cost of running the above code in the shell several dozen (or more likely, several hundred or several thousand) times.

The awk utility (and 1993 or later versions of the Korn shell) uses double-precision floating-point arithmetic, which gives you 15 to 17 decimal digits of precision. If you sum the squares of 1,000,001 numbers ranging from 3.32E+001 to 6.37E+001 (the smallest and largest field 5 values in your sample input in the 1st message in this thread, whose squares are 1102.24 and 4057.69), the total could be as large as 4,057,694,057.69, which is 12 significant digits. If you have a number with a larger exponent (e.g., 9.87E+010) and a number with a smaller exponent (e.g., 1.23E-008), you will easily exceed double-precision limitations with just those two numbers. Having seen only three lines of your input data, I can't begin to guess whether awk can do what you need or whether you'll need to use awk to feed a stream of values to be added together by bc, with as many digits preserved after the decimal point as you desire. Either way, you can do it with a single invocation of awk and no more than one invocation of bc to process an entire input file. And awk should be able to convert the exponential notation to a digit string itself, rather than asking bc to multiply by powers of 10.
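For the record, the awk-feeds-bc approach might look roughly like this (a sketch, not tested; the %.10f format is an arbitrary choice and assumes each individual field 5 value fits comfortably in a double):
Code:
awk '
$2 >= xmin && $2 <= xmax && $3 >= ymin && $3 <= ymax && $4 >= zmin && $4 <= zmax {
        # emit one bc statement per in-range line; bc keeps the running sum
        printf("t += (%.10f)^2\n", $5)
}
END {print "t"}
' xmin=-18 ymin=-2 zmin=-6 xmax=-8 ymax=8 zmax=4 file | bc -l

(bc -l starts with scale=20, which matches the scale=20 used in the original script.)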
# 7  
Old 12-03-2013
I hate having to correct myself, but I forgot about the translation errors that occur when converting some decimal numbers to binary. Even trivial decimal numbers don't have exact binary representations; for example:
Code:
awk 'BEGIN{printf("%.20f\n%.20f\n", 1E-001, 1e100);exit}'

which produces:
Code:
0.10000000000000000555
10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104.00000000000000000000

quickly demonstrates what 15-17 digits of precision means for double-precision floating-point values. If you need more than 15 digits of precision in decimal values, you need to have awk split the mantissa from the exponent and have bc do the decimal arithmetic to get arbitrary-precision results.
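
A bare-bones sketch of that split (again untested, with the range tests omitted to keep the focus on the conversion; it assumes field 5 always has the mantissaEexponent form shown in the samples):
Code:
awk '{
        # split e.g. "3.32E+001" into mantissa p[1] and exponent p[2]
        if (split($5, p, /[Ee]/) == 2)
                printf("t += (%s * 10^%d)^2\n", p[1], p[2] + 0)
} END {print "t"}' file | bc -l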