Streamline script to search for numbers in a certain range
Hello all,
I need help making a script run faster since I have a huge file to sift through. The file I am running my script on looks like this:
The empty lines occur at random throughout the file. The script I wrote adds up the squares of the 5th number on each line whose 2nd, 3rd, and 4th numbers are within certain ranges, and writes the total to a file. Here is the script:
The problem is how slowly this script runs. It processes 100 lines in 22 seconds, but there are 1,008,163 lines (which works out to 2.57 days), and I am going to need to do this on 160 different files of this size (which works out to 1.13 years!!).
I am using a supercomputer, so I have access to plenty of processors, but I don't know how to write scripts that run in parallel (if that is even possible). Can you help me out? Please let me know if I've left out any pertinent information.
Just to be sure we aren't wasting our time here, when you invoke your script, the 1st argument passed to it is the name of your input file. What values do you pass in for the other two arguments?
Please show us the exact command line you use to invoke your script and show us the output that should be produced for your 4 line input file example when given those command line arguments.
PS. Does your script currently work correctly if the lines you tell it to process include any empty lines?
PPS. And, can you give us a few lines of input where the x, y, and z values are all in range (so we don't get 0 results for all of your input)?
Looks like the whole thing can be done with an awk 1-liner
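A minimal sketch of that approach (the range bounds here are placeholders, and the variable names are my guesses at what the original script tested):

Code:
awk -v xmin=-2 -v xmax=2 -v ymin=-2 -v ymax=2 -v zmin=-2 -v zmax=2 '
NF >= 5 &&
$2 >= xmin && $2 <= xmax &&
$3 >= ymin && $3 <= ymax &&
$4 >= zmin && $4 <= zmax { sum += $5 * $5 }
END { printf("%.20f\n", sum) }' density.mesh_index

The NF >= 5 test makes empty lines fall through harmlessly, and values like 3.32E+001 are numbers awk understands natively, so no sed or bc calls are needed per line.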
Hi MadeInGermany,
I added a couple of newlines to your code above to reduce the window width.
I think that's where OP's headed too (except I think the <= in red above should be >= and the ymin initializer should be ymin=-2 instead of ymin=02). But, I'd still like to have a sample input line given where the x, y, and z values are all in range so we can be sure that we get the results the OP's trying to get. (I'm guessing the OP also wants to set OFMT="%.20f" since the OP's using scale=20 in calls to bc. But, I'm not sure awk's internal arithmetic will be sufficient for 20 decimal places of accuracy. The OP might have to have awk print out commands to be executed by bc to get that much precision.)
Assuming that this script with one awk command (and maybe one bc command) per file will run so much faster than the original script with up to 12 calls to sed, 4 calls to awk, and 5 calls to bc per line, you can probably also remove the NRmin and NRmax tests to make it slightly faster yet.
Yes, Don, I apologize. The script is called "integrate" and the file I am searching through is called "density.mesh_index". This is how the command looks:
I realize it is easier to just put those numbers in the script, but I was trying to divide up the work over many processors by having each one do a section of the file.
The script hasn't encountered anything in the input where they are all in range yet, so I can't say that I know it works (that is a good point; I will make a dummy input, check that it works, and get back to you).
I have come across some empty lines, and I get a standard-input error from bc. I assume this is not fatal, since that makes "xbool" an empty string and the if statement can handle that. But will it slow things down?
Let me get back to you after I have run the scenarios you suggested.
As for the awk one-liner, what kind of precision can it handle? I don't know yet that I will need 20 decimal places. I am running these at several different time steps, and they will all be relative to the first. I just kept all the decimal places in case the change is small.
---------- Post updated at 11:56 PM ---------- Previous update was at 11:09 PM ----------
It works even with empty lines. I used the following input:
Here, three lines are out of range: line 2 because of $x, line 3 because of $z, and line 5 because of $y.
The empty lines gave the following error, but it still worked.
I did find a typo in my script in the "totden" variable: I should have used "$dens" instead of "$den".
However, the question is about making it faster. Although the typo was fatal, the question still remains.
What kind of accuracy does the awk function give? I guess I could just try it out and see how it goes. I'll let you all know how that comes out. Again, thanks for the help.
---------- Post updated 12-03-13 at 12:04 AM ---------- Previous update was 12-02-13 at 11:56 PM ----------
Thank you MadeInGermany!!!! That one liner works fast!! Thank you Don for the corrections!! You guys saved me a headache!!
You said you had a large number of files. My hope is that we'll be able to speed it up enough that you will be able to process each file in a reasonable amount of time and you can do your parallelization by processing different files on different CPUs instead of different parts of single files on different CPUs.
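For example, a sketch of that (the glob pattern and the awk script name are placeholders):

Code:
for f in density.mesh_index.*           # one job per input file
do
    awk -f sumsq.awk "$f" > "$f.sum" &  # run each file in the background
done
wait                                    # block until every job has finished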
A simple:
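(a sketch of such a test; the variable name line is an assumption about how your read loop is written)

Code:
[ -z "$line" ] && continue    # skip blank records before any bc call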
shouldn't take long in your script (and won't be needed in an awk script) and will avoid the diagnostic messages from the failed comparisons in bc. Starting up bc even once when you know the comparison can't succeed will easily cost more than running the above test in the shell several dozen (or more likely, several hundred or several thousand) times.
The awk utility (and 1993 or later versions of the Korn shell) uses double-precision floating-point arithmetic, which gives you 15 to 17 decimal digits of precision. If you sum up the squares of 1,000,001 numbers in the range 3.32E+001 to 6.37E+001 (those were the smallest and biggest numbers in field 5 of your sample input in the 1st message in this thread; their squares are 1102.24 and 4057.69), the total could be as large as 4,057,694,057.69, and keeping 20 digits after the decimal point on a number that size would take 30 significant digits, roughly twice what a double can hold. If you have a number with a larger exponent (e.g., 9.87E+010) and a number with a smaller exponent (e.g., 1.23E-008), you will easily exceed double-precision limitations with just those two numbers. Having seen only three lines of your input data, I can't begin to guess whether awk can do what you need or whether you'll need to use awk to feed a stream of values to be added together by bc with as many digits preserved after the decimal point as you desire. Either way, you can do it with a single invocation of awk and no more than one invocation of bc to process an entire input file. And awk should be able to do the conversion of exponential notation to a digit string rather than needing to ask bc to multiply by powers of 10.
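A sketch of that shape (my illustration, not the OP's script): awk does the selection and turns each value into a plain digit string, and a single bc process does the summing. Note that the %.30f conversion still passes each individual value through a double, so each term is limited to double precision even though the running sum is not:

Code:
awk -v xmin=-2 -v xmax=2 -v ymin=-2 -v ymax=2 -v zmin=-2 -v zmax=2 '
BEGIN { print "scale = 40" }    # bc will keep 40 fractional digits
NF >= 5 &&
$2 >= xmin && $2 <= xmax &&
$3 >= ymin && $3 <= ymax &&
$4 >= zmin && $4 <= zmax {
    # expand 3.32E+001-style input to a plain digit string for bc
    printf("s = s + (%.30f) ^ 2\n", $5)
}
END { print "s" }' density.mesh_index | bc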
I hate having to correct myself; but... I forgot about the translation errors in converting some decimal numbers to binary. Some trivial decimal numbers don't have exact binary values, for example:
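For instance (an illustration of the point, assuming IEEE-754 doubles):

Code:
awk 'BEGIN { printf("%.20f\n", 0.1) }'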
which produces:
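Code:
0.10000000000000000555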
This quickly demonstrates what 15 to 17 digits of precision means for double-precision floating-point values. If you need more than 15 digits of precision in decimal values, you need to have awk split the mantissa from the exponent and have bc do the decimal arithmetic to get arbitrary-precision results.