speeding up bash script with "while read line"


 
# 1  
Old 05-31-2011

Hello everybody,
I'm still slowly working my way into bash scripting (without any prior programming experience), so my code is mostly what some might call "creative", if they meant well :)

I have created a script that serves its purpose, but it does so very slowly: it has to work through ~1 million lines of text input with a "while read line" loop, which slows the process down terribly.

If it's in any way possible I'd like to speed up the script and would appreciate any suggestions you may have.
Code:
x=0
MOLprev=0
while read line
do
    MOLcur=$(echo $line | awk '{print $5}')    
    if [[ "$MOLprev" -eq 9999 && "$MOLprev" -ne "$MOLcur" ]]
    then
        x=$((x+1))
        SCALE=$((x*10000))
    fi
    echo $line | awk -v var1=`echo $MOLprev` -v var2=`echo $SCALE` '{ 
        if (var1 -ge $5) 
            printf("%4s%7s%5s%4s%6s%12s%8s%8s\n",$1,$2,$3,$4,$5+var2,$6,$7,$8)
        else 
            printf("%4s%7s%5s%4s%6s%12s%8s%8s\n",$1,$2,$3,$4,$5,$6,$7,$8); 
        }'
    MOLprev=$MOLcur
done < atoms_done > $OUTPUT

The script works on column 5 of a text file called "atoms_done" (a .pdb, Protein Data Bank, file). Column 5 holds integer values that represent a molecule number. There are several molecule types (column 4) in the file (see the example below).

The molecule number (column 5) increases until it reaches 9999, after which it jumps back to 0 and counts up to 9999 again. I want to rewrite or edit the existing file to correct just that, so that the numbering keeps increasing above 9999 until the end of the file.

ATOM 11393 OW SOL 1997 10.570 25.370 66.140
ATOM 11394 HW1 SOL 1997 10.990 24.510 65.850
ATOM 11395 HW2 SOL 1997 11.260 25.970 66.540
ATOM 11396 OW SOL 1998 26.270 16.020 58.330
ATOM 11397 HW1 SOL 1998 27.210 16.140 58.670
ATOM 11398 HW2 SOL 1998 25.800 16.900 58.370
ATOM 11399 OW SOL 1999 7.760 28.120 61.090
ATOM 11400 HW1 SOL 1999 6.970 28.740 61.090
ATOM 11401 HW2 SOL 1999 8.260 28.210 61.950
ATOM 11402 OW SOL 2000 36.170 4.250 62.330
ATOM 11403 HW1 SOL 2000 35.280 3.810 62.490
ATOM 11404 HW2 SOL 2000 36.030 5.100 61.830
ATOM 11405 C1 MeO 2001 19.100 14.520 124.300
ATOM 11406 O1 MeO 2001 19.850 14.620 123.120
ATOM 11407 HO1 MeO 2001 19.520 15.360 122.630
ATOM 11408 HC1 MeO 2001 18.190 13.930 124.210
ATOM 11409 HC2 MeO 2001 19.740 14.210 125.120
ATOM 11410 HC3 MeO 2001 18.730 15.500 124.600
ATOM 11411 C1 MeO 2002 19.280 3.410 94.800
ATOM 11412 O1 MeO 2002 20.380 3.410 95.710
ATOM 11413 HO1 MeO 2002 21.020 3.970 95.290
ATOM 11414 HC1 MeO 2002 18.320 3.220 95.290


Thank you for any helpful comments and suggestions.

Cheers.
# 2  
Old 05-31-2011
Quote:
Originally Posted by origamisven
Hello everybody,
I'm still slowly working my way into bash scripting (without any prior programming experience), so my code is mostly what some might call "creative", if they meant well :)

I have created a script that serves its purpose, but it does so very slowly: it has to work through ~1 million lines of text input with a "while read line" loop, which slows the process down terribly.
It's not the read that's slow, it's "echo stuff | awk". You do that twice, so for a file with a million lines, you're running two million separate instances of awk! awk is a full-fledged language in its own right which you're loading, running, and quitting each time you use it. It's capable of processing more than one line, which is the efficient way to use it -- instead of 99% time spent loading/quitting, most time will be spent actually processing. You might as well do it all in awk.

You also have lots of useless use of backticks. Why do var=`echo stuff` when you can just do var=stuff ?

Incidentally, the shell can split variables by itself. You could do while read V1 V2 V3 V4 V5 V6 V7 to get rid of that first awk.

Code:
awk '
# The BEGIN block gets run once, before any lines are read
BEGIN { x=0; MOLprev=0 }
# This block gets run once per line
{
  if ((MOLprev == 9999) && (MOLprev != $5))
  {
    x++;
    SCALE=x*10000;
  }

  # Always add the offset: SCALE is 0 until the first wrap past 9999,
  # and every line after a wrap needs it, not only the first one.
  printf("%4s%7s%5s%4s%6s%12s%8s%8s\n",$1,$2,$3,$4,$5+SCALE,$6,$7,$8);

  MOLprev=$5;
}' < inputfile > outputfile
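The "while read V1 V2 ..." idea mentioned above can also be taken all the way, with no awk at all. The following is only a sketch: renumber_molecules is a made-up helper name, and it assumes every input line has eight whitespace-separated fields with a plain integer (no leading zeros) in column 5, as in the sample data. A shell loop like this will probably still be slower than the single awk pass, but it no longer starts a new process for every line.

```shell
#!/usr/bin/env bash
# Pure-bash renumbering sketch: reads stdin, writes stdout.
# renumber_molecules is a hypothetical name, not part of the original script.
renumber_molecules() {
    local rec serial atom res mol x y z
    local prev=0 scale=0
    # read splits each line on whitespace into the named variables
    while read -r rec serial atom res mol x y z; do
        # Previous number was 9999 and the number changed: the counter
        # wrapped, so raise the offset by another 10000.
        if (( prev == 9999 && mol != prev )); then
            (( scale += 10000 ))
        fi
        # printf is a shell builtin, so no external process is started here
        printf '%4s%7s%5s%4s%6s%12s%8s%8s\n' \
            "$rec" "$serial" "$atom" "$res" "$(( mol + scale ))" "$x" "$y" "$z"
        prev=$mol
    done
}

# Small made-up sample that wraps from 9999 to 0
printf '%s\n' \
    'ATOM 1 OW SOL 9999 1.000 2.000 3.000' \
    'ATOM 2 OW SOL 0 4.000 5.000 6.000' \
    'ATOM 3 OW SOL 1 7.000 8.000 9.000' | renumber_molecules
```

For a real file the call would look like "renumber_molecules < inputfile > outputfile".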

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 05-31-2011
See if this will improve your script:
Code:
#!/usr/bin/ksh
typeset -i mPrev=7777
typeset -i mCurr=-1
while read mFld1 mFld2 mFld3 mFld4 mFld5 mOther
do
  if [[ ${mFld5} -ne ${mPrev} ]]; then
    mCurr=${mCurr}+1
  fi
  echo ${mFld1} ${mFld2} ${mFld3} ${mFld4} ${mCurr} ${mOther}
  mPrev=${mFld5}
done < input_file

This User Gave Thanks to Shell_Life For This Post:
# 4  
Old 05-31-2011
For the sake of argument...

You could change:
Code:
mCurr=${mCurr}+1

to:
Code:
((mCurr+=1))

to have the Korn shell perform integer math inline and speed things up even more. It's actually quite surprising how much difference it makes. Consider this example:
Code:
#!/bin/ksh

integer limit=100000
integer ctr=0

function method1 {
ctr=0
print "\nfunction 1 starting $ctr"
while ((ctr < $limit))
do
   ctr=${ctr}+1
done
print "function 1 ending $ctr"
}

function method2 {
ctr=0
print "\nfunction 2 starting $ctr"
while ((ctr < $limit))
do
  ((ctr+=1))
done
print "function 2 ending $ctr"
}

time method1

time method2

exit 0

Output:
Code:
 $ ./aa

function 1 starting 0
function 1 ending 100000

real    0m3.46s
user    0m3.45s
sys     0m0.00s

function 2 starting 0
function 2 ending 100000

real    0m2.16s
user    0m2.16s
sys     0m0.00s
  $

While not a huge difference for this simple example, it could make an impact if multiple calculations are being done on millions of records.
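Since the thread is about bash, here is a rough bash counterpart of the same comparison. It is only a sketch: count_assign and count_arith are made-up names, and no timings are shown because they depend on the shell, its version, and the machine, so it is worth measuring on your own system.

```shell
#!/usr/bin/env bash
# Two ways to increment an integer counter in bash.
# count_assign and count_arith are hypothetical names for this comparison.
limit=100000

count_assign() {
    local -i ctr=0        # integer attribute: assigned values are evaluated as arithmetic
    while (( ctr < limit )); do
        ctr=${ctr}+1      # the string "N+1" is built first, then evaluated on assignment
    done
    echo "$ctr"
}

count_arith() {
    local -i ctr=0
    while (( ctr < limit )); do
        (( ctr += 1 ))    # arithmetic is done in place, no string is built
    done
    echo "$ctr"
}

# The time keyword reports real/user/sys for each function call
time count_assign
time count_arith
```

Both functions print 100000; only the time taken to get there differs.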
This User Gave Thanks to gary_w For This Post:
# 5  
Old 05-31-2011
Quote:
Originally Posted by Corona688
It's not the read that's slow, it's "echo stuff | awk".(...) It's capable of processing more than one line, which is the efficient way to use it -- instead of 99% time spent loading/quitting, most time will be spent actually processing. (...) You also have lots of useless use of backticks. (...) Incidentally, the shell can split variables by itself. You could do while read V1 V2 V3 V4 V5 V6 V7 to get rid of that first awk.
This was extremely helpful and will certainly improve my scripting attempts in the future. Thanks a bunch.

Just for the sake of some other rookie searching on the same issue, this is the code that worked for me...

Code:
awk '
# The BEGIN block gets run once, before any lines are read
BEGIN { x=0; MOLprev=0; }
# This block gets run once per line
{

  if ((MOLprev == 9999) && (MOLprev != $5))
  {
    x++;
    SCALE=x*10000;
  }

  # awk has no -ge operator; "MOLprev -ge $5" evaluated to a non-empty
  # string and was therefore always true, so the offset is simply always added.
  printf("%4s%7s%5s%4s%6s%12s%8s%8s\n",$1,$2,$3,$4,$5+SCALE,$6,$7,$8);
            
  MOLprev=$5;
}' < atoms_done > $OUTPUT
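For anyone who wants to check the renumbering before running it on a million lines, a small synthetic input is enough. This is just a sketch: the five lines below are made up, and the awk program is the wrap detection from the script above reduced to printing column 5.

```shell
#!/usr/bin/env bash
# Build a tiny test file that wraps from 9999 back to 0.
# Atom names and coordinates are invented for this check.
input=$(mktemp)
printf '%s\n' \
    'ATOM 1 OW SOL 9998 1.000 2.000 3.000' \
    'ATOM 2 OW SOL 9999 1.000 2.000 3.000' \
    'ATOM 3 HW1 SOL 9999 1.000 2.000 3.000' \
    'ATOM 4 OW SOL 0 1.000 2.000 3.000' \
    'ATOM 5 OW SOL 1 1.000 2.000 3.000' > "$input"

# Same wrap detection as above, printing only the corrected molecule number
renumbered=$(awk '
{
  if ((MOLprev == 9999) && (MOLprev != $5)) { x++; SCALE = x * 10000 }
  print $5 + SCALE
  MOLprev = $5
}' "$input")
rm -f "$input"

echo "$renumbered"
# prints 9998, 9999, 9999, 10000, 10001 (one per line)
```

The two lines that share 9999 stay at 9999, and the numbering continues at 10000 after the wrap.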

This User Gave Thanks to origamisven For This Post: