Speed up bash loop?


 
# 1  
Old 11-17-2015

I am running the bash loop below on all the files of a specific type (*base_counts.txt) in a directory. Each iteration runs 4 awk commands, each of which reads a list of names and searches another file for matches. The name lists range from 27 to 259 lines. The file that is searched is 11,137,660 lines. The loop does run, but it takes ~20 hours to complete on a computer with 64GB of RAM and an 8-core Xeon processor. Is this normal, and can it be made faster (more efficient)? Thank you :).

Code:
for f in /home/cmccabe/Desktop/HiQ/*base_counts.txt ; do
     bname=$(basename "$f")
     pref=${bname%%.txt}
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PCD_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PCD_coverage.
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/BMF_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_BMF_coverage.bed
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PAH_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PAH_coverage.bed
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PID_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PID_coverage.bed
done

awk
Code:
BEGIN {
  FS="[ \t|]*"
}
# Read search terms from file1 into 's'
FNR==NR {
    s[$0]
    next
}
{
    # Check if $5 matches one of the search terms
    for(i in s) {
        if($5 ~ i) {

            # Store first two fields for later usage
            a[$5]=$1
            b[$5]=$2

            # Add $8 to the running total for this $5
            t[$5]+=$8
            # Increment count of occurrences of $5
            c[$5]++

            next
        }
    }
}
END {

    # Calculate average and print output for all search terms
    # that have been found
    for( i in t ) {
        avg = t[i] / c[i]
        printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
    }
}

# 2  
Old 11-17-2015
Can you post sample files?

Is the $5 string longer than the searched pattern?
# 3  
Old 11-17-2015
The search strings (the 27 to 259 names in a file):

Code:
ABCA3
ACVRL1
AGRN

The file to search in $5 (the 11,137,660-line file):
Code:
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2

So the expected output would be:

Code:
chr1:955543    AGRN-6|gc=75     3

Only $4 and $5 from lines where a match was found, plus the average of $7, are printed.

Thank you very much :).
# 4  
Old 11-17-2015
If $5 can always be split on the hyphen, i.e. AGRN-6|gc=75 reduces to AGRN, this could speed up the process.

To put you on track:

Code:
BEGIN{FS="[\t| -]+"}

FNR==NR {
	s[$0]=1
	next
}

# if $5 is one of the stored names --> do something
# ("in" tests membership without creating an empty s[$5] entry)
$5 in s {
	# do something
}

If mawk is available on your box, it's usually faster.
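
To flesh out the "do something" part, here is a minimal sketch of the array-lookup approach. It assumes the big file is tab-separated like the sample and that the name always precedes the first hyphen in $5. The file names, the demo data, and the choice to sum $8 (the column holding 1, 2, 3 under this FS) are assumptions for illustration, not taken from the original match.awk.

```shell
# Hypothetical demo data: a names file and a few tab-separated lines.
tmp=$(mktemp -d)
printf '%s\n' ABCA3 ACVRL1 AGRN > "$tmp/names.txt"
printf 'chr1\t955543\t955763\tchr1:955543\tAGRN-6|gc=75\t%s\t%s\n' 1 0 2 2 3 2 > "$tmp/counts.txt"
printf 'chr1\t100\t200\tchr1:100\tXYZ-1|gc=50\t1\t9\n' >> "$tmp/counts.txt"

lookup_out=$(awk '
BEGIN { FS = "[\t| -]+" }            # the hyphen is a separator, so $5 is the bare name
FNR == NR { s[$1]; next }            # first file: remember every name
$5 in s { sum[$5] += $8; n[$5]++ }   # one hash lookup per line, no regex loop
END { for (g in sum) printf "%s\t%d\t%d\n", g, n[g], sum[g] }
' "$tmp/names.txt" "$tmp/counts.txt")

printf '%s\n' "$lookup_out"          # name, number of lines, total
rm -rf "$tmp"
```

Each line of the big file now costs one array lookup instead of up to 259 regular expression matches.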
# 5  
Old 11-17-2015
In your code, you are saving $1 in a[] and $2 in b[] and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship occur in all lines in your file? (Saving and printing $4 in an array will be faster than saving $1 in an array, saving $2 in another array, and printing both of them.) And, you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!

Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?

Are your search strings always to be exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> character in $5? (Your script will run MUCH faster if you perform one test to determine if a string is a subscript in an array instead of an average of 14-130 regular expression matches.)
# 6  
Old 11-17-2015
You're reading a 3/4 GB file four times - I don't know if disk I/O buffering will easily cater for that. Why don't you read your four .bed files into four different (multidimensional?) arrays (259 is not too large an array element count), then do your four independent calculations on each large file's input line, and then output to the four different result files?
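
One way to sketch that single-pass idea, under stated assumptions: the panel files hold one name per line, the big file is tab-separated, and the name is the part of $5 before the first hyphen. The paths, the two demo panels, the `col` variable (which column to average) and the output naming are hypothetical stand-ins, not the original setup.

```shell
tmp=$(mktemp -d)
# Hypothetical panel files (names only) and one counts file.
printf '%s\n' AGRN ABCA3 > "$tmp/PCD.bed"
printf '%s\n' ACVRL1 AGRN > "$tmp/BMF.bed"
printf 'chr1\t955543\t955763\tchr1:955543\tAGRN-6|gc=75\t%s\t%s\n' 1 0 2 2 3 2 > "$tmp/counts.txt"
printf 'chr12\t52301000\t52301200\tchr12:52301000\tACVRL1-2|gc=60\t1\t5\n' >> "$tmp/counts.txt"

awk -v outdir="$tmp" -v big="$tmp/counts.txt" -v col=6 '
BEGIN { FS = "\t" }
FILENAME != big {                      # any panel file: record panel -> name
    p = FILENAME; sub(/.*\//, "", p); sub(/\.bed$/, "", p)
    panels[p]; member[p, $1]
    next
}
{                                      # the big file is read exactly once
    name = $5; sub(/-.*/, "", name)    # text before the first hyphen
    for (p in panels)
        if ((p, name) in member) { loc[p, $5] = $4; sum[p, $5] += $col; n[p, $5]++ }
}
END {
    for (k in sum) {
        split(k, part, SUBSEP)         # part[1] = panel, part[2] = $5
        out = outdir "/" part[1] "_coverage.bed"
        printf("%s\t%s\t%.1f\n", loc[k], part[2], sum[k] / n[k]) > out
    }
}
' "$tmp/PCD.bed" "$tmp/BMF.bed" "$tmp/counts.txt"

pcd_out=$(cat "$tmp/PCD_coverage.bed")
bmf_out=$(LC_ALL=C sort "$tmp/BMF_coverage.bed")
printf '%s\n' "$pcd_out" "$bmf_out"
rm -rf "$tmp"
```

With four panels this replaces four passes over each large file with one pass and four array lookups per line. Output order inside each result file is unspecified here, so pipe through sort if order matters.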
# 7  
Old 11-18-2015
@Don Cragun
Quote:
In your code, you are saving $1 in a[] and $2 in b[] and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship occur in all lines in your file? (Saving and printing $4 in an array will be faster than saving $1 in an array, saving $2 in another array, and printing both of them.)
Yes, $4 is always the same as $1:$2.

Quote:
And, you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!
The output should be $4, $5, a "|", $6, and the average.

Quote:
Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?
You are correct that 2 (i.e., (1+2+3)/3) is better, and just one decimal place in the average.

Quote:
Are your search strings always to be exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> character in $5? (Your script will run MUCH faster if you perform one test to determine if a string is a subscript in an array instead of an average of 14-130 regular expression matches.)
Yes, it is the first character in $5 up to the "-" sign (so in AGRN-6|gc=75 it is AGRN).

Thank you :).
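
Putting those answers together, a possible replacement for match.awk might look like the sketch below. It assumes tab-separated input, takes the name as everything before the first "-" in $5, saves $4 instead of $1 and $2, and prints the average with one decimal place. The `col` variable and the demo file names are hypothetical; col=6 is chosen only because it reproduces the (1+2+3)/3 example.

```shell
tmp=$(mktemp -d)
printf '%s\n' ABCA3 ACVRL1 AGRN > "$tmp/panel.bed"
printf 'chr1\t955543\t955763\tchr1:955543\tAGRN-6|gc=75\t%s\t%s\n' 1 0 2 2 3 2 > "$tmp/base_counts.txt"

match_out=$(awk -v col=6 '
BEGIN { FS = "\t" }
FNR == NR { s[$1]; next }                  # panel names from the first file
{
    dash = index($5, "-")                  # position of the first "-" in $5
    if (dash == 0 || !(substr($5, 1, dash - 1) in s)) next
    loc[$5] = $4                           # $4 already holds $1:$2
    sum[$5] += $col
    n[$5]++
}
END {
    for (k in sum)
        printf "%s\t%s\t%.1f\n", loc[k], k, sum[k] / n[k] | "sort -k3,3n"
}
' "$tmp/panel.bed" "$tmp/base_counts.txt")

printf '%s\n' "$match_out"
rm -rf "$tmp"
```

Because $5 is kept whole, the "|" and the gc value are printed as part of it, so no separate $6 handling is needed.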

---------- Post updated at 10:23 AM ---------- Previous update was at 10:01 AM ----------

@RudiC

I'm not sure what you mean, sorry, and thank you :).
