Speed up bash loop?

11-17-2015

I am running the bash loop below on every *base_counts.txt file in a directory. Each of the 4 awk commands reads a panel file (a list of 27 to 259 names) and searches one base_counts file for matches; the file that is searched is 11,137,660 lines. The loop does run, but it takes ~20 hours to complete on a computer with 64GB of RAM and an 8-core Xeon processor. Is this normal, and can it be made faster (more efficient)? Thank you.

Code:

for f in /home/cmccabe/Desktop/HiQ/*base_counts.txt ; do
     bname=$(basename "$f")
     pref=${bname%%.txt}
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PCD_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PCD_coverage.
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/BMF_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_BMF_coverage.bed
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PAH_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PAH_coverage.bed
     awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PID_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PID_coverage.bed
done

The awk script (match.awk):

Code:

BEGIN {
  FS="[ \t|]+"
}
# Read search terms from file1 into 's'
FNR==NR {
    s[$0]
    next
}
{
    # Check if $5 matches one of the search terms
    for(i in s) {
        if($5 ~ i) {

            # Store first two fields for later usage
            a[$5]=$1
            b[$5]=$2

            # Add $8 to total of $8 per $5
            t[$5]+=$8
            # Increment count of occurrences of $5
            c[$5]++

            next
        }
    }
}
END {

    # Calculate average and print output for all search terms
    # that have been found
    for( i in t ) {
        avg = t[i] / c[i]
        printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
    }
}

cmccabe

11-17-2015

Can you post sample files?

Is the $5 string longer than the searched pattern?

ripat

11-17-2015

The search strings (the 27 to 259 names in a file):

Code:

ABCA3
ACVRL1
AGRN

The file to search, matching on $5 (the 11,137,660-line file):

Code:

chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2

So the expected output would be:

Code:

chr1:955543    AGRN-6|gc=75     3

Only $4 and $5 from the lines where a match was found, plus the average of $7, are printed.

Thank you very much

cmccabe

11-17-2015

If $5 can always be split on the hyphen, i.e. AGRN-6|gc=75 reduced to AGRN, this could speed up the process.

To put you on track:

Code:

BEGIN{FS="[\t| -]+"}

FNR==NR {
	s[$0]=1
	next
}

# if $5 is a subscript of s --> do something
# ("in" tests membership without creating a new element in s)
$5 in s {
	# do something
}

If mawk is available on your box, it's usually faster.
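To make the skeleton above concrete, here is a minimal sketch that only counts the matching lines per name. The file names under the temporary directory and the extra XYZ1 line are made up for illustration; the field separator is the one suggested above, so $5 is the text before the hyphen.

```shell
#!/bin/sh
# Hypothetical sample files, modelled on the lines posted in this thread.
tmp=$(mktemp -d)
printf '%s\n' ABCA3 ACVRL1 AGRN > "$tmp/panel.bed"
printf '%s\n' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2' \
  'chr1    970000    970200    chr1:970000    XYZ1-2|gc=50    1    5' \
  > "$tmp/base_counts.txt"

matches=$(awk '
BEGIN     { FS = "[\t| -]+" }    # split on tab, pipe, space and hyphen
FNR == NR { s[$0]; next }        # first file: remember each name
$5 in s   { n[$5]++ }            # one array lookup per line, no regex loop
END       { for (g in n) print g, n[g] }
' "$tmp/panel.bed" "$tmp/base_counts.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

The point is that each of the 11 million lines costs one array lookup, instead of one regular expression match per name in the list.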

ripat

11-17-2015

In your code, you are saving $1 in a[] and $2 in b[] and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship occur in all lines in your file? (Saving and printing $4 in an array will be faster than saving $1 in an array, saving $2 in another array, and printing both of them.) And, you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!

Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?

Are your search strings always to be exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> character in $5? (Your script will run MUCH faster if you perform one test to determine if a string is a subscript in an array instead of an average of 14-130 regular expression matches.)

Don Cragun

11-17-2015

You're reading a 3/4 GB file four times - I don't know if disk I/O buffering will easily cater for that. Why don't you read your four .bed files into four different (multidimensional?) arrays (259 is not too large an array element count), then do your four independent calculations on each large file's input line, and then output to the four different result files?
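A rough sketch of that idea, under several assumptions that are not from the thread: two made-up panels instead of four, a hypothetical np variable giving the number of panel files, output names built as prefix_PANEL_coverage.bed, and the average taken over the sixth whitespace-separated column. Every panel goes into one array keyed by (panel number, name), and the large file is read only once.

```shell
#!/bin/sh
# Hypothetical sample files; the real panels and base_counts files differ.
tmp=$(mktemp -d)
printf '%s\n' ABCA3 AGRN  > "$tmp/PCD_unix_corrected.bed"
printf '%s\n' ACVRL1 AGRN > "$tmp/PAH_unix_corrected.bed"
printf '%s\n' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2' \
  'chr2    100    300    chr2:100    ABCA3-1|gc=40    1    4' \
  > "$tmp/sample_base_counts.txt"

awk -v np=2 -v prefix="$tmp/sample_base_counts" '
FNR == 1 {                            # a new input file starts
    fileno++                          # (assumes no input file is empty)
    name[fileno] = FILENAME
    sub(/.*\//, "", name[fileno])     # drop the directory
    sub(/_.*/, "", name[fileno])      # keep the text before the first "_"
}
fileno <= np { s[fileno, $0]; next }  # the first np files are the panels
{                                     # the large file, read only once
    key = $5
    sub(/-.*/, "", key)               # name = text before the first "-"
    for (p = 1; p <= np; p++)
        if ((p, key) in s) {
            id = p SUBSEP $5
            loc[id] = $4; sum[id] += $6; cnt[id]++
        }
}
END {
    for (id in sum) {
        split(id, part, SUBSEP)       # part[1] = panel number, part[2] = $5
        out = prefix "_" name[part[1]] "_coverage.bed"
        printf "%s\t%s\t%.1f\n", loc[id], part[2], sum[id] / cnt[id] > out
    }
}
' "$tmp/PCD_unix_corrected.bed" "$tmp/PAH_unix_corrected.bed" \
  "$tmp/sample_base_counts.txt"

pcd=$(sort "$tmp/sample_base_counts_PCD_coverage.bed")
pah=$(cat "$tmp/sample_base_counts_PAH_coverage.bed")
printf 'PCD:\n%s\nPAH:\n%s\n' "$pcd" "$pah"
rm -rf "$tmp"
```

The inner loop does at most np array lookups per line, which should be far cheaper than rereading the large file for every panel. Output order within each result file is unspecified here; pipe through sort if it matters.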

RudiC

11-18-2015

@Don Cragun

Quote:

In your code, you are saving $1 in a[] and $2 in b[] and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship occur in all lines in your file? (Saving and printing $4 in an array will be faster than saving $1 in an array, saving $2 in another array, and printing both of them.)

Yes, $4 is always the same as $1:$2.

Quote:

And, you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!

The output should be $4, $5, a "|", $6, and the average.

Quote:

Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?

You are correct in that 2 (i.e., (1+2+3)/3) is better, and just one decimal place in the average.

Quote:

Are your search strings always to be exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> character in $5? (Your script will run MUCH faster if you perform one test to determine if a string is a subscript in an array instead of an average of 14-130 regular expression matches.)

Yes, it is the first character in $5 up to the "-" sign (so in AGRN-6|gc=75 it is AGRN).

Thank you
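Putting these answers together, a possible sketch (not a tested replacement for match.awk) could look like this. It assumes default whitespace splitting, so $5 is the whole AGRN-6|gc=75 string and the averaged column with the values 1, 2, 3 is $6; the file names are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample files built from the lines posted earlier in the thread.
tmp=$(mktemp -d)
printf '%s\n' ABCA3 ACVRL1 AGRN > "$tmp/panel.bed"
printf '%s\n' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2' \
  'chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2' \
  > "$tmp/base_counts.txt"

report=$(awk '
FNR == NR { s[$0]; next }                      # names become array subscripts
{
    dash = index($5, "-")
    if (dash == 0) next                        # no hyphen, nothing to look up
    if (!(substr($5, 1, dash - 1) in s)) next  # single lookup per line
    loc[$5] = $4                               # $4 already holds $1:$2
    sum[$5] += $6                              # assumed column to average
    cnt[$5]++
}
END {
    for (k in sum)
        printf "%s\t%s\t%.1f\n", loc[k], k, sum[k] / cnt[k]
}
' "$tmp/panel.bed" "$tmp/base_counts.txt" | sort -k3,3n)

printf '%s\n' "$report"
rm -rf "$tmp"
```

For the three sample lines this averages 1, 2 and 3 and prints the result with one decimal place. If the last column is the one to average, $6 would become $NF.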

---------- Post updated at 10:23 AM ---------- Previous update was at 10:01 AM ----------

@RudiC

I'm not sure what you mean, sorry. Thank you.

Last edited by cmccabe; 11-18-2015 at 12:22 PM.. Reason: fixed format

cmccabe
