Speed up bash loop?


 
# 8  
Old 11-18-2015
Quote:
Originally Posted by cmccabe
Yes, it is from the first character in $5 up to the "-" sign (so in AGRN-6|gc=75 it is AGRN).
In that case the example that I gave above should put you on track. Split off the AGRN, or whatever part of $5 could possibly be *exactly* matched by the search pattern. This avoids searching for a pattern inside a longer string, which is rather expensive.
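To illustrate the idea, here is a minimal sketch rather than the original poster's script. It assumes the name list holds one gene per line and the data file looks like the sample posted in this thread. The function name match_gene_prefix is made up for this example. It cuts $5 at the first "-" and does an exact array lookup instead of a pattern search.

```shell
# Hypothetical helper: print the data lines whose gene prefix ($5 up to the
# first "-") is exactly one of the names in the list file.
# Usage: match_gene_prefix NAMES_FILE DATA_FILE
match_gene_prefix() {
    awk '
        NR == FNR {                      # first file: the name list
            sub(/[[:space:]]+$/, "")     # drop trailing blanks or carriage returns
            if ($0 != "") names[$0]      # referencing the element creates the key
            next
        }
        {
            split($5, part, "-")         # part[1] is e.g. "AGRN"
            if (part[1] in names) print  # exact match, no regex scan
        }
    ' "$1" "$2"
}
```

Called as match_gene_prefix PAH.bed bigfile (bigfile being whatever the large file is called), this reads each file once. Because the lookup is exact, a name such as AGRN does not also select a longer name that merely starts with AGRN.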
# 9  
Old 11-18-2015
Instead of reading 750 MB, you are reading 3 GB to operate on. With the four input files held in arrays and an extended algorithm, performance might be much better.
If we had some meaningful samples, we could work out a small test script...
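Until real samples are posted, the following is only a rough sketch of the "lists in arrays, one pass" idea. It assumes each list is one name per line and that the gene is the part of $5 before the first "-". The function name tag_by_list and the output layout (list file name, tab, matching line) are invented for illustration.

```shell
# Hypothetical helper: load every name list into memory, then scan the data
# file a single time. Prints: list file name, a tab, and the matching line.
# Usage: tag_by_list DATA_FILE LIST_FILE...
tag_by_list() {
    data_file=$1
    shift
    awk -v data="$data_file" '
        FILENAME != data {                         # still reading a name list
            if (!(FILENAME in seen)) {             # remember the lists in order
                seen[FILENAME]
                order[++n] = FILENAME
            }
            sub(/[[:space:]]+$/, "")               # trim trailing blanks
            if ($0 != "") member[FILENAME, $0]     # key: (list, name)
            next
        }
        {
            split($5, part, "-")                   # part[1] is the gene name
            for (i = 1; i <= n; i++)
                if ((order[i], part[1]) in member)
                    print order[i] "\t" $0
        }
    ' "$@" "$data_file"
}
```

The huge file is read once no matter how many lists are given, so the cost no longer grows with the number of lists. Splitting the tagged output into one file per list could be done inside awk or with a second cheap pass.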
# 10  
Old 11-19-2015
One of the 4 files used is below (all four have a single field and are a list of names):

PAH.bed
Code:
AGRN
CCDC39 
CCDC40 
CFTR
DNAAF1
DNAAF2 
DNAAF3 
DNAH11 
DNAH5 
DNAI1 
DNAI2 
DNAL1 
DYX1C1
HEATR2 
HYDIN 
LRRC6 
NME8 
OFD1
RPGR
RSPH4A 
RSPH9

The file that is searched has 11,137,660 lines in the format below:
Code:
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2

Thank you :).
# 11  
Old 11-19-2015
Part of what is confusing is that you have input files with the filename extension .bed that you show as having a single field such as:
Code:
AGRN
CCDC39 
CCDC40 
CFTR
DNAAF1
...

and you have four output files (three of which have the same filename extension in a completely different format):
Code:
chr1:955543    AGRN-6|gc=75     3

and one other output file has the filename extension .. Why aren't the names of your output files consistent? Why aren't all files with the filename extension .bed in the same format?

And we have an unknown number of files matching the pattern /home/cmccabe/Desktop/HiQ/*base_counts.txt and no indication of what is actually matched by the asterisk. Please give us some actual sample pathnames that this pattern might match.

You have said your input files have more than 11 million lines each and have shown us the 3 line sample:
Code:
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    1    0
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    2    2
chr1    955543    955763    chr1:955543    AGRN-6|gc=75    3    2

and your code accumulates totals based on the string AGRN-6 and prints results assuming that AGRN-6 and AGRN-6|gc=75 select the same set of lines from your huge input files. Please give us a few more lines (some with strings that will be selected for output from the .bed input file and some that won't). And show us the exact output you hope to get in your four output files for that sample input. (Note that this means we need to see four sample .bed input files and four corresponding output files in your sample.)

From your description I am assuming that there could be multiple AGRN-x values in the input but for a given AGRN-x the string following the | will be a constant. I.e., for $5 in your code having the value AGRN-6 the only value for $6 will be gc=75, but there could be an AGRN-otherstring and all AGRN-otherstring entries would have a string something like xyz=somenumber but xyz and somenumber would always be the same for any given AGRN-otherstring. Is this assumption correct?

Adding | as a field separator character seems to be creating unneeded work for you. It would seem that using - as a field separator instead of using | as a field separator would help. Will there ever be more than one - in an input line?
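As a hedged sketch of what that could look like, the function below treats blanks, tabs and "-" all as field separators. It assumes every line has the seven whitespace-separated columns of the posted sample and exactly one "-", so the gene name lands in $5 and there are eight fields in total. The function name gene_and_count is hypothetical, and what the last column means is not something the thread has established.

```shell
# Hypothetical helper: print the gene name and the last column of each line,
# splitting on blanks, tabs and "-" in one step.
# Usage: gene_and_count [FILE...]
gene_and_count() {
    awk -F'[ \t-]+' '
        NF != 8 {                        # an extra or missing "-" shifts the fields
            print "unexpected field count on line " NR > "/dev/stderr"
            next
        }
        { print $5 "\t" $NF }            # $5 is now e.g. "AGRN"
    ' "$@"
}
```

The field-count check matters for exactly the question asked above: a second "-" anywhere in a line would shift the fields, so such lines are reported on standard error and skipped rather than silently miscounted.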
# 12  
Old 11-19-2015
Well, I obviously missed that: there are n *base_counts.txt files, and for each of them you scan four times through the huge file, so 4 * n * 11,137,660 lines are read.
As has been offered before, with some meaningful samples we perhaps could give some decent help.
Please post at least two ???_unix_corrected.bed (partly) , two *base_counts.txt (partly) files, and a representative part of the huge file so we can build a meaningful test scenario.