Help with a bash loop script

07-30-2019

Registered User

3, 0

Join Date: Jul 2019

Last Activity: 11 August 2019, 2:34 AM EDT

Posts: 3

Thanks Given: 2

Thanked 0 Times in 0 Posts


Create a single bash script that does the following:
a. Print out the number of occurrences for each motif that is found in the bacterial genome and output to a file called motif_count.txt
b. Create a FASTA file for each motif (so 3 in total) which contains all of the genes and their corresponding sequences that have that motif. Each file should be named after the motif (e.g. ATG.txt) and written to a new directory called motifs

The motif file is a text file containing these motifs: ATG, GGGGG, ATTTT
The bacterial genome file is a FASTA file with the following lines:

Code:

>gene1
GAAACTCGGTGTTGGCTTACCGGTCATTCCGAGCGTCATTTGGTTTTCGCGTCGTGGCGAAATGTGGTTCTACTACTCGTGGTGTATGCACTATTTATCCGGAATGTTCAGAGCGAGTAGACAATGGGTGCTCCACAATTGTGGCGGTCCCTAAGGGACTCACATATAGTGAGACACGCGTGAAATTCTGCTCACCACGTCCGAATCCGACAAATCATCTACTTCGACGGTA
>gene2
CGGAGATAAAGGACCCATACTGTACGACATTGTATTGCTCACCATGGTCAATCTTTGCGAGTTGTTGCAGCTCGCAGCTTCGTTCTGTCAATATAGCTTAGATACTGAGAAGAAGTTGCAGAGAAAGTCGCA

Moderator's Comments:

edit by bakunin: please use CODE-tags to let data, code and terminal output stand out. Thank you.

Last edited by bakunin; 07-30-2019 at 10:48 AM..

dre


07-30-2019

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

Here we go again analysing FASTA-files. I wonder if we are going to be mentioned in the dozens of biological research papers we have helped to create over time.

Quote:

Originally Posted by dre

Create a single bash script that does the following:

Sorry, but this is not the way that works.

First: you come across like a teacher posing the homework for us. Actually we are all professionals here and have done our homework meticulously which is why we don't have to do more homework any more. In case it was homework given to you by your teacher: there is a special forum for this with special rules in place. Please re-create the thread there and provide the necessary information.

Second, in case this is not homework but actually your work: we do not insist on a lot of social conventions here (after all, we are sysadmins - being somewhat autistic is part of the requirement for this job), but still a well-placed "please" here and there, along with some of the common niceties called "good manners" oils the social machinery. If you ever wonder who writes the answers to your questions: it is not some clever machine and a fat server - it is actually living, breathing persons.

Third and foremost: we are here to help you help yourself. If you want a shell-script - WRITE IT! Attempts, however failing, will count. If you tried and it didn't work, post what you did and the error you got trying to run it. We will, regardless of how long it will take, explain to you what was wrong and how to do it better (actually quite like the way i explain to you right now why you need to change your problem statement) until you understand. We will also point out ways to better solve the problem, different tools, suggest sources to find valuable information and so on and so on. But we will NOT write your code for you. We are a help forum, not your unpaid programming staff.

I hope this helps.

bakunin


07-30-2019

Registered User

2,019, 606

Join Date: Apr 2009

Last Activity: 27 February 2021, 12:15 PM EST

Location: India

Posts: 2,019

Thanks Given: 50

Thanked 606 Times in 567 Posts

1. Could you please show what you have tried?

2. Please understand that forum members may not have a working knowledge of biology. So when you say "Print out the number of occurrences for each motif that is found in the bacterial genome", this makes no sense to me (and probably not to a lot of others either).
3. A good post requesting assistance should, in my humble opinion, contain the following information:

- Clearly state the problem without any ambiguity. Break your problem down into the smallest part where you need help. Posting a question 100 lines long will yield no result: people will yawn and go back to their day job.
- Show your attempt at solving the problem. This helps members focus on the exact place where you need help, instead of having to dig about and guess where you might be facing the issue.
- One line about your OS, your shell, and your preferred scripting language.
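For the last point, a couple of commands are usually enough to collect the details. A minimal sketch, assuming a Linux-like system; the function name `show_env` and the output labels are made up for this example, and `/etc/os-release` is common but not guaranteed to exist:

```shell
#!/bin/bash
# Print the basics that helpers usually ask for.
show_env ()
{
printf 'kernel: %s\n' "$(uname -sr)"                  # kernel name and release
printf 'shell:  %s\n' "${BASH_VERSION:-not bash}"     # bash version, if running under bash
if [ -r /etc/os-release ] ; then                      # distribution name, where available
     grep '^PRETTY_NAME=' /etc/os-release
fi
return 0
}

show_env
```

Pasting the output of such a snippet into the post saves one round of questions.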

balajesuri


07-30-2019


Sorry didn't mean to offend anyone and thanks for the advice.

This is my attempt:

Code:

#!/bin/bash

# create a new directory motifs if it doesn't exist
mkdir -p motifs
cd motifs
touch motif_count.txt
> motif_count.txt
for motif in ATG GGGGG ATTTT
do
   echo $motif >> motif_count.txt
   grep $motif -c r_bifella.fasta >> motif_count.txt
   touch $motif.fasta
   grep $motif -B 1 r_bifella.fasta > $motif.fasta
done
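A note on the attempt above: `grep -c` counts matching lines, not matches, so a sequence line that holds a motif three times still counts as 1. Likewise, `grep -B 1` only picks up the gene name when the sequence sits on a single line directly below its header. A small sketch of the difference, using a made-up sequence and `grep -o`, which prints every non-overlapping match on its own line:

```shell
seq="ATGCCATGAAATG"                                          # made-up sequence, ATG occurs three times
lines=$(( $(printf '%s\n' "$seq" | grep -c 'ATG') ))         # number of matching lines
hits=$(( $(printf '%s\n' "$seq" | grep -o 'ATG' | wc -l) ))  # number of individual matches
printf 'lines=%s hits=%s\n' "$lines" "$hits"
```

Here `lines` is 1 while `hits` is 3, which is why the counts written to motif_count.txt can be lower than expected.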


dre


07-30-2019


Quote:

Originally Posted by dre

Sorry didn't mean to offend anyone and thanks for the advice.

Fair enough, no problem. You are welcome.

Quote:

Originally Posted by dre

This is my attempt

Code:

#!/bin/bash

# create a new directory motifs if it doesn't exist
mkdir -p motif
cd motifs
touch motif_count.txt
> motif_count.txt
for motif in ATG GGGGG ATTTT
do
   echo $motif >> motif_count.txt
   grep $motif -c r_bifella.fasta >> motif_count.txt
   touch $motif.fasta
   grep  $motif -B 1 r_bifella.fasta > $motif.fasta
done

So, that is a start. Let us go over it. Notice that i will address some general points about script development which may not matter much for this script but will help you in the long run.

First, you should make it a habit to declare the variables you use. Not because this is always necessary (unlike in C or other languages, you can "make variables up" on the fly simply by using them) but because it is a good way to start thinking about the algorithm you are going to employ, what information every part of it needs and so on. Furthermore you get some documentation for free.

So, let us start: you want a directory to put your results there and you want to create it. Question: what should happen if the directory already exists, i.e. from a former run of the script? Use it again? Create a new one? Overwrite the files there? Number the files so that results from different runs can exist alongside?

Second: this is a typo:

Code:

#!/bin/bash

# create a new directory motifs if it doesn't exist
mkdir -p motif
cd motifs

Don't worry - typos happen to all of us. But wouldn't it be nice to avoid such typos? Actually you can, by using a variable instead of a fixed name. And, by the way, is it really a good idea to put the directory in the current directory? Wouldn't it be better to create a directory in your HOME, regardless of where you currently are when you call the script? So, how about doing it like this:

Code:

#!/bin/bash

declare targetdir="/home/youruser/mywork/motifs"           # directory for motifs

mkdir -p "$targetdir"
cd "$targetdir"

You see, now you can use "$targetdir" in your script and if you want to change the location you will have to do it only in one place - and it is easy to understand where that is because of the comment! Well written scripts are easy to read and easy to maintain.

Another point: don't use "cd" in a script! Use absolute paths so that the script always works the same way, regardless of where you are standing or from where you call it.
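A minimal sketch of that idea: build every path from a variable and never change directory. The function name `init_targetdir` and the throwaway base directory from `mktemp` are assumptions for illustration only; in a real script the base would be something like "$HOME/mywork".

```shell
#!/bin/bash
# Create the target directory below a given base and print its full path.
init_targetdir ()
{
local base="$1"                      # absolute base directory, e.g. "$HOME/mywork"
local dir="${base}/motifs"

mkdir -p "$dir" || return 1          # stop early if the directory cannot be created
printf '%s\n' "$dir"
}

base="$(mktemp -d)"                  # demo base, so nothing real is touched
targetdir="$(init_targetdir "$base")"
: > "${targetdir}/motif_count.txt"   # written via its full path, no cd needed
ls "$targetdir"
```

Because every file is addressed through "$targetdir", the result is the same whether the script is started from the home directory, from /tmp or from cron.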

Code:

touch motif_count.txt
> motif_count.txt

You actually do not need the first line because the redirection will create the (empty) file if it doesn't exist. Adding the path (instead of the "cd") we get:

Code:

#!/bin/bash

declare targetdir="/home/youruser/mywork/motifs"         # directory for motifs
declare countfile="motif_count.txt"                      # count file
declare motif=""                                         # buffer
declare allmotifs="ATG GGGGG ATTTT"                      # list of motifs to process

mkdir -p "$targetdir"
> "${targetdir}/${countfile}"
for motif in $allmotifs ; do
     :                                                   # .... (loop body still to be written)
done
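The effect of the bare redirection is easy to check. A small sketch on a temporary file (the files created via `mktemp` exist only for this illustration):

```shell
tmpfile="$(mktemp)"
echo "old content" > "$tmpfile"
> "$tmpfile"                        # truncates an existing file
size=$(( $(wc -c < "$tmpfile") ))   # byte count after truncation
echo "size after truncation: $size"

rm -f "${tmpfile}.new"
> "${tmpfile}.new"                  # file did not exist before: it is created empty
ls -l "${tmpfile}.new"
```

So a preceding `touch` adds nothing: the redirection alone leaves an empty file in both cases.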

Let us pause here, it is getting quite late for me. More on the script tomorrow, but you might want to go over the questions above and tell us what you want the script to do. Further, you may want to explain what your script does not do or does wrongly. Finally, a bit more information about your environment: OS, version, .... - might also help because some systems have special provisions others do lack.

I hope this helps.

bakunin


07-31-2019


Quote:

Originally Posted by dre

a. Print out the number of occurrences for each motif that is found in the bacterial genome and output to a file called motif_count.txt

I will continue with this requirement: if i understand you correctly you want to count all the occurrences of the sequence in each gene, like this (the numbers are made up):

Code:

gene1: 37
gene2: 21
...

for all the genes in your bacterial genome file. Is that correct?

If so, here is a shell algorithm which will do that:

Read two lines (?) from the genome file. The first one holds the name of the gene:

Code:

>gene1

The second one holds the gene sequence itself:

Code:

GAAACTCGGTGTTGGCTTACCGGTCATTCCGAGCGTC[....]

We will try to "subtract" (that is: cut out from the string) one occurrence of the pattern we are looking for from the gene. If the string changes, we have found an occurrence: we increase the counter and repeat. Once the string remains unchanged, there are no more occurrences, so we stop and output the final result. You need "parameter expansion" for this, and i suggest you read up on it because it is a versatile addition to your toolkit. I put the code in the form of a function which you can call:

Code:

pGetNumber ()
{
local chGene="$1"                         # content of the gene, first parameter
local chMotif="$2"                         # pattern we look for, second parameter
local iCnt=0                                   # counter

while [ "${chGene/${chMotif}/}" != "${chGene}" ] ; do
     (( iCnt++ ))
     chGene="${chGene/${chMotif}/}"
done

printf "%u\n" $iCnt

return 0
}
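One caveat with cutting matches out of the middle of the string: the two flanks are joined, and that can create a match which was never in the original. In `AATGTG`, removing `ATG` leaves `ATG` again, so the loop above reports 2 for a single occurrence. A sketch of a variant that instead discards everything up to and including each match, so nothing gets spliced together. The name `pCountMotif` is made up here, and like the function above it counts non-overlapping matches:

```shell
pCountMotif ()
{
local chGene="$1"                  # content of the gene, first parameter
local chMotif="$2"                 # pattern we look for, second parameter
local chRest=""                    # what is left behind the first match
local iCnt=0                       # counter

while : ; do
     chRest="${chGene#*"${chMotif}"}"          # drop the shortest prefix ending in the motif
     [ "$chRest" = "$chGene" ] && break        # unchanged: no further match
     iCnt=$(( iCnt + 1 ))
     chGene="$chRest"
done

printf "%u\n" "$iCnt"
return 0
}

pCountMotif "AATGTG" "ATG"
pCountMotif "ATGCCATGAAATG" "ATG"
```

The two calls print 1 and 3. The function takes the same arguments as pGetNumber, so it could be swapped in without touching the rest of a script.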

Notice that to read the genome file correctly we need a few additional bits of information: 1) is the name of the gene always on a line starting with a ">"? 2) the gene's content is in one line in your sample. Is that always so or could that be broken into several lines?

I assume for the moment that 1) is the case and that the answer to 2) is that it is always on one line. Note that the script will break if this is not the case, but it could be easily adapted.
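If the sequence can wrap across several lines, one possible adaptation is to join the lines of each record before counting. A sketch, assuming only that every record starts with a ">" line; the function name `pJoinRecords`, the tab-separated output and the demo file are choices made for this example:

```shell
# Print one line per gene: name, a tab, then the sequence joined into one string.
pJoinRecords ()
{
local chLine=""                    # current input line
local chName=""                    # name of the record being collected
local chSeq=""                     # sequence collected so far

while IFS= read -r chLine || [ -n "$chLine" ] ; do
     case "$chLine" in
          ">"*)                                      # header: flush the previous record
               [ -n "$chName" ] && printf '%s\t%s\n' "$chName" "$chSeq"
               chName="${chLine#>}"
               chName="${chName# }"
               chSeq=""
               ;;
          *)   chSeq="${chSeq}${chLine}" ;;          # sequence line: append
     esac
done < "$1"
[ -n "$chName" ] && printf '%s\t%s\n' "$chName" "$chSeq"   # flush the last record
return 0
}

fDemo="$(mktemp)"                                    # small made-up genome file
printf '>gene1\nGAAACT\nCGGATG\n>gene2\nCGGAGA\n' > "$fDemo"
pJoinRecords "$fDemo"
```

Each output line can then be split into name and sequence again with `IFS=$'\t' read -r chGeneName chGene`, which keeps the counting function unchanged.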

Now let us include that into the script start i showed you already:

Code:

#!/bin/bash

pGetNumber ()
{
local chGene="$1"                         # content of the gene, first parameter
local chMotif="$2"                         # pattern we look for, second parameter
local iCnt=0                                   # counter

while [ "${chGene/${chMotif}/}" != "${chGene}" ] ; do
     (( iCnt++ ))
     chGene="${chGene/${chMotif}/}"
done

printf "%u\n" $iCnt

return 0
}

# ------------------------ main ()
declare targetdir="/home/youruser/mywork/motifs"         # directory for motifs
declare countfile="motif_count.txt"                      # count file
declare chMotif=""                                       # buffer
declare chGene=""                                       # buffer
declare chGeneName=""                          # buffer
declare chAllmotifs="ATG GGGGG ATTTT"                    # list of motifs to process
declare fInput="/path/to/your/genome.file"              # the input file with your genome

mkdir -p "$targetdir"
> "${targetdir}/${countfile}"

while read -r chGeneName ; do
     read -r chGene
     chGeneName="${chGeneName#>}"                      # cut off the leading ">" from the name
     chGeneName="${chGeneName# }"                      # and an optional space after it
     for chMotif in $chAllmotifs ; do
          printf "%20s: %u\n" "$chGeneName" $(pGetNumber "$chGene" "$chMotif") >> "${targetdir}/${countfile}.${chMotif}"
     done
done < "$fInput"

exit 0

More to come, but you should answer the questions i asked.

I hope this helps.

bakunin


08-03-2019



I appreciate very much the guidance and assistance you are giving me on the code. The following are the answers to your questions:
a.

Quote:

So, let us start: you want a directory to put your results there and you want to create it. Question: what should happen if the directory already exists, i.e. from a former run of the script? Use it again?
Create a new one? Overwrite the files there? Number the files so that results from different runs can exist alongside?

If the directory exists I would like the files present to be overwritten.
b.

Quote:

Finally, a bit more information about your environment: OS, version, .... - might also help because some systems have special provisions others do lack.

The current OS I am using is Ubuntu 16.04 LTS
c.

Quote:

1) is the name of the gene always on a line starting with a ">"? 2) the genes content is in one line in your sample. Is that always so or could that be broken into several lines?

The name of the gene is always on a line with a ">" and the gene contents can be broken into several lines.
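With those answers (overwrite existing files, headers always start with ">", sequences may wrap), requirement b could be approached roughly as follows. This is only a sketch under those assumptions: the function name `pWriteMotifFiles`, the demo data and the temporary directory are invented for illustration, and each gene is written with its sequence joined onto one line.

```shell
# Write <dir>/<motif>.txt for every motif, holding each gene whose sequence contains it.
pWriteMotifFiles ()
{
local fInput="$1"                  # genome file (FASTA)
local chDir="$2"                   # output directory
shift 2                            # remaining arguments: the motifs
local chMotif="" chLine="" chName="" chSeq=""

mkdir -p "$chDir" || return 1
for chMotif in "$@" ; do
     : > "${chDir}/${chMotif}.txt"                   # overwrite results of earlier runs
done

# The extra ">" line appended at the end flushes the last record.
{ cat "$fInput" ; printf '\n>\n' ; } | while IFS= read -r chLine ; do
     case "$chLine" in
          ">"*)
               if [ -n "$chName" ] ; then
                    for chMotif in "$@" ; do
                         case "$chSeq" in
                              *"$chMotif"*) printf '%s\n%s\n' "$chName" "$chSeq" >> "${chDir}/${chMotif}.txt" ;;
                         esac
                    done
               fi
               chName="$chLine" ; chSeq="" ;;
          *)   chSeq="${chSeq}${chLine}" ;;          # join wrapped sequence lines
     esac
done
}

fDemo="$(mktemp)" ; chOut="$(mktemp -d)/motifs"      # made-up input and output locations
printf '>gene1\nGAAACT\nCGGATG\n>gene2\nCGGAGA\n' > "$fDemo"
pWriteMotifFiles "$fDemo" "$chOut" ATG GGGGG ATTTT
cat "${chOut}/ATG.txt"
```

In the demo only gene1 contains ATG, so ATG.txt holds that one record and the other two files stay empty. Joining the lines first also means a motif that straddles a line break is still found.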

dre
