Duplication | awk | result


 
# 1  
Old 06-02-2019

Dear forum members,

I want the script to count ALA once per residue number (see the example data below) and return 1 for it, rather than 5 as it does now (see the bash script below). How can I change the script so that, after finding all instances of ALA, it checks whether they are associated with the same number? In the example, 56 (r<56>) is associated with ALA five times; other numbers, such as 13, 99 or 88, can likewise appear five times each with ALA. In general, I want the script to count how many times ALA occurs and return that count as an integer, but repeated occurrences with the same number should be counted only once.

Example data, in which each residue number should be counted as one:

Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

Here is the code:

Code:
#!/bin/bash
read -p "amino acid: " AAA
if [ "$AAA" == "ALA" ] || [ "$AAA" == "VAL" ] || [ "$AAA" == "ARG" ] || [ "$AAA" == "ASN" ] || [ "$AAA" == "ASP" ] || 
   [ "$AAA" == "CYS" ] || [ "$AAA" == "GLY" ] || [ "$AAA" == "ILE" ] || [ "$AAA" == "LEU" ] || [ "$AAA" == "LYS" ] ||
   [ "$AAA" == "MET" ] || [ "$AAA" == "PHE" ] || [ "$AAA" == "PRO" ] || [ "$AAA" == "SER" ] || [ "$AAA" == "THR" ] || 
   [ "$AAA" == "TRP" ] || [ "$AAA" == "TYR" ] || [ "$AAA" == "HIS" ] || [ "$AAA" == "GLN" ] || [ "$AAA" == "GLU" ]
then 
#for AAA in ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
#		LEU LYS MET PHE PRO SER THR TRP TYR VAl; do
	for i in HS_*.pdb; do
		cat $i | voronota get-balls-from-atoms-file --annotated | \
                grep -o -i "$AAA" | wc -l | awk '{print $1}' 
#    	       | voronota calculate-contacts --annotated \
#    	       | voronota query-contacts --inter-residue --match-first "R<$AAA>" --match-second "R<A,DC,DG,DT>" \
    done
else
	exit 1
fi

I look forward to your responses.

Sincerely,
Aurimas

Last edited by Scrutinizer; 06-06-2019 at 12:24 PM.. Reason: quote tags -> code tags
# 2  
Old 06-02-2019
I'm not sure I understand what you're after. Is it the count of distinct occurrences of the $AAA value combined with that r<...> "number associated with ALA" (the $AAA value)? Where do the 88 "-" and the 99 "-" come into play? What would the result be for the sample code and data above?
# 3  
Old 06-02-2019
The output of the given script is 10. When the script is run, it first asks for an amino acid (in this case ALA), which becomes the value of AAA that the script works with. AAA can be any of the values listed in the if statement (ALA, ARG, and so on).

Let's take the quote:

Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

This output (the data quoted above) is produced by the command
Code:
 voronota get-balls-from-atoms-file --annotated

and is then piped to
Code:
 grep -o -i "$AAA" | wc -l | awk '{print $1}'

which in turn prints the integer 10, the number of times ALA occurs in the dataset (see the quote above).

However, when the output from
Code:
 voronota get-balls-from-atoms-file --annotated

is piped to
Code:
 grep -o -i "$AAA" | wc -l | awk '{print $1}'

I want to get the numerical value 2 instead.

Why 2? If you look at the quote above, the first 5 lines all contain ALA (in the form ...r<>a..R<ALA>...), but these ALA entries all carry the same number (they belong to 56):
Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99

The same reasoning applies to ALA and number 88 (quoted below), which also occurs 5 times with ALA. So the total number of occurrences of ALA (in the quote above) would be 10, but I want it to be 2 (as ALA should count once for each number, 56 and 88). ALA (or any of the 20 possible AAA values) is constant in this case, while the numbers change and can take any integer from 1 to 1000.
Code:
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

Quote:
Is it the count of distinct occurrences of the $AAA value combined with that r<...> "number associated with ALA" (the $AAA value)?
I want the count of AAA printed to the terminal to be 2 (for the quote above, which has 56 and 88) and not 10 when the script is run. If a line with r<56> is repeated 5 times (as in the quote above), AAA also appears 5 times (in this case ...R<ALA>...), but I want all occurrences of ALA at 56 to count as 1. All occurrences of a specific AAA at 88 should likewise count as 1, and so on. As for your question about 88 and 99: these numbers are arbitrary and can change; they are just different numbers, like 56.

I hope it is clearer now :-) If not, let me know and I'll try to rephrase everything.
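
The requirement described above amounts to counting distinct residue numbers rather than matching lines. A minimal sketch of that idea with the tools the original script already uses (grep, wc) plus sort -u; the sample is a trimmed copy of the data in this thread, and anchoring the match on "R<$AAA>" and on the leading "c<...>r<...>" part is an assumption about the annotated format, not something confirmed by the voronota documentation:

```shell
#!/bin/bash
# Sample lines copied from the thread (two atoms each for residues 56 and 88).
sample='c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38'

AAA=ALA

# Per-atom count: one hit for every line that names the residue.
atoms=$(printf '%s\n' "$sample" | grep -c "R<$AAA>")

# Per-residue count: keep only the chain and residue number of the matching
# lines, collapse duplicates with sort -u, then count what is left.
residues=$(printf '%s\n' "$sample" | grep "R<$AAA>" |
    grep -o '^c<[^>]*>r<[^>]*>' | sort -u | wc -l)

# $((...)) strips the padding that some wc implementations add.
echo "atoms: $atoms, residues: $((residues))"
```

On this trimmed sample the per-atom count is 4 and the per-residue count is 2.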

Last edited by Scrutinizer; 06-06-2019 at 12:36 PM..
# 4  
Old 06-02-2019
OK, that's what I inferred. How about (the first line will work with a recent shell only)
Code:
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                  awk -F"[<>]" -vSRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi
2
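
The one-liner above combines two awk idioms: splitting each line on "<" and ">" so that the residue number lands in $4, and the !OCC[key]++ test, which is true only the first time a key is seen. A small sketch of both pieces in isolation; the variable names line, fields and firsts are illustrative only:

```shell
#!/bin/bash
line='c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14'

# With "<" and ">" as separators the line splits into c, B, r, 56, a, 2643, R, ALA, ...
# so $2 is the chain, $4 the residue number and $8 the residue name.
fields=$(printf '%s\n' "$line" | awk -F'[<>]' '{ print $2, $4, $8 }')
echo "$fields"

# !OCC[$1]++ tests the array entry (0 on first sight) and then increments it,
# so only the first line for each distinct value passes the filter.
firsts=$(printf '56\n56\n88\n56\n88\n' | awk '!OCC[$1]++' | tr '\n' ' ')
echo "$firsts"
```

The first command prints the chain, residue number and residue name (B 56 ALA); the second keeps one line per distinct value (56 and 88).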

# 5  
Old 06-02-2019
When I run your proposed code

Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                  awk -F"[<>]" -vSRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi

in the macOS Mojave terminal, I get this error:
Code:
awk: invalid -v option


Last edited by Scrutinizer; 06-06-2019 at 12:37 PM.. Reason: quote tags -> code tags
# 6  
Old 06-02-2019
The Mac OS man page (your friend, btw) for awk says it knows the -v option:
Quote:
-v var=value Assign values before prog is executed
Try a space between -v and the assignment?
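
A minimal sketch of that suggestion: -v and the assignment are passed as two separate arguments, which is the form shown in the man page excerpt quoted above. Whether the attached form (-vSRCH=...) is accepted appears to depend on the awk implementation, so the separated form is the safer choice. The message text is illustrative only:

```shell
#!/bin/bash
AAA=ALA

# "-v" and "SRCH=..." are two separate arguments here.
out=$(awk -v SRCH="$AAA" 'BEGIN { print "searching for " SRCH }')
echo "$out"
```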
# 7  
Old 06-02-2019
Thank you, it works. However, the output is now 1 in every case, which is incorrect. I need to count AAA instances for 28 complexes: the 1st complex should give 8, as ALA occurs 8 times (40 if each ALA is counted 5 times, once per line with its number, as with 56 in our example); the 2nd should give 8 as well, the 3rd 8, the 4th 9, and so on. When AAA is ALA, the count can take any value from 2 (10 lines, 2 x 5) to 37. Should I use cat $i | with some grep filter before passing the data to awk?
The output I get now is:
Quote:
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
How can I fix this so that I get:
Quote:
8
8
8
9
18
15
9
14
19
7
14
11
8
11
18
11
10
19
34
5
2
12
7
16
7
4
29
3
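
A possible explanation for the identical results, offered as a sketch rather than a confirmed diagnosis: the awk command in the loop is given $i as a file argument, and awk reads named files instead of standard input, so the voronota output arriving through the pipe would be ignored. If the raw .pdb file contains no "<" or ">", $4 is empty on every line and only one distinct key is ever counted. The block below reproduces this with a made-up stand-in file (its ATOM lines are hypothetical, not data from this thread) and shows a variant without the file argument. The helper name count_distinct, the exact comparison on $8, and the chain-plus-number key are assumptions, not part of the code suggested earlier.

```shell
#!/bin/bash
# Annotated lines copied from the thread (two atoms each for residues 56 and 88).
annotated='c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38'

# Hypothetical stand-in for a raw input file: it has no "<" or ">" characters.
raw=$(mktemp)
printf 'ATOM 1 N ALA B 56\nATOM 6 N ALA B 88\n' > "$raw"

# As posted (with a space after -v): the file argument wins and the pipe is
# ignored; $4 is empty on every line, so only one distinct key is counted.
with_file=$(printf '%s\n' "$annotated" |
    awk -F'[<>]' -v SRCH=ALA '$0 ~ SRCH && !OCC[$4]++ { CNT++ } END { print CNT + 0 }' "$raw")

# Variant that reads the pipe: no file argument, exact match on the residue
# name ($8), and chain ($2) plus residue number ($4) as the uniqueness key.
count_distinct() {
    awk -F'[<>]' -v SRCH="$1" '$8 == SRCH && !OCC[$2, $4]++ { CNT++ } END { print CNT + 0 }'
}
from_pipe=$(printf '%s\n' "$annotated" | count_distinct ALA)

echo "with file argument: $with_file, from the pipe: $from_pipe"
rm -f "$raw"
```

In the loop from this thread, the corresponding line would read voronota get-balls-from-atoms-file --annotated < "$i" | count_distinct "$AAA"; that form has not been run against real HS_*.pdb files here.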