Duplication | awk | result

06-02-2019

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Show the input to awk, i.e. the output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file. Does it have the structure you gave in your input data samples?

RudiC


06-02-2019

Registered User

27, 2

Join Date: May 2019

Last Activity: 8 July 2019, 6:22 AM EDT

Posts: 27

Thanks Given: 5

Thanked 2 Times in 1 Post

The output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file (i.e. the input to awk in the script you gave) looks like this (a sample, not the full data set):

Code:

c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<4>a<2304>R<ARG>A<C> 8.15 9.068 -25.176 1.7 el=C oc=1;tf=24.84
c<B>r<4>a<2305>R<ARG>A<O> 8.73 8.531 -24.211 1.52 el=O oc=1;tf=28.35
c<B>r<4>a<2306>R<ARG>A<CB> 5.934 8.193 -25.972 1.7 el=C oc=1;tf=22.92
c<B>r<4>a<2307>R<ARG>A<CG> 6.32 6.769 -25.634 1.7 el=C oc=1;tf=24.7
c<B>r<4>a<2308>R<ARG>A<CD> 5.685 5.754 -26.618 1.7 el=C oc=1;tf=25.75
c<B>r<4>a<2309>R<ARG>A<NE> 6.077 4.394 -26.252 1.55 el=N oc=1;tf=26.94
c<B>r<4>a<2310>R<ARG>A<CZ> 5.167 3.427 -26.057 1.7 el=C oc=1;tf=26.16
c<B>r<4>a<2311>R<ARG>A<NH1> 3.872 3.686 -26.225 1.55 el=N oc=1;tf=27.03
c<B>r<4>a<2312>R<ARG>A<NH2> 5.545 2.228 -25.597 1.55 el=N oc=1;tf=25.32
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<5>a<2315>R<SER>A<C> 11.054 10.197 -25.23 1.7 el=C oc=1;tf=20.28
c<B>r<5>a<2316>R<SER>A<O> 12.051 9.828 -24.609 1.52 el=O oc=1;tf=17.58
c<B>r<5>a<2317>R<SER>A<CB> 10.87 9.554 -27.656 1.7 el=C oc=1;tf=21.53
c<B>r<5>a<2318>R<SER>A<OG> 10.523 10.853 -28.157 1.52 el=O oc=1;tf=18.21
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<6>a<2320>R<ASP>A<CA> 11.025 12.347 -24.083 1.7 el=C oc=1;tf=25.02
c<B>r<6>a<2321>R<ASP>A<C> 10.878 11.83 -22.661 1.7 el=C oc=1;tf=27.57
c<B>r<6>a<2322>R<ASP>A<O> 11.874 11.789 -21.917 1.52 el=O oc=1;tf=27.98
c<B>r<6>a<2323>R<ASP>A<CB> 10.289 13.663 -24.272 1.7 el=C oc=1;tf=26.54
c<B>r<6>a<2324>R<ASP>A<CG> 10.869 14.511 -25.413 1.7 el=C oc=1;tf=28.35
c<B>r<6>a<2325>R<ASP>A<OD1> 11.913 14.194 -25.989 1.52 el=O oc=1;tf=30.42
c<B>r<6>a<2326>R<ASP>A<OD2> 10.264 15.523 -25.731 1.52 el=O oc=1;tf=29.8

However, I currently pipe this through grep -o -i "$AAA" | wc -l, whose result is a string that is later converted to an integer; if that conversion can be avoided, that would be great.

I need to extract the count of amino acids (ARG, SER and ASP in this case). Maybe that is possible with the script you showed before?

I hope this is what you asked for.

Last edited by Scrutinizer; 06-06-2019 at 12:38 PM.. Reason: quote tags -> code tags and icode tags

Aurimas


06-03-2019


Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...

Please be aware that everybody in here can only see (and work with) what you explicitly (!) write / post. Don't assume ANY knowledge of the topic (genetics / biology, in this case) that would allow inferring non-given background info from allusions in the text. Describe the (data) problem as thoroughly as possible, backed by representative, consistent, and as broad as possible input and desired output samples.
Don't "change horses" between posts - does your sample in post #9 lead to the "false" output in post #7? Or, where should the

Code:

8
8
8
9 ...

result come from?

Last edited by RudiC; 06-03-2019 at 06:27 AM..

RudiC


06-03-2019


Quote:

Originally Posted by RudiC

Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...

As I said in earlier posts, AAA can be any of the 20 amino acids (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL). My first example used ALA; the latest one contains ARG, SER and ASP. What matters is that, for whichever amino acid is chosen as AAA, I need its count without duplication.

To make this clearer with the recent example: I need to get only 1 ARG from r<4>, even though that numbering appears on 11 lines. Likewise, SER and ASP have to count as 1 each, even though r<5> is repeated 6 times and r<6> 8 times respectively. In the full data file these AAA occurrences are repeated with different integer r<x> values, where x ranges from 1 to 1000.
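The counting rule described here (one hit per distinct r<x>, however many atom lines it spans) could be sketched in awk roughly as follows. This is only an illustration: count_residues is a made-up helper name, and it assumes the line layout of the posted sample, so that splitting on < and > puts the chain in field 2, the residue number in field 4 and the residue name in field 8.

```shell
#!/bin/bash
# Sketch only: count_residues is a hypothetical helper, not part of voronota.
# It reads voronota-style annotated lines on stdin and prints how many
# distinct residues (chain + residue number) carry the given amino acid name.
count_residues() {
    awk -F'[<>]' -v aa="$1" '
        # Split on < and >: $2 = chain, $4 = residue number, $8 = residue name
        $8 == aa && !seen[$2, $4]++ { n++ }
        END { print n + 0 }
    '
}

# A few lines taken from the sample posted earlier in the thread
sample='c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<6>a<2320>R<ASP>A<CA> 11.025 12.347 -24.083 1.7 el=C oc=1;tf=25.02'

# Print one count per amino acid; ALA is absent from this sample
for aa in ARG SER ASP ALA; do
    printf '%s %s\n' "$aa" "$(printf '%s\n' "$sample" | count_residues "$aa")"
done
```

Keying on chain plus residue number, rather than the residue number alone, is a precaution in case two chains reuse the same numbering; the thread itself only talks about r<x>.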

Last edited by Aurimas; 06-03-2019 at 07:26 AM..

Aurimas


06-03-2019


Understood. I'll try to explain my situation in as much detail as possible. The script I have now is:

Code:

#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
              grep -o -i "$AAA" | wc -l | awk '{print $1}'
          done
  else  echo exit 1
fi

It is run in Terminal on macOS Mojave. When I type ./BSA (the name of the script), it asks me to enter the amino acid, a capitalised three-letter code (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL), at the prompt amino acid: and the entry becomes the value of AAA in the script.
For this case, let's enter ALA, which becomes $AAA in the script, so in the terminal it looks like this: amino acid: ALA
Then I press Enter and get this output:

Code:

The output from the script above is the number of ALA occurrences per .pdb complex. There are 28 .pdb files/complexes in total, which is why the output has 28 lines. However, that is not the ALA count per complex that I want. The output I expect should be something like this:

Code:

Here the ALA values are counted without duplication. To better understand how to achieve this, let's look at a shortened example of the output for the first .pdb file (complex) from the command voronota get-balls-from-atoms-file --annotated, which includes 40 ALA lines:

Code:

c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<10>a<2353>R<ALA>A<C> 13.899 10.389 -17.757 1.7 el=C oc=1;tf=29.99
c<B>r<10>a<2354>R<ALA>A<O> 14.653 10.538 -16.788 1.52 el=O oc=1;tf=30.44
c<B>r<10>a<2355>R<ALA>A<CB> 12.686 12.323 -18.679 1.7 el=C oc=1;tf=26.9
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<26>a<2425>R<ALA>A<C> 13.08 20.496 8.431 1.7 el=C oc=1;tf=37.27
c<B>r<26>a<2426>R<ALA>A<O> 13.742 21.478 8.087 1.52 el=O oc=1;tf=39.36
c<B>r<26>a<2427>R<ALA>A<CB> 10.716 20.815 7.844 1.7 el=C oc=1;tf=34.56
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6
c<B>r<130>a<3187>R<ALA>A<N> -4.398 5.962 -24.62 1.55 el=N oc=1;tf=22.4
c<B>r<130>a<3188>R<ALA>A<CA> -3.225 5.141 -24.341 1.7 el=C oc=1;tf=20.7
c<B>r<130>a<3189>R<ALA>A<C> -3.17 4.921 -22.854 1.7 el=C oc=1;tf=19.83
c<B>r<130>a<3190>R<ALA>A<O> -3.725 5.716 -22.066 1.52 el=O oc=1;tf=17.31
c<B>r<130>a<3191>R<ALA>A<CB> -1.913 5.797 -24.7 1.7 el=C oc=1;tf=22.82
c<B>r<177>a<3516>R<ALA>A<N> 0.656 -7.277 -20.93 1.55 el=N oc=1;tf=19.87
c<B>r<177>a<3517>R<ALA>A<CA> -0.367 -8.059 -20.25 1.7 el=C oc=1;tf=19.38
c<B>r<177>a<3518>R<ALA>A<C> -0.263 -9.541 -20.59 1.7 el=C oc=1;tf=20.35
c<B>r<177>a<3519>R<ALA>A<O> 0.029 -9.962 -21.72 1.52 el=O oc=1;tf=19.92
c<B>r<177>a<3520>R<ALA>A<CB> -1.747 -7.592 -20.659 1.7 el=C oc=1;tf=15.99
c<B>r<181>a<3541>R<ALA>A<N> -4.381 -14.273 -14.076 1.55 el=N oc=1;tf=16.9
c<B>r<181>a<3542>R<ALA>A<CA> -4.649 -13.158 -13.194 1.7 el=C oc=1;tf=16.14
c<B>r<181>a<3543>R<ALA>A<C> -3.446 -12.893 -12.306 1.7 el=C oc=1;tf=18.15
c<B>r<181>a<3544>R<ALA>A<O> -2.692 -13.819 -12.014 1.52 el=O oc=1;tf=20.6
c<B>r<181>a<3545>R<ALA>A<CB> -5.817 -13.463 -12.335 1.7 el=C oc=1;tf=15.23
c<B>r<194>a<3626>R<ALA>A<N> 8.308 -12.434 -17.665 1.55 el=N oc=1;tf=29.11
c<B>r<194>a<3627>R<ALA>A<CA> 9.387 -12.364 -18.631 1.7 el=C oc=1;tf=28.89
c<B>r<194>a<3628>R<ALA>A<C> 10.604 -11.653 -18.089 1.7 el=C oc=1;tf=31.02
c<B>r<194>a<3629>R<ALA>A<O> 10.592 -11.177 -16.949 1.52 el=O oc=1;tf=31.88
c<B>r<194>a<3630>R<ALA>A<CB> 8.92 -11.616 -19.844 1.7 el=C oc=1;tf=25.66

As you can see from the voronota output example above, there are 40 lines containing ALA, so the output I am getting from the first script is 40. The problem is that there are only 8 distinct ALA residues. Each one is repeated on 5 lines: r<10> appears 5 times, as do r<26>, r<56> and so on (see the sample above). I want those 5 lines of r<10> to be counted as 1 ALA, the 5 lines of r<26> as 1 ALA, etc., and then all of these added together to give 8 ALA for the first .pdb file instead of 40.

Please also note that here each ALA spans 5 lines with the same r<x>, where x is a number from 1 to 1000, but the number of lines per residue can vary: it may be 2, 3, 4, 6, 7, 8 or more. Whatever the number of lines sharing one r<x>, I need them counted as 1 ALA.
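For comparison with the grep | wc -l approach, the difference between counting lines and counting residues can be shown with standard tools alone. The sketch below is just one possible way: it keeps only the part of each matching line before the a<...> atom field (the chain and r<x>), removes duplicates with sort -u, and counts what is left. It uses five lines from the sample above, so the counts are for this excerpt only.

```shell
#!/bin/bash
# Five ALA lines from the sample: two from r<10>, two from r<26>, one from r<56>
sample='c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77'

# Counting matching lines gives one hit per atom line
line_count=$(printf '%s\n' "$sample" | grep -c 'R<ALA>')

# Cutting each line back to "c<B>r<10>" and de-duplicating gives one hit per residue
# (tr strips the padding that wc -l prints on some systems)
residue_count=$(printf '%s\n' "$sample" | grep 'R<ALA>' | sed 's/a<.*//' | sort -u | wc -l | tr -d ' ')

echo "lines: $line_count, residues: $residue_count"
```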

The script was then changed to:

Code:

#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi

However the output I get now is this:

Code:

The problem here is that, for every .pdb file analysed, all occurrences of ALA are counted as 1 per complex (HS_some_complex.pdb): 1 ALA for all 40 lines of r<10>, r<26>, r<56>, r<88>, etc. in the first .pdb file, and likewise for the other 27 .pdb complexes. That's not what I need. I want ALA to be counted as occurring 8 times for the first complex, as explained above, and not the 40 I get from my first script. So how can this be done? Should I change the grep command, the awk command, or both?

I hope it is clearer now, but do let me know if anything is still unclear.

Last edited by Scrutinizer; 06-06-2019 at 12:40 PM.. Reason: quote tags -> code tags

Aurimas


06-03-2019


Hmmm, I think I found a logical error in my proposal: adding the $i after the awk script made it immediately read the respective .pdb file, not voronota's output from that file. Remove the $i:

Code:

               cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}'

and report back.
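To illustrate why the trailing file name matters: when awk is given a file operand, it reads that file and ignores whatever arrives on the pipe. The sketch below uses a made-up stand-in for a .pdb file (three invented lines without any < or > characters, which is an assumption about what the raw files look like). With no < or > on a line, $4 is empty for every record, so all matching lines share one key and the count comes out as 1, which would match the behaviour reported above.

```shell
#!/bin/bash
# Made-up stand-in for a raw .pdb file: lines mention ALA but contain no < or >
pdb=$(mktemp "${TMPDIR:-/tmp}/pdbdemo.XXXXXX")
printf 'ATOM 1 N ALA B 10\nATOM 2 CA ALA B 10\nATOM 3 N ALA B 26\n' > "$pdb"

# Three annotated lines from the sample: two for r<10>, one for r<26>
piped='c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06'

script='$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}'

# With a file operand, awk reads the file and never looks at the pipe
with_file=$(printf '%s\n' "$piped" | awk -F"[<>]" -v SRCH="ALA" "$script" "$pdb")

# Without it, awk reads the piped lines and counts distinct r<x> values
without_file=$(printf '%s\n' "$piped" | awk -F"[<>]" -v SRCH="ALA" "$script")

echo "with file operand: $with_file, reading the pipe: $without_file"
rm -f "$pdb"
```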

Still, I'm convinced there will be a more apt / better solution to the overall problem, dealing with ALL .pdb files and ALL amino acids in one go if needed...
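One possible shape for such an "all amino acids in one go" approach, offered only as a sketch: tally_residues is a hypothetical helper that counts each distinct chain / residue number / residue name combination once and prints a per-amino-acid total. It assumes the same field layout as the posted samples.

```shell
#!/bin/bash
# Sketch only: tally_residues is a hypothetical helper, not part of voronota.
# Reads annotated lines on stdin and prints "NAME COUNT" per amino acid,
# counting each distinct residue (chain, number, name) once.
tally_residues() {
    awk -F'[<>]' '
        # $2 = chain, $4 = residue number, $8 = residue name
        NF >= 8 && !seen[$2, $4, $8]++ { count[$8]++ }
        END { for (aa in count) print aa, count[aa] }
    ' | sort    # awk's for-in order is unspecified, so sort for stable output
}

# Lines from the sample posted earlier in the thread
sample='c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<6>a<2320>R<ASP>A<CA> 11.025 12.347 -24.083 1.7 el=C oc=1;tf=25.02'

result=$(printf '%s\n' "$sample" | tally_residues)
echo "$result"

# Possible use over all files, mirroring the loop in the posted script:
#   for f in HS_*.pdb; do
#       echo "== $f"
#       voronota get-balls-from-atoms-file --annotated < "$f" | tally_residues
#   done
```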

And, please use CODE, not ICODE, tags for data as well. You may want to edit your former post.

RudiC


06-03-2019


Thank you, it works, and I have edited my previous post. I also have a similar question about this code:

Code:

#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then
	for i in qac_"$AAA"_HS_*.pdb.txt
	do
		cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
	done
else
	exit 1
fi

A .pdb.txt file looks like this:

Code:

c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .

There are 7 lines with r<x> (where x is a number from 1 to 100) and therefore 7 occurrences of ALA. How could I replace grep with awk in the code so that it counts 4 ALA instead of 7?
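A possible awk replacement for this file layout, as a sketch: here each line holds two whole residue tokens separated by spaces, so the default field splitting is enough and each token such as c<A>r<134>R<ALA> can serve directly as the de-duplication key. count_contact_residues is a made-up name, and checking both of the first two columns is an assumption, since the sample only shows ALA in the first one.

```shell
#!/bin/bash
# Sketch only: count_contact_residues is a hypothetical helper.
# Counts distinct residue tokens ending in R<AAA> in the first two columns.
count_contact_residues() {
    awk -v aa="$1" '
        BEGIN { pat = "R<" aa ">$" }
        {
            # A token like c<A>r<134>R<ALA> identifies one residue
            for (i = 1; i <= 2; i++)
                if ($i ~ pat && !seen[$i]++) n++
        }
        END { print n + 0 }
    '
}

# The seven lines posted above
sample='c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .'

ala_count=$(printf '%s\n' "$sample" | count_contact_residues ALA)
echo "$ala_count"

# Possible use in the posted loop:
#   for i in qac_"$AAA"_HS_*.pdb.txt; do count_contact_residues "$AAA" < "$i"; done
```

Unlike grep -i, this match is case-sensitive, which should be fine as long as the amino acid is entered in capitals as described earlier.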

Last edited by Scrutinizer; 06-06-2019 at 12:23 PM.. Reason: quote tags -> code tags

Aurimas


UNIX for Beginners Questions & Answers
