Duplication | awk | result


# 8  
Show the input to awk, i.e. the output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file. Does it have the structure you gave in your input data samples?
# 9  
The output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file, i.e. the input to awk in the script you gave, looks like this (a sample, not the full dataset):

Code:
c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<4>a<2304>R<ARG>A<C> 8.15 9.068 -25.176 1.7 el=C oc=1;tf=24.84
c<B>r<4>a<2305>R<ARG>A<O> 8.73 8.531 -24.211 1.52 el=O oc=1;tf=28.35
c<B>r<4>a<2306>R<ARG>A<CB> 5.934 8.193 -25.972 1.7 el=C oc=1;tf=22.92
c<B>r<4>a<2307>R<ARG>A<CG> 6.32 6.769 -25.634 1.7 el=C oc=1;tf=24.7
c<B>r<4>a<2308>R<ARG>A<CD> 5.685 5.754 -26.618 1.7 el=C oc=1;tf=25.75
c<B>r<4>a<2309>R<ARG>A<NE> 6.077 4.394 -26.252 1.55 el=N oc=1;tf=26.94
c<B>r<4>a<2310>R<ARG>A<CZ> 5.167 3.427 -26.057 1.7 el=C oc=1;tf=26.16
c<B>r<4>a<2311>R<ARG>A<NH1> 3.872 3.686 -26.225 1.55 el=N oc=1;tf=27.03
c<B>r<4>a<2312>R<ARG>A<NH2> 5.545 2.228 -25.597 1.55 el=N oc=1;tf=25.32
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<5>a<2315>R<SER>A<C> 11.054 10.197 -25.23 1.7 el=C oc=1;tf=20.28
c<B>r<5>a<2316>R<SER>A<O> 12.051 9.828 -24.609 1.52 el=O oc=1;tf=17.58
c<B>r<5>a<2317>R<SER>A<CB> 10.87 9.554 -27.656 1.7 el=C oc=1;tf=21.53
c<B>r<5>a<2318>R<SER>A<OG> 10.523 10.853 -28.157 1.52 el=O oc=1;tf=18.21
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<6>a<2320>R<ASP>A<CA> 11.025 12.347 -24.083 1.7 el=C oc=1;tf=25.02
c<B>r<6>a<2321>R<ASP>A<C> 10.878 11.83 -22.661 1.7 el=C oc=1;tf=27.57
c<B>r<6>a<2322>R<ASP>A<O> 11.874 11.789 -21.917 1.52 el=O oc=1;tf=27.98
c<B>r<6>a<2323>R<ASP>A<CB> 10.289 13.663 -24.272 1.7 el=C oc=1;tf=26.54
c<B>r<6>a<2324>R<ASP>A<CG> 10.869 14.511 -25.413 1.7 el=C oc=1;tf=28.35
c<B>r<6>a<2325>R<ASP>A<OD1> 11.913 14.194 -25.989 1.52 el=O oc=1;tf=30.42
c<B>r<6>a<2326>R<ASP>A<OD2> 10.264 15.523 -25.731 1.52 el=O oc=1;tf=29.8

At the moment I pipe this through grep -o -i "$AAA" | wc -l into awk, so the count arrives as a string that is later converted to an integer. If that step can be avoided, that would be great.

I need to extract the count of each amino acid (ARG, SER and ASP in this case). Perhaps that is possible with the script you showed before?

I hope this is what you asked for.

Last edited by Scrutinizer; 1 Week Ago at 12:38 PM.. Reason: quote tags -> code tags and icode tags
# 10  
Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...


Please be aware that everybody in here can only see (and work with) what you explicitly (!) write / post. Don't assume ANY knowledge of the topic (genetics / biology, in this case) that would allow readers to infer background info from allusions in the text. Describe the (data) problem as thoroughly as possible, backed by representative, consistent, and as broad as possible input and desired output samples.
Don't "change horses" between posts - does your sample in post #9 lead to the "false" output in post #7? Or, where should the

Code:
8
8
8
9 ...

result come from?

Last edited by RudiC; 2 Weeks Ago at 06:27 AM..
# 11  
Quote:
Originally Posted by RudiC
Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...
As I said in earlier posts, AAA can be any of the 20 amino acids (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL). My first example used ALA; this one uses ARG, SER and ASP. The important point is that, for whichever amino acid is chosen as AAA, I need its count without duplication. In the recent example, ARG should count as 1 even though r<4> appears on 11 lines. SER and ASP should also count as 1 each, even though r<5> appears on 6 lines and r<6> on 8 lines. In the full data file, each amino acid occurs many times with different r<x> values, where x ranges from 1 to 1000.

Last edited by Aurimas; 2 Weeks Ago at 07:26 AM..
# 12  
Understood. I'll try to explain my situation in as much detail as possible. The script I have now is:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
              grep -o -i "$AAA" | wc -l | awk '{print $1}'
          done
  else  echo exit 1
fi

It is run in the terminal on macOS Mojave. When I type ./BSA (the name of the script), it prompts for the amino acid, a capitalised three-letter code (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL), which becomes the value of AAA in the script.
For this case, let's enter ALA, so the terminal shows: amino acid: ALA
Then I press enter and get this output:
Code:
40
40
40
45
90
75
45
70
95
35
70
55
40
55
90
55
50
185
170
25
10
60
35
76
35
20
145
15

The script above prints how many times ALA occurs in each .pdb complex. There are 28 .pdb files/complexes in total, which is why the output has 28 lines. However, that is not the ALA count per complex that I want. The output I expect should be something like this:
Code:
8
8
8
9
18
15
9
14
19
7
14
11
8
11
18
11
10
19
34
5
2
12
7
16
7
4
29

Here ALA is counted without duplication. To see how to achieve this, let's look at a shortened example of the voronota get-balls-from-atoms-file --annotated output for the first .pdb file (complex), which contains 40 ALA lines:
Code:
c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<10>a<2353>R<ALA>A<C> 13.899 10.389 -17.757 1.7 el=C oc=1;tf=29.99
c<B>r<10>a<2354>R<ALA>A<O> 14.653 10.538 -16.788 1.52 el=O oc=1;tf=30.44
c<B>r<10>a<2355>R<ALA>A<CB> 12.686 12.323 -18.679 1.7 el=C oc=1;tf=26.9
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<26>a<2425>R<ALA>A<C> 13.08 20.496 8.431 1.7 el=C oc=1;tf=37.27
c<B>r<26>a<2426>R<ALA>A<O> 13.742 21.478 8.087 1.52 el=O oc=1;tf=39.36
c<B>r<26>a<2427>R<ALA>A<CB> 10.716 20.815 7.844 1.7 el=C oc=1;tf=34.56
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6
c<B>r<130>a<3187>R<ALA>A<N> -4.398 5.962 -24.62 1.55 el=N oc=1;tf=22.4
c<B>r<130>a<3188>R<ALA>A<CA> -3.225 5.141 -24.341 1.7 el=C oc=1;tf=20.7
c<B>r<130>a<3189>R<ALA>A<C> -3.17 4.921 -22.854 1.7 el=C oc=1;tf=19.83
c<B>r<130>a<3190>R<ALA>A<O> -3.725 5.716 -22.066 1.52 el=O oc=1;tf=17.31
c<B>r<130>a<3191>R<ALA>A<CB> -1.913 5.797 -24.7 1.7 el=C oc=1;tf=22.82
c<B>r<177>a<3516>R<ALA>A<N> 0.656 -7.277 -20.93 1.55 el=N oc=1;tf=19.87
c<B>r<177>a<3517>R<ALA>A<CA> -0.367 -8.059 -20.25 1.7 el=C oc=1;tf=19.38
c<B>r<177>a<3518>R<ALA>A<C> -0.263 -9.541 -20.59 1.7 el=C oc=1;tf=20.35
c<B>r<177>a<3519>R<ALA>A<O> 0.029 -9.962 -21.72 1.52 el=O oc=1;tf=19.92
c<B>r<177>a<3520>R<ALA>A<CB> -1.747 -7.592 -20.659 1.7 el=C oc=1;tf=15.99
c<B>r<181>a<3541>R<ALA>A<N> -4.381 -14.273 -14.076 1.55 el=N oc=1;tf=16.9
c<B>r<181>a<3542>R<ALA>A<CA> -4.649 -13.158 -13.194 1.7 el=C oc=1;tf=16.14
c<B>r<181>a<3543>R<ALA>A<C> -3.446 -12.893 -12.306 1.7 el=C oc=1;tf=18.15
c<B>r<181>a<3544>R<ALA>A<O> -2.692 -13.819 -12.014 1.52 el=O oc=1;tf=20.6
c<B>r<181>a<3545>R<ALA>A<CB> -5.817 -13.463 -12.335 1.7 el=C oc=1;tf=15.23
c<B>r<194>a<3626>R<ALA>A<N> 8.308 -12.434 -17.665 1.55 el=N oc=1;tf=29.11
c<B>r<194>a<3627>R<ALA>A<CA> 9.387 -12.364 -18.631 1.7 el=C oc=1;tf=28.89
c<B>r<194>a<3628>R<ALA>A<C> 10.604 -11.653 -18.089 1.7 el=C oc=1;tf=31.02
c<B>r<194>a<3629>R<ALA>A<O> 10.592 -11.177 -16.949 1.52 el=O oc=1;tf=31.88
c<B>r<194>a<3630>R<ALA>A<CB> 8.92 -11.616 -19.844 1.7 el=C oc=1;tf=25.66

As the voronota output above shows, 40 lines contain ALA, so my first script prints 40. However, there are only 8 distinct ALA residues. Each one appears on 5 lines that share the same r<x> value: 5 lines for r<10>, 5 for r<26>, 5 for r<56>, and so on. I want each group of lines with the same r<x> to count as 1 ALA, so that the total for the first .pdb file is 8 instead of 40. Note that x is a number from 1 to 1000, and the number of lines per residue is not always 5: a residue can span 2, 3, 4, 6, 7, 8 or more lines with the same r<x>. Whatever the number of lines, each distinct r<x> should count as exactly 1.
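The counting rule described here (one count per distinct r<x> among the lines for the chosen amino acid) can be sketched with awk. This is a minimal sketch under assumptions, not the script from the thread: the function name count_residues is hypothetical, the sample is a five-line subset of the lines above, and the key combines the chain letter with the residue number so that equal numbers in different chains would stay separate.

```shell
#!/bin/bash
# count_residues NAME: read annotated voronota lines on stdin and print how
# many distinct residues are called NAME. (Hypothetical helper name.)
count_residues() {
    # Splitting on < or > gives: $2 = chain, $4 = residue number, $8 = residue name.
    awk -F'[<>]' -v name="$1" '
        $8 == name && !seen[$2, $4]++ { n++ }   # count each chain/number pair once
        END { print n + 0 }                     # n + 0 prints 0 when nothing matched
    '
}

# Five lines copied from the sample above: two for r<10>, two for r<26>, one for r<56>.
sample='c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77'

ala_count=$(printf '%s\n' "$sample" | count_residues ALA)
echo "$ala_count"   # prints 3: r<10>, r<26> and r<56> are each counted once
```

Comparing the residue-name field exactly ($8 == name) rather than matching the whole line avoids accidental hits elsewhere on the line, at the cost of being case-sensitive.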

The script was then changed to:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi

However the output I get now is this:
Code:
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

The problem here is that every .pdb file (HS_some_complex.pdb) now yields a count of 1: in the first file, all 40 ALA lines (r<10>, r<26>, r<56>, r<88>, etc.) are counted as a single ALA, and the same happens for the other 27 complexes. That's not what I need. I want ALA counted as occurring 8 times for the first complex, as explained above, not 1, and not the 40 that my first script gives. How can I achieve that? Should I change the grep command, the awk command, or both?

I hope it is clearer now, but do let me know if anything is still unclear.

Last edited by Scrutinizer; 1 Week Ago at 12:40 PM.. Reason: quote tags -> code tags
# 13  
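Hmmm, I think I found a logical error in my proposal: adding the $i after the awk script made awk read the respective .pdb file directly, not voronota's output from that file. A raw .pdb file presumably contains no < or > characters, so $4 was empty on every line and all matching lines collapsed into a single count of 1. Remove the $i: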


Code:
               cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}'

and report back.

Still, I'm convinced there will be a more apt / better solution to the overall problem, dealing with ALL .pdb files and ALL amino acids in one go if needed...

And, please use CODE, not ICODE, tags for data as well. You may want to edit your former post.
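Counting all amino acids in one pass, as hinted at above, might look like the following. This is only a sketch under assumptions: the function name count_all_residues is hypothetical, the sample is a five-line subset of the data from post #9, and each residue is identified by chain, residue number and residue name.

```shell
#!/bin/bash
# count_all_residues: read annotated voronota lines on stdin and print
# "NAME COUNT" for every residue name, counting each residue only once.
# (Hypothetical helper name.)
count_all_residues() {
    # With < and > as separators: $2 = chain, $4 = residue number, $8 = residue name.
    awk -F'[<>]' '
        NF >= 8 && !seen[$2, $4, $8]++ { count[$8]++ }   # first line of a residue
        END { for (name in count) print name, count[name] }
    ' | sort   # awk does not guarantee iteration order, so sort the result
}

# Five lines copied from the sample in post #9.
sample='c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4'

all_counts=$(printf '%s\n' "$sample" | count_all_residues)
echo "$all_counts"
```

Inside the existing loop over HS_*.pdb, the same function could replace the grep and wc stages by piping the voronota output for each file into it, which gives one table per complex without prompting for an amino acid.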
# 14  
Thank you, it works, and I have edited my previous post. I also have a similar question about this code:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then 
	for i in qac_"$AAA"_HS_*.pdb.txt; 
		do
			cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
    	       done
else
	exit 1
fi

A .pdb.txt file looks like this:
Code:
c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .

There are 7 lines here, each with an r<x> value for ALA (x is a number from 1 to 100), so grep counts 7 occurrences of ALA. How could I replace grep with awk in this code so that it counts 4 ALA (one per distinct r<x>) instead of 7?
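One possible way to get 4 for this sample is to apply the same deduplication idea to the .pdb.txt layout. This is a sketch under assumptions rather than a confirmed answer from the thread: the function name count_contact_residues is hypothetical, and only the first residue descriptor on each line (the c<A>r<x>R<ALA> part) is examined.

```shell
#!/bin/bash
# count_contact_residues NAME: read .pdb.txt contact lines on stdin and print how
# many distinct residues called NAME appear in the first column.
# (Hypothetical helper name.)
count_contact_residues() {
    # With < and > as separators, the first descriptor c<A>r<134>R<ALA> yields
    # $2 = chain, $4 = residue number, $6 = residue name.
    awk -F'[<>]' -v name="$1" '
        $6 == name && !seen[$2, $4]++ { n++ }   # count each chain/number pair once
        END { print n + 0 }                     # n + 0 prints 0 when nothing matched
    '
}

# The seven sample lines from the post above.
sample='c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .'

contact_count=$(printf '%s\n' "$sample" | count_contact_residues ALA)
echo "$contact_count"   # prints 4: r<134>, r<138>, r<248> and r<260>
```

If the chosen amino acid could also appear in the second descriptor on a line, those fields would need to be checked as well; the sample shows only nucleotides there, so the sketch ignores them.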

Last edited by Scrutinizer; 1 Week Ago at 12:23 PM.. Reason: quote tags -> code tags