Duplication | awk | result


 
# 8  
Old 06-02-2019
Show the input to awk, i.e. the output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file. Does it have the structure you gave in your input data samples?
# 9  
Old 06-02-2019
The output of voronota get-balls-from-atoms-file --annotated for a certain .pdb file (i.e. the input to awk) looks like this (a sample, not the full data set), using the script you gave:

Code:
c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<4>a<2304>R<ARG>A<C> 8.15 9.068 -25.176 1.7 el=C oc=1;tf=24.84
c<B>r<4>a<2305>R<ARG>A<O> 8.73 8.531 -24.211 1.52 el=O oc=1;tf=28.35
c<B>r<4>a<2306>R<ARG>A<CB> 5.934 8.193 -25.972 1.7 el=C oc=1;tf=22.92
c<B>r<4>a<2307>R<ARG>A<CG> 6.32 6.769 -25.634 1.7 el=C oc=1;tf=24.7
c<B>r<4>a<2308>R<ARG>A<CD> 5.685 5.754 -26.618 1.7 el=C oc=1;tf=25.75
c<B>r<4>a<2309>R<ARG>A<NE> 6.077 4.394 -26.252 1.55 el=N oc=1;tf=26.94
c<B>r<4>a<2310>R<ARG>A<CZ> 5.167 3.427 -26.057 1.7 el=C oc=1;tf=26.16
c<B>r<4>a<2311>R<ARG>A<NH1> 3.872 3.686 -26.225 1.55 el=N oc=1;tf=27.03
c<B>r<4>a<2312>R<ARG>A<NH2> 5.545 2.228 -25.597 1.55 el=N oc=1;tf=25.32
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<5>a<2315>R<SER>A<C> 11.054 10.197 -25.23 1.7 el=C oc=1;tf=20.28
c<B>r<5>a<2316>R<SER>A<O> 12.051 9.828 -24.609 1.52 el=O oc=1;tf=17.58
c<B>r<5>a<2317>R<SER>A<CB> 10.87 9.554 -27.656 1.7 el=C oc=1;tf=21.53
c<B>r<5>a<2318>R<SER>A<OG> 10.523 10.853 -28.157 1.52 el=O oc=1;tf=18.21
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<6>a<2320>R<ASP>A<CA> 11.025 12.347 -24.083 1.7 el=C oc=1;tf=25.02
c<B>r<6>a<2321>R<ASP>A<C> 10.878 11.83 -22.661 1.7 el=C oc=1;tf=27.57
c<B>r<6>a<2322>R<ASP>A<O> 11.874 11.789 -21.917 1.52 el=O oc=1;tf=27.98
c<B>r<6>a<2323>R<ASP>A<CB> 10.289 13.663 -24.272 1.7 el=C oc=1;tf=26.54
c<B>r<6>a<2324>R<ASP>A<CG> 10.869 14.511 -25.413 1.7 el=C oc=1;tf=28.35
c<B>r<6>a<2325>R<ASP>A<OD1> 11.913 14.194 -25.989 1.52 el=O oc=1;tf=30.42
c<B>r<6>a<2326>R<ASP>A<OD2> 10.264 15.523 -25.731 1.52 el=O oc=1;tf=29.8

I want this output to go through grep -o -i "$AAA" | wc -l and awk, which gives a string that is later converted to an integer. If that conversion can be avoided, that would be great.

I need to extract the count of amino acids (ARG, SER and ASP in this case). Maybe that is possible with the script you showed before?

Hope this is what you asked for :)

Last edited by Scrutinizer; 06-06-2019 at 12:38 PM.. Reason: quote tags -> code tags and icode tags
# 10  
Old 06-03-2019
Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...

Please be aware that everybody in here can only see (and work with) what you explicitly (!) write / post. Don't assume ANY knowledge of the topic (genetics / biology, in this case) that would allow inferring background info from allusions in the text. Describe the (data) problem as thoroughly as possible, backed by input and desired output samples that are representative, consistent, and as broad as possible.
Don't "change horses" between posts - does your sample in post #9 lead to the "false" output in post #7? Or, where should the

Code:
8
8
8
9 ...

result come from?

Last edited by RudiC; 06-03-2019 at 06:27 AM..
# 11  
Old 06-03-2019
Quote:
Originally Posted by RudiC
Hmmm - I don't see any "ALA" in your recent sample data file - what would be the desired result from it? I see one each of the 4 / ARG, 5 / SER, and 6 / ASP combinations...
As I said in the earlier posts, AAA can be any of the 20 amino acids (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL). The first example I posted used ALA; the recent one contains ARG, SER and ASP. What matters is that, for whichever amino acid is chosen as AAA, I need its count without duplication. In the recent example I need to get only 1 ARG for r<4>, which appears 11 times. SER and ASP must also count as 1 each, even though r<5> is repeated 6 times and r<6> 8 times respectively. In the full data file the chosen AAA occurs with many different r<x> values, where x is an integer from 1 to 1000.
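The counting rule described here (one count per distinct r<x> for the chosen amino acid) can be sketched with awk on a shortened excerpt of the post #9 sample. This is only an illustration under assumptions: it compares the residue name field exactly and case-sensitively instead of using grep -i, and the variable names sample, seen and count are made up for the example.

```shell
# Shortened excerpt of the sample from post #9 (a few atoms per residue).
sample='c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<4>a<2304>R<ARG>A<C> 8.15 9.068 -25.176 1.7 el=C oc=1;tf=24.84
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4'

AAA=ARG

# Splitting on < and > gives: $2 = chain, $4 = residue number, $8 = residue name.
# A matching line is counted only the first time its residue number is seen.
count=$(printf '%s\n' "$sample" |
  awk -F'[<>]' -v SRCH="$AAA" '$8 == SRCH && !seen[$4]++ { n++ } END { print n + 0 }')

echo "$AAA: $count"
```

With the excerpt above this prints ARG: 1, although ARG appears on three lines. The n + 0 in the END block makes awk print 0 rather than an empty line when nothing matches.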

Last edited by Aurimas; 06-03-2019 at 07:26 AM..
# 12  
Old 06-03-2019
Understood. I'll try to explain my situation in as much detail as possible. The script I have now is:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
              grep -o -i "$AAA" | wc -l | awk '{print $1}'
          done
  else  echo exit 1
fi
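
A side note on the validation in this script: the =~ test treats $AAA as a regular expression matched against the whole list, so a partial entry such as AL or A is accepted too. A stricter check could look like the sketch below; the function name is_amino_acid is hypothetical and not part of the original script.

```shell
# Accept only an exact, upper-case three-letter code from the list of 20.
is_amino_acid() {
  case "$1" in
    ALA|ARG|ASN|ASP|CYS|GLN|GLY|GLU|HIS|ILE|LEU|LYS|MET|PHE|PRO|SER|THR|TRP|TYR|VAL)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

for candidate in ALA AL; do
  if is_amino_acid "$candidate"; then
    echo "$candidate: accepted"
  else
    echo "$candidate: rejected"
  fi
done
```

In the script it would replace the [[ ... =~ $AAA ]] condition, e.g. if is_amino_acid "$AAA"; then ... fi.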

It is run in Terminal on macOS Mojave. When I type ./BSA (the name of the script), it prompts with amino acid: and expects a capitalised three-letter code (ALA, ARG, ASN, ASP, CYS, GLN, GLY, GLU, HIS, ILE, LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR or VAL), which becomes the value of AAA in the script.
For this case let's enter ALA, so the terminal shows: amino acid: ALA
Then I press Enter and get this output:
Code:
40
40
40
45
90
75
45
70
95
35
70
55
40
55
90
55
50
185
170
25
10
60
35
76
35
20
145
15

The output from the script above is the number of times ALA occurs in each .pdb complex. There are 28 .pdb files/complexes in total, hence the 28 lines of output. However, that is not the ALA count per complex that I want. The output I expect should be something like this:
Code:
8
8
8
9
18
15
9
14
19
7
14
11
8
11
18
11
10
19
34
5
2
12
7
16
7
4
29

Here the ALA residues are counted without duplication. To see how to achieve this, let's look at the output of voronota get-balls-from-atoms-file --annotated for the first .pdb file (complex), shortened to its 40 ALA lines:
Code:
c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<10>a<2353>R<ALA>A<C> 13.899 10.389 -17.757 1.7 el=C oc=1;tf=29.99
c<B>r<10>a<2354>R<ALA>A<O> 14.653 10.538 -16.788 1.52 el=O oc=1;tf=30.44
c<B>r<10>a<2355>R<ALA>A<CB> 12.686 12.323 -18.679 1.7 el=C oc=1;tf=26.9
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<26>a<2425>R<ALA>A<C> 13.08 20.496 8.431 1.7 el=C oc=1;tf=37.27
c<B>r<26>a<2426>R<ALA>A<O> 13.742 21.478 8.087 1.52 el=O oc=1;tf=39.36
c<B>r<26>a<2427>R<ALA>A<CB> 10.716 20.815 7.844 1.7 el=C oc=1;tf=34.56
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6
c<B>r<130>a<3187>R<ALA>A<N> -4.398 5.962 -24.62 1.55 el=N oc=1;tf=22.4
c<B>r<130>a<3188>R<ALA>A<CA> -3.225 5.141 -24.341 1.7 el=C oc=1;tf=20.7
c<B>r<130>a<3189>R<ALA>A<C> -3.17 4.921 -22.854 1.7 el=C oc=1;tf=19.83
c<B>r<130>a<3190>R<ALA>A<O> -3.725 5.716 -22.066 1.52 el=O oc=1;tf=17.31
c<B>r<130>a<3191>R<ALA>A<CB> -1.913 5.797 -24.7 1.7 el=C oc=1;tf=22.82
c<B>r<177>a<3516>R<ALA>A<N> 0.656 -7.277 -20.93 1.55 el=N oc=1;tf=19.87
c<B>r<177>a<3517>R<ALA>A<CA> -0.367 -8.059 -20.25 1.7 el=C oc=1;tf=19.38
c<B>r<177>a<3518>R<ALA>A<C> -0.263 -9.541 -20.59 1.7 el=C oc=1;tf=20.35
c<B>r<177>a<3519>R<ALA>A<O> 0.029 -9.962 -21.72 1.52 el=O oc=1;tf=19.92
c<B>r<177>a<3520>R<ALA>A<CB> -1.747 -7.592 -20.659 1.7 el=C oc=1;tf=15.99
c<B>r<181>a<3541>R<ALA>A<N> -4.381 -14.273 -14.076 1.55 el=N oc=1;tf=16.9
c<B>r<181>a<3542>R<ALA>A<CA> -4.649 -13.158 -13.194 1.7 el=C oc=1;tf=16.14
c<B>r<181>a<3543>R<ALA>A<C> -3.446 -12.893 -12.306 1.7 el=C oc=1;tf=18.15
c<B>r<181>a<3544>R<ALA>A<O> -2.692 -13.819 -12.014 1.52 el=O oc=1;tf=20.6
c<B>r<181>a<3545>R<ALA>A<CB> -5.817 -13.463 -12.335 1.7 el=C oc=1;tf=15.23
c<B>r<194>a<3626>R<ALA>A<N> 8.308 -12.434 -17.665 1.55 el=N oc=1;tf=29.11
c<B>r<194>a<3627>R<ALA>A<CA> 9.387 -12.364 -18.631 1.7 el=C oc=1;tf=28.89
c<B>r<194>a<3628>R<ALA>A<C> 10.604 -11.653 -18.089 1.7 el=C oc=1;tf=31.02
c<B>r<194>a<3629>R<ALA>A<O> 10.592 -11.177 -16.949 1.52 el=O oc=1;tf=31.88
c<B>r<194>a<3630>R<ALA>A<CB> 8.92 -11.616 -19.844 1.7 el=C oc=1;tf=25.66

As you can see from the voronota output above, 40 lines contain ALA, so the first script prints 40. However, there are only 8 distinct ALA residues: r<10> appears on 5 lines, as do r<26>, r<56> and so on, and each such group should be counted as 1 ALA. Adding these up should give 8 ALA for the first .pdb file instead of 40. Here every ALA residue spans 5 lines with the same r<x> (x being a number from 1 to 1000), but a residue may just as well span 2, 3, 4, 6, 7, 8 or more lines. Whatever the number of lines, each distinct r<x> must count as exactly 1 ALA.
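
The difference between counting lines and counting distinct residues can also be shown with the tools the first script already uses. The sketch below is an alternative to awk, not the method proposed in this thread: it extracts the r<x> tokens from the ALA lines and counts the unique ones. The data is a shortened excerpt of the output above plus one ARG line, and the variable names are made up for the example.

```shell
# Two ALA residues (r<10> with 3 atoms, r<26> with 2 atoms) and one ARG atom.
sample='c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<10>a<2353>R<ALA>A<C> 13.899 10.389 -17.757 1.7 el=C oc=1;tf=29.99
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06
c<B>r<26>a<2424>R<ALA>A<CA> 11.938 19.955 7.579 1.7 el=C oc=1;tf=35.4
c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17'

# What the first script effectively measures: one count per ALA atom line.
lines=$(printf '%s\n' "$sample" | grep -c 'R<ALA>')

# One count per distinct residue: keep only the r<x> token of each ALA line,
# drop duplicates, count what is left. tr strips the padding that wc adds on macOS.
residues=$(printf '%s\n' "$sample" | grep 'R<ALA>' | grep -o 'r<[0-9]*>' |
  sort -u | wc -l | tr -d ' ')

echo "ALA lines: $lines, distinct ALA residues: $residues"
```

For this excerpt the result is 5 lines but 2 residues, which is the same 40 versus 8 relationship at a smaller scale.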

The script was then changed to:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi

However the output I get now is this:
Code:
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

The problem here is that every .pdb file (HS_some_complex.pdb) is reported as containing 1 ALA, i.e. all 40 lines of r<10>, r<26>, r<56>, r<88>, etc. in the first .pdb file count as a single ALA, and likewise for the other 27 .pdb complexes. That's not what I need. I want ALA to be counted 8 times for the first complex, as explained above, and not 40 times as with my first script. So how can this be done? Should I change the grep command, the awk command, or both?

I hope it is clearer now, but do let me know if something is still unclear :)

Last edited by Scrutinizer; 06-06-2019 at 12:40 PM.. Reason: quote tags -> code tags
# 13  
Old 06-03-2019
Hmmm, I think I found a logical error in my proposal: adding the $i after the awk script made it immediately read the respective .pdb file, not voronota's output from that file. Remove the $i:


Code:
               cat $i | voronota get-balls-from-atoms-file --annotated | \
                 awk -F"[<>]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}'

and report back.
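
The effect described in this post, awk ignoring the pipe as soon as it is given a file operand, can be reproduced without voronota. A minimal sketch with a throwaway temporary file (the variable names are made up for the example):

```shell
tmp=$(mktemp)
printf 'one line in the file\n' > "$tmp"

# With a file operand, awk reads that file and never looks at the pipe.
with_file=$(printf 'a\nb\nc\n' | awk 'END { print NR }' "$tmp")

# Without a file operand, awk reads standard input, i.e. the pipe.
from_pipe=$(printf 'a\nb\nc\n' | awk 'END { print NR }')

rm -f "$tmp"
echo "with file operand: $with_file line(s), from pipe: $from_pipe line(s)"
```

This would also explain the constant 1 in post #12, presumably because raw .pdb lines normally contain no < or >, so $4 is empty on every line and only the first ALA line passes the !OCC[$4]++ test.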

Still, I'm convinced there will be a more apt / better solution to the overall problem, dealing with ALL .pdb files and ALL amino acids in one go if needed...
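
One possible shape of such an "all amino acids in one go" solution is sketched below. It is an assumption about what is meant here, not a tested replacement for the script: a single awk pass tallies the distinct residues for every residue name found in one voronota output stream. The sample data is a shortened mix of the lines posted above, and the names seen, n and summary are made up for the example.

```shell
sample='c<B>r<4>a<2302>R<ARG>A<N> 6.162 10.557 -25.517 1.55 el=N oc=1;tf=23.17
c<B>r<4>a<2303>R<ARG>A<CA> 6.64 9.248 -25.115 1.7 el=C oc=1;tf=22.23
c<B>r<5>a<2313>R<SER>A<N> 8.88 9.496 -26.216 1.55 el=N oc=1;tf=24.85
c<B>r<5>a<2314>R<SER>A<CA> 10.339 9.288 -26.237 1.7 el=C oc=1;tf=22.66
c<B>r<6>a<2319>R<ASP>A<N> 10.514 11.396 -25.053 1.55 el=N oc=1;tf=22.4
c<B>r<10>a<2351>R<ALA>A<N> 13.856 10.83 -20.161 1.55 el=N oc=1;tf=27.93
c<B>r<10>a<2352>R<ALA>A<CA> 13.893 11.449 -18.853 1.7 el=C oc=1;tf=27.45
c<B>r<26>a<2423>R<ALA>A<N> 11.645 18.555 7.864 1.55 el=N oc=1;tf=32.06'

# $4 = residue number, $8 = residue name. The first line seen for each
# (name, number) pair bumps the tally of that name; later atoms are skipped.
summary=$(printf '%s\n' "$sample" |
  awk -F'[<>]' '
    !seen[$8, $4]++ { n[$8]++ }
    END { for (name in n) print name, n[name] }
  ' | sort)

echo "$summary"
```

Inside the for i in HS_*.pdb loop, the printf line would be replaced by the voronota pipeline from the script, so that each file yields one small table of counts instead of one number per run.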

And, please use CODE, not ICODE, tags for data as well. You may want to edit your former post.
# 14  
Old 06-03-2019
Thank you, it works, and I edited my previous post. I have a similar question about this code:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then
	for i in qac_"$AAA"_HS_*.pdb.txt
	do
		cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
	done
else
	exit 1
fi

A .pdb.txt file looks like this:
Code:
c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .

There are 7 lines with r<x> (x being a number from 1 to 100) and 7 occurrences of ALA. How could I replace grep with awk in the code so that it counts 4 ALA instead of 7?
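
A possible answer, sketched under assumptions: the same first-occurrence trick as in post #13 should carry over, but in this file format the field positions differ. After splitting on < and >, the residue name is field 6 and there is no atom field. The sketch assumes the amino acid of interest is always the first contact partner on the line, and it keys on chain plus residue number in case numbers repeat across chains. The variable names are made up for the example.

```shell
# The seven lines posted above.
contacts='c<A>r<134>R<ALA> c<C>r<7>R<DC> 12.9516 3.80289 . .
c<A>r<134>R<ALA> c<C>r<8>R<DG> 5.92777 4.58004 . .
c<A>r<138>R<ALA> c<C>r<7>R<DC> 2.65391 4.55194 . .
c<A>r<248>R<ALA> c<C>r<10>R<DG> 9.10674 3.59363 . .
c<A>r<248>R<ALA> c<C>r<11>R<DT> 0.0228499 5.34781 . .
c<A>r<248>R<ALA> c<W>r<4>R<DC> 21.2356 2.61229 . .
c<A>r<260>R<ALA> c<C>r<5>R<DC> 6.66863 5.26436 . .'

AAA=ALA

# $2 = chain, $4 = residue number, $6 = residue name of the first partner.
# Count each (chain, residue number) pair once.
count=$(printf '%s\n' "$contacts" |
  awk -F'[<>]' -v SRCH="$AAA" '$6 == SRCH && !seen[$2, $4]++ { n++ } END { print n + 0 }')

echo "$AAA: $count"
```

For these seven lines the distinct residues are r<134>, r<138>, r<248> and r<260>, so the result is 4. In the script, the awk command would take the place of grep -o -i $AAA | wc -l | awk '{print $1}', reading each $i directly.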

Last edited by Scrutinizer; 06-06-2019 at 12:23 PM.. Reason: quote tags -> code tags