Duplication | awk | result


# 1  

Dear forum members,

I want the script to count ALA once per associated number (see the example below), so that it returns 1 rather than 5 as it does now (see the bash script). How can I change the script so that, before or after finding all instances of ALA, it checks whether they share the same associated number? In the example, 56 (r<56>) appears with ALA five times; the same can happen for 13, 99, 88, and so on. In general, I want the script to count how many times ALA occurs and return that as an integer, but duplicates with the same number should be counted only once.

The data example that I want to count as one:

Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

Here is the code:

Code:
#!/bin/bash
read -p "amino acid: " AAA
if [ "$AAA" == "ALA" ] || [ "$AAA" == "VAL" ] || [ "$AAA" == "ARG" ] || [ "$AAA" == "ASN" ] || [ "$AAA" == "ASP" ] || 
   [ "$AAA" == "CYS" ] || [ "$AAA" == "GLY" ] || [ "$AAA" == "ILE" ] || [ "$AAA" == "LEU" ] || [ "$AAA" == "LYS" ] ||
   [ "$AAA" == "MET" ] || [ "$AAA" == "PHE" ] || [ "$AAA" == "PRO" ] || [ "$AAA" == "SER" ] || [ "$AAA" == "THR" ] || 
   [ "$AAA" == "TRP" ] || [ "$AAA" == "TYR" ] || [ "$AAA" == "HIS" ] || [ "$AAA" == "GLN" ] || [ "$AAA" == "GLU" ]
then 
#for AAA in ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
#		LEU LYS MET PHE PRO SER THR TRP TYR VAl; do
	for i in HS_*.pdb; do
		cat $i | voronota get-balls-from-atoms-file --annotated | \
                grep -o -i "$AAA" | wc -l | awk '{print $1}' 
#    	       | voronota calculate-contacts --annotated \
#    	       | voronota query-contacts --inter-residue --match-first "R<$AAA>" --match-second "R<A,DC,DG,DT>" \
    done
else
	exit 1
fi

I look forward to your responses.

Sincerely,
Aurimas

Last edited by Scrutinizer; 1 Week Ago at 12:24 PM.. Reason: quote tags -> code tags
# 2  
Not sure I understand what you're after. Is it the count of distinct occurrences of the $AAA value combined with that r<...> "number associated with ALA" (the $AAA value)? Where do the 88 "-" and the 99 "-" come into play? What would the result be for the above sample code and data?
# 3  
So the output of the given script is 10. When the script is run, it first asks for an amino acid (in this case ALA); the answer is stored in AAA, which the rest of the script works with. AAA can be any of the values listed in the if statement (ALA, ARG, and so on).

Let's take the quote:

Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

This output (the sample above) is produced by the command
Code:
 voronota get-balls-from-atoms-file --annotated

and is then piped to
Code:
 grep -o -i "$AAA" | wc -l | awk '{print $1}'

which in turn prints the integer 10, the number of times ALA occurs in the dataset (see the sample above).

However, when the output of
Code:
 voronota get-balls-from-atoms-file --annotated

is piped through a counting step such as
Code:
 grep -o -i "$AAA" | wc -l | awk '{print $1}'

I want the result to be the numerical value 2.

Why 2? In the sample above, the first 5 lines all contain ALA (in the ...r<>a..R<ALA>... part), but these ALA entries all carry the same number (they belong to 56):
Code:
c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<56>a<2645>R<ALA>A<C> 4.139 18.435 -20.144 1.7 el=C oc=1;tf=29.41
c<B>r<56>a<2646>R<ALA>A<O> 3.619 18.808 -21.204 1.52 el=O oc=1;tf=30.63
c<B>r<56>a<2647>R<ALA>A<CB> 3.373 16.628 -18.664 1.7 el=C oc=1;tf=28.99

The same reasoning applies to ALA and the number 88 (sample below), which also occurs 5 times with ALA. So the total number of ALA occurrences in the sample is 10, but I want it to be 2, since ALA should count once for each number (56 and 88). ALA (or whichever of the 20 AAA values is chosen) is constant in this case, while the numbers change and can be any integer from 1 to 1000.
Code:
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38
c<B>r<88>a<2889>R<ALA>A<C> -1.627 6.647 -18.364 1.7 el=C oc=1;tf=18.59
c<B>r<88>a<2890>R<ALA>A<O> -1.086 5.92 -19.197 1.52 el=O oc=1;tf=14.88
c<B>r<88>a<2891>R<ALA>A<CB> -4.015 6.09 -18.472 1.7 el=C oc=1;tf=18.6

Quote:
Is it the count of distinct occurrences of the $AAA value combined with that r<...> "number associated with ALA" (the $AAA value)?
I want the count of AAA printed to the terminal to be 2 (for the sample above, which has 56 and 88) and not 10 when the script is run. If a line with r<56> is repeated 5 times (as in the sample above), AAA also appears 5 times (in this case ...R<ALA>...), but I want all occurrences of ALA at 56 to count as 1. All occurrences of the chosen AAA at 88 should count as 1 as well, and so on. The numbers 88, 99, or any other are arbitrary and can change; that answers your question about 88 and 99: they are just other numbers like 56.
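
The counting rule described here can be sketched with standard tools, staying close to the original grep pipeline. This is only an illustration on four of the sample lines, not the script from the thread: it assumes every line starts with c<...>r<...>, and it keys on the chain letter together with the r<...> number, which is an assumption beyond what the post asks for.

```shell
#!/bin/bash
AAA=ALA   # example value; the original script reads it with read -p

# Sample lines copied from the post (two atoms from each of r<56> and r<88>).
sample='c<B>r<56>a<2643>R<ALA>A<N> 5.654 16.636 -19.419 1.55 el=N oc=1;tf=27.14
c<B>r<56>a<2644>R<ALA>A<CA> 4.306 16.969 -19.795 1.7 el=C oc=1;tf=27.77
c<B>r<88>a<2887>R<ALA>A<N> -3.023 7.753 -19.907 1.55 el=N oc=1;tf=20.84
c<B>r<88>a<2888>R<ALA>A<CA> -3.018 7.206 -18.575 1.7 el=C oc=1;tf=17.38'

count=$(printf '%s\n' "$sample" |
    grep -F "R<$AAA>" |      # keep only lines for the chosen residue name
    cut -d'>' -f1,2 |        # keep the prefix up to the number, e.g. c<B>r<56
    sort -u |                # one line per distinct chain/number pair
    wc -l)
echo $((count))              # arithmetic expansion strips any padding from wc
```

For these four lines the script prints 2: the r<56> lines collapse into one entry and the r<88> lines into another.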

Hope it is clearer now :-) If not let me know I'll try to rephrase everything.

Last edited by Scrutinizer; 1 Week Ago at 12:36 PM..
# 4  
OK, that's what I inferred. How about (the first line will work with a recent shell only)
Code:
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                  awk -F"[<>]" -vSRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi
2

# 5  
When I run your proposed code

Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
  then  for i in HS_*.pdb
          do  cat $i | voronota get-balls-from-atoms-file --annotated | \
                  awk -F"[<>]" -vSRCH="$AAA" '$0 ~ SRCH && !OCC[$4]++ {CNT++ } END {print CNT+0}' $i 
          done
  else  echo exit 1
fi

in the macOS Mojave terminal, I get this error:
Code:
awk: invalid -v option


Last edited by Scrutinizer; 1 Week Ago at 12:37 PM.. Reason: quote tags -> code tags
# 6  
The Mac OS man page (your friend, btw) for awk says it knows the -v option:
Quote:
-v var=value Assign values before prog is executed
Try a space between -v and the assignment?
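
A minimal sketch of the spaced form, using the variable name SRCH from the proposed code. The two input lines are made up for illustration:

```shell
#!/bin/bash
AAA=ALA   # example value

# -v and the assignment are separate words, which the awk in this thread accepts.
result=$(printf 'R<ALA>\nR<GLY>\n' |
    awk -v SRCH="$AAA" '$0 ~ SRCH { c++ } END { print c + 0 }')
echo "$result"
```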
# 7  
Thank you, it works. However, the output is now 1 in every case, which is incorrect. I need to count AAA instances for 28 complexes: the 1st complex should give 8, because ALA occurs at 8 different numbers (40 lines if every ALA line is counted, since each number has 5 ALA lines, as with 56 in our example); the 2nd should give 8 as well, the 3rd 8, the 4th 9, and so on. With AAA chosen as ALA, the count per complex starts at 2 (10 lines, 2 x 5) and can take any value from 2 to 37. Should I use cat $i | with some grep filter before passing the data to awk?
The output I get now is:
Quote:
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
How can I fix this so that I get:
Quote:
8
8
8
9
18
15
9
14
19
7
14
11
8
11
18
11
10
19
34
5
2
12
7
16
7
4
29
3
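
One possible cause of the constant 1, offered as a reading of the posted code rather than a confirmed diagnosis: the awk command still ends in `$i`, so awk reads the original .pdb file and ignores what voronota pipes in. Assuming the raw PDB lines contain no `<` or `>`, `$4` is empty on every matching line, all matches share one key, and the count stops at 1. A sketch with the file operand dropped follows; `count_residues` is a hypothetical helper name, and voronota is assumed to be on the PATH.

```shell
#!/bin/bash
# Sketch only: count_residues is a hypothetical helper, not part of voronota.
count_residues() {
    # Reads annotated balls on stdin; there is no file operand, so the pipe is used.
    # With "<" and ">" as separators, $4 is the r<...> number and $8 the residue name.
    awk -F'[<>]' -v SRCH="$1" '$8 == SRCH && !OCC[$4]++ { CNT++ } END { print CNT + 0 }'
}

AAA=ALA   # example value; the original script reads it with read -p

# The loop only runs where voronota is installed.
if command -v voronota >/dev/null 2>&1; then
    for i in HS_*.pdb; do
        [ -e "$i" ] || continue   # skip the literal pattern if nothing matches
        voronota get-balls-from-atoms-file --annotated < "$i" | count_residues "$AAA"
    done
fi
```

Comparing `$8` with the name, instead of matching the whole line against it, is a small change from the proposed code: it avoids counting lines where the three letters appear elsewhere. If one file holds several chains that reuse the same numbers, `OCC[$2, $4]` would keep them apart.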