Count occurrences of the word without it repeating


# 1  

Hi, I would like to count the number of ALA occurrences without counting the repeats. The script I have written so far reports 40 occurrences of ALA, but the answer has to be 8. ALA is one of the 20 values the script accepts when it asks for the input AAA; for this example, ALA is the value chosen.

The script I have:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then 
	for i in HS_data_*.txt; 
		do
			cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
#			awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i
        done
else
	exit 1
fi

The content of one of the HS_data_*.txt files is this:
Code:
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N  
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C  
ATOM   2353  C   ALA B  10      13.899  10.389 -17.757  1.00 29.99           C  
ATOM   2354  O   ALA B  10      14.653  10.538 -16.788  1.00 30.44           O  
ATOM   2355  CB  ALA B  10      12.686  12.323 -18.679  1.00 26.90           C  
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N  
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C  
ATOM   2425  C   ALA B  26      13.080  20.496   8.431  1.00 37.27           C  
ATOM   2426  O   ALA B  26      13.742  21.478   8.087  1.00 39.36           O  
ATOM   2427  CB  ALA B  26      10.716  20.815   7.844  1.00 34.56           C  
ATOM   2643  N   ALA B  56       5.654  16.636 -19.419  1.00 27.14           N  
ATOM   2644  CA  ALA B  56       4.306  16.969 -19.795  1.00 27.77           C  
ATOM   2645  C   ALA B  56       4.139  18.435 -20.144  1.00 29.41           C  
ATOM   2646  O   ALA B  56       3.619  18.808 -21.204  1.00 30.63           O  
ATOM   2647  CB  ALA B  56       3.373  16.628 -18.664  1.00 28.99           C  
ATOM   2887  N   ALA B  88      -3.023   7.753 -19.907  1.00 20.84           N  
ATOM   2888  CA  ALA B  88      -3.018   7.206 -18.575  1.00 17.38           C  
ATOM   2889  C   ALA B  88      -1.627   6.647 -18.364  1.00 18.59           C  
ATOM   2890  O   ALA B  88      -1.086   5.920 -19.197  1.00 14.88           O  
ATOM   2891  CB  ALA B  88      -4.015   6.090 -18.472  1.00 18.60           C  
ATOM   3187  N   ALA B 130      -4.398   5.962 -24.620  1.00 22.40           N  
ATOM   3188  CA  ALA B 130      -3.225   5.141 -24.341  1.00 20.70           C  
ATOM   3189  C   ALA B 130      -3.170   4.921 -22.854  1.00 19.83           C  
ATOM   3190  O   ALA B 130      -3.725   5.716 -22.066  1.00 17.31           O  
ATOM   3191  CB  ALA B 130      -1.913   5.797 -24.700  1.00 22.82           C  
ATOM   3516  N   ALA B 177       0.656  -7.277 -20.930  1.00 19.87           N  
ATOM   3517  CA  ALA B 177      -0.367  -8.059 -20.250  1.00 19.38           C  
ATOM   3518  C   ALA B 177      -0.263  -9.541 -20.590  1.00 20.35           C  
ATOM   3519  O   ALA B 177       0.029  -9.962 -21.720  1.00 19.92           O  
ATOM   3520  CB  ALA B 177      -1.747  -7.592 -20.659  1.00 15.99           C  
ATOM   3541  N   ALA B 181      -4.381 -14.273 -14.076  1.00 16.90           N  
ATOM   3542  CA  ALA B 181      -4.649 -13.158 -13.194  1.00 16.14           C  
ATOM   3543  C   ALA B 181      -3.446 -12.893 -12.306  1.00 18.15           C  
ATOM   3544  O   ALA B 181      -2.692 -13.819 -12.014  1.00 20.60           O  
ATOM   3545  CB  ALA B 181      -5.817 -13.463 -12.335  1.00 15.23           C  
ATOM   3626  N   ALA B 194       8.308 -12.434 -17.665  1.00 29.11           N  
ATOM   3627  CA  ALA B 194       9.387 -12.364 -18.631  1.00 28.89           C  
ATOM   3628  C   ALA B 194      10.604 -11.653 -18.089  1.00 31.02           C  
ATOM   3629  O   ALA B 194      10.592 -11.177 -16.949  1.00 31.88           O  
ATOM   3630  CB  ALA B 194       8.920 -11.616 -19.844  1.00 25.66           C

As you can see from the input, ALA appears on 40 lines, but each ALA takes up 5 lines, so there are 8 distinct ALAs in total. The 4th column gives the ALA value, while the 6th column holds the number that identifies which ALA a line belongs to. For example, the ALA at 10 (6th column) is repeated 5 times, the ALA at 26 is repeated 5 times, the ALA at 56 is also repeated 5 times, etc.

The output has to count ALA 8 times instead of 40, which is what my script currently reports (the line shown in bold: cat $i | grep -o -i $AAA | wc -l | awk '{print $1}').

I was also trying to figure out how to count ALA 8 times using only the commented-out command, awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i, but I am struggling to get the awk command right.

Thus, I would like to ask a few questions:
1) How could I make the bolded command (the grep pipeline) count ALA 8 times instead of 40?
2) How could I make the commented-out awk command on its own also count ALA 8 times? It currently prints 5, which does not make sense to me, as there are many more ALA words than that.

Last edited by Aurimas; 06-14-2019 at 11:06 PM..
# 2  
Hi,

The awk statement works when you just leave out -F"[ ]":
Code:
awk -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i"

That renders 8.


I would suggest slightly modifying it to make it more exact:
Code:
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i"
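To illustrate what "more exact" means here, a minimal sketch using two lines from the sample input. The search value CA is a made-up input chosen only for the demonstration (it is an atom name in these lines, not a residue name): the regex match against the whole line counts it, while the exact comparison against $4 does not.

```shell
#!/bin/bash
# Two lines taken from the sample input in post #1.
sample=$(cat <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
EOF
)

# Hypothetical input: "CA" occurs as an atom name (3rd column), not as a residue.
AAA=CA

# Regex match anywhere in the line: the atom name "CA" produces a hit.
regex_hits=$(printf '%s\n' "$sample" |
  awk -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++} END {print CNT+0}')

# Exact comparison against the 4th column: no residue is called "CA" here.
exact_hits=$(printf '%s\n' "$sample" |
  awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++} END {print CNT+0}')

echo "regex match on \$0: $regex_hits"
echo "exact match on \$4: $exact_hits"
```

With a well-formed three-letter residue name both variants agree on this data, so the difference only matters for inputs that can also turn up in other columns.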

Mind you, with the loop approach you are counting per file, so perhaps you would like the filename too:
Code:
for i in HS_data_*.txt; 
do
  awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print FILENAME ": ",CNT+0}' "$i"
done

Or if you want the total of all the HS_data files in the directory, try:
Code:
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' HS_data_*.txt
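One thing to watch with a single awk over several files: OCC is keyed on $6 alone, so the same number appearing in two different files is counted only once. If the total should treat each file separately, a possible variation is to add FILENAME to the key. A minimal sketch, assuming that is what you want; the two small files below are hypothetical and built from lines of the sample input:

```shell
#!/bin/bash
# Hypothetical test data: two files that both contain residue number 10.
tmpdir=$(mktemp -d)
cat > "$tmpdir/HS_data_1.txt" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N
EOF
cat > "$tmpdir/HS_data_2.txt" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
EOF

AAA=ALA

# Keyed on $6 only: number 10 in the second file is treated as a repeat.
plain=$(awk -v AMINO="$AAA" \
  '$4 == AMINO && !OCC[$6]++ {CNT++} END {print CNT+0}' "$tmpdir"/HS_data_*.txt)

# Keyed on FILENAME and $6: each file contributes its own distinct numbers.
perfile=$(awk -v AMINO="$AAA" \
  '$4 == AMINO && !OCC[FILENAME, $6]++ {CNT++} END {print CNT+0}' "$tmpdir"/HS_data_*.txt)

echo "keyed on \$6 only:         $plain"
echo "keyed on FILENAME and \$6: $perfile"

rm -rf "$tmpdir"
```

Which of the two totals is the right one depends on whether the same number in different files refers to the same thing in your data.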

or if there are too many files and you get line length errors, try:
Code:
for i in HS_data_*.txt; 
do
  cat "$i"
done | 
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}'
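As for question 1, the grep pipeline can also be made to count distinct entries without awk. A minimal sketch, assuming every line starts in the first column (as in the sample) and that the 6th whitespace-separated column identifies the ALA: squeeze the runs of spaces, cut out column 6, and count the unique values. The temporary file is hypothetical and holds five lines from the sample input.

```shell
#!/bin/bash
# Hypothetical test file with five lines from the sample input
# (numbers 10, 10, 26, 26 and 56 in the 6th column).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C
ATOM   2643  N   ALA B  56       5.654  16.636 -19.419  1.00 27.14           N
EOF

AAA=ALA

# grep -w keeps lines containing the word; tr -s squeezes repeated spaces so
# that cut sees one delimiter per column; sort -u drops the repeated numbers.
count=$(grep -i -w -- "$AAA" "$tmpfile" | tr -s ' ' | cut -d ' ' -f 6 | sort -u | wc -l)
count=$((count))   # strip the padding some wc implementations add

echo "$count"
rm -f "$tmpfile"
```

Note that grep still matches the word anywhere on the line, so the awk version with $4 == AMINO remains the stricter test.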


Regards,

S.

Last edited by Scrutinizer; 06-15-2019 at 04:35 AM..
# 3  
Thank you very much, Scrutinizer, for the lengthy response! You're wonderfully generous :)

Everything in your response gives 8, or whatever other value is expected for the command in question, except for my original awk code. When I write awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i" it gives 5, and I have no idea how it came up with that. Any ideas why that might be the case?
# 4  
As Scrutinizer posted, modifying the field separator is responsible, as it changes field numbering. $6 now assumes the values "C", "N", "O", "CA", "CB", whose count is 5.
# 5  
Understood, but when I added the field separator and tried $6 as well as the other fields up to $12, none of them gave me 8. Where might the problem be?
# 6  
That's because of the different lengths of $6 (1 or 2) resulting in a different FS count after it, so sometimes "ALA" shows up in $8, sometimes in $9, and the fields to follow as well.
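A small sketch to make this visible, using two lines from the sample input. The helper function find_ala is only for illustration; it prints the number of the field that holds "ALA" for a given field separator.

```shell
#!/bin/bash
# Two lines from the sample input: atom names "N" (1 char) and "CA" (2 chars).
line_n='ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N'
line_ca='ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C'

# Illustrative helper: print the index of the field equal to "ALA".
# $1 = input line, $2 = field separator to pass to awk.
find_ala() {
  printf '%s\n' "$1" |
    awk -F"$2" '{for (i = 1; i <= NF; i++) if ($i == "ALA") print i}'
}

# With -F"[ ]" every single space separates fields, so runs of spaces
# create empty fields and shift everything that follows.
pos_n=$(find_ala "$line_n" '[ ]')
pos_ca=$(find_ala "$line_ca" '[ ]')

# With the default separator (a single space as FS means "runs of blanks"),
# ALA is always field 4.
default_n=$(find_ala "$line_n" ' ')
default_ca=$(find_ala "$line_ca" ' ')

echo "-F'[ ]': ALA is in field $pos_n (N line) and field $pos_ca (CA line)"
echo "default: ALA is in field $default_n (N line) and field $default_ca (CA line)"
```

Because the position moves from line to line, no fixed field number can work with -F"[ ]" on this kind of column-aligned data.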
# 7  
RudiC, thanks for confirming my thought that different numerical values have to do with uneven spacings. Thanks!