Count occurrences of a word without counting its repeats


 
# 1  

Hi, I would like to count the number of ALA occurrences without counting the repeats. The script I have written reports 40 occurrences of ALA, but the answer should be 8. ALA is one of the 20 values the script accepts when it prompts for AAA; for this example I entered ALA.

The script I have:
Code:
#!/bin/bash
read -p "amino acid: " AAA
if [[ "ALA ARG ASN ASP CYS GLN GLY GLU HIS ILE \
	   LEU LYS MET PHE PRO SER THR TRP TYR VAL" =~ $AAA ]]
then 
	for i in HS_data_*.txt; 
		do
			cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'
#			awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i
        done
else
	exit 1
fi

The input of one of HS_data_*.txt file is this:
Code:
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N  
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C  
ATOM   2353  C   ALA B  10      13.899  10.389 -17.757  1.00 29.99           C  
ATOM   2354  O   ALA B  10      14.653  10.538 -16.788  1.00 30.44           O  
ATOM   2355  CB  ALA B  10      12.686  12.323 -18.679  1.00 26.90           C  
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N  
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C  
ATOM   2425  C   ALA B  26      13.080  20.496   8.431  1.00 37.27           C  
ATOM   2426  O   ALA B  26      13.742  21.478   8.087  1.00 39.36           O  
ATOM   2427  CB  ALA B  26      10.716  20.815   7.844  1.00 34.56           C  
ATOM   2643  N   ALA B  56       5.654  16.636 -19.419  1.00 27.14           N  
ATOM   2644  CA  ALA B  56       4.306  16.969 -19.795  1.00 27.77           C  
ATOM   2645  C   ALA B  56       4.139  18.435 -20.144  1.00 29.41           C  
ATOM   2646  O   ALA B  56       3.619  18.808 -21.204  1.00 30.63           O  
ATOM   2647  CB  ALA B  56       3.373  16.628 -18.664  1.00 28.99           C  
ATOM   2887  N   ALA B  88      -3.023   7.753 -19.907  1.00 20.84           N  
ATOM   2888  CA  ALA B  88      -3.018   7.206 -18.575  1.00 17.38           C  
ATOM   2889  C   ALA B  88      -1.627   6.647 -18.364  1.00 18.59           C  
ATOM   2890  O   ALA B  88      -1.086   5.920 -19.197  1.00 14.88           O  
ATOM   2891  CB  ALA B  88      -4.015   6.090 -18.472  1.00 18.60           C  
ATOM   3187  N   ALA B 130      -4.398   5.962 -24.620  1.00 22.40           N  
ATOM   3188  CA  ALA B 130      -3.225   5.141 -24.341  1.00 20.70           C  
ATOM   3189  C   ALA B 130      -3.170   4.921 -22.854  1.00 19.83           C  
ATOM   3190  O   ALA B 130      -3.725   5.716 -22.066  1.00 17.31           O  
ATOM   3191  CB  ALA B 130      -1.913   5.797 -24.700  1.00 22.82           C  
ATOM   3516  N   ALA B 177       0.656  -7.277 -20.930  1.00 19.87           N  
ATOM   3517  CA  ALA B 177      -0.367  -8.059 -20.250  1.00 19.38           C  
ATOM   3518  C   ALA B 177      -0.263  -9.541 -20.590  1.00 20.35           C  
ATOM   3519  O   ALA B 177       0.029  -9.962 -21.720  1.00 19.92           O  
ATOM   3520  CB  ALA B 177      -1.747  -7.592 -20.659  1.00 15.99           C  
ATOM   3541  N   ALA B 181      -4.381 -14.273 -14.076  1.00 16.90           N  
ATOM   3542  CA  ALA B 181      -4.649 -13.158 -13.194  1.00 16.14           C  
ATOM   3543  C   ALA B 181      -3.446 -12.893 -12.306  1.00 18.15           C  
ATOM   3544  O   ALA B 181      -2.692 -13.819 -12.014  1.00 20.60           O  
ATOM   3545  CB  ALA B 181      -5.817 -13.463 -12.335  1.00 15.23           C  
ATOM   3626  N   ALA B 194       8.308 -12.434 -17.665  1.00 29.11           N  
ATOM   3627  CA  ALA B 194       9.387 -12.364 -18.631  1.00 28.89           C  
ATOM   3628  C   ALA B 194      10.604 -11.653 -18.089  1.00 31.02           C  
ATOM   3629  O   ALA B 194      10.592 -11.177 -16.949  1.00 31.88           O  
ATOM   3630  CB  ALA B 194       8.920 -11.616 -19.844  1.00 25.66           C

As you can see from the input, ALA appears on 40 lines, but each ALA is listed 5 times, so there are only 8 distinct ALAs. The 4th column gives the ALA value, while the 6th column identifies which ALA the line belongs to. For example, the ALA at 10 (6th column) appears 5 times, the ALA at 26 appears 5 times, the ALA at 56 also appears 5 times, etc.

The output has to count ALA 8 times instead of 40, which is what my script currently gives with the line cat $i | grep -o -i $AAA | wc -l | awk '{print $1}'.

I was also trying to count ALA 8 times using only the commented-out command awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' $i, but I am struggling to get the awk command right.

So I would like to ask a few questions:
1) How could I make the grep pipeline above count ALA 8 times instead of 40?
2) How could I make the awk command alone (commented out) also count ALA 8 times instead of the 5 it gives now, which does not make sense as there are many more ALA words?

Last edited by Aurimas; 06-14-2019 at 11:06 PM..
# 2  
Hi,

The awk statement works when you just leave out -F"[ ]":
Code:
awk -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i"

This prints 8.


I would suggest slightly modifying it to make it more exact:
Code:
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i"
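
To illustrate why the exact comparison on $4 is safer than a regex match on the whole line, here is a minimal sketch. The temporary file and the GLY line in it are made up for this example (they are not from the thread's data), and the search string "A" is just a stand-in for any input that happens to match elsewhere on the line.

```shell
#!/bin/bash
# Hypothetical sample: two ALA residues (10 and 26) and one invented GLY residue (11).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ATOM   2351  N   ALA B  10      13.856  10.830 -20.161  1.00 27.93           N
ATOM   2352  CA  ALA B  10      13.893  11.449 -18.853  1.00 27.45           C
ATOM   2356  N   GLY B  11      13.100   9.300 -17.900  1.00 28.00           N
ATOM   2423  N   ALA B  26      11.645  18.555   7.864  1.00 32.06           N
ATOM   2424  CA  ALA B  26      11.938  19.955   7.579  1.00 35.40           C
EOF

# Exact comparison on column 4, de-duplicated on column 6.
exact=$(awk -v AMINO="ALA" '$4 == AMINO && !OCC[$6]++ {CNT++} END {print CNT+0}' "$tmp")

# Regex match on the whole line: "A" also matches "ATOM", so every line qualifies.
loose=$(awk -v SRCH="A" '$0 ~ SRCH && !OCC[$6]++ {CNT++} END {print CNT+0}' "$tmp")

echo "exact match on \$4 for ALA: $exact"
echo "regex match on \$0 for A:   $loose"
rm -f "$tmp"
```

With the exact test only the two ALA residues are counted, while the regex version also picks up the GLY line because "A" occurs in "ATOM".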

Mind you, with the loop you are counting per file, so perhaps you would like the filename printed too:
Code:
for i in HS_data_*.txt; 
do
  awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print FILENAME ": ",CNT+0}' "$i"
done

Or if you want the total of all the HS_data files in the directory, try:
Code:
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}' HS_data_*.txt

or if there are so many files that the command line becomes too long (an "Argument list too long" error), try:
Code:
for i in HS_data_*.txt; 
do
  cat "$i"
done | 
awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++ } END {print CNT+0}'
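
One thing to keep in mind with the two "total" variants: the OCC array is shared across all input, so the same value of $6 appearing in two different files is counted only once. If each file should be counted separately and then summed, one possible sketch is to add FILENAME to the array key. The two tiny files below are hypothetical and exist only to show the difference; note that FILENAME is not available in the cat-through-a-pipe variant.

```shell
#!/bin/bash
# Two hypothetical files that both contain an ALA with 10 in column 6.
dir=$(mktemp -d)
printf 'ATOM   2351  N   ALA B  10\nATOM   2352  CA  ALA B  10\n' > "$dir/HS_data_1.txt"
printf 'ATOM   1001  N   ALA B  10\nATOM   1006  N   ALA B  26\n' > "$dir/HS_data_2.txt"
AAA=ALA

# Key on $6 only: the 10 from the second file is treated as a repeat of the first.
shared=$(awk -v AMINO="$AAA" '$4 == AMINO && !OCC[$6]++ {CNT++} END {print CNT+0}' "$dir"/HS_data_*.txt)

# Key on FILENAME and $6: each file's values are counted on their own.
perfile=$(awk -v AMINO="$AAA" '$4 == AMINO && !OCC[FILENAME, $6]++ {CNT++} END {print CNT+0}' "$dir"/HS_data_*.txt)

echo "keyed on \$6 only:          $shared"
echo "keyed on FILENAME and \$6:  $perfile"
rm -rf "$dir"
```

Which of the two totals is the right one depends on whether the files describe the same structure or different ones.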


Regards,

S.

Last edited by Scrutinizer; 06-15-2019 at 04:35 AM..
# 3  
Thank you very much Scrutinizer for a lengthy response!!! You're wonderfully generous.

Everything in your response gives 8, or whatever other value is expected for the variant you wrote, except for the first awk command if I keep the -F option. When I write awk -F"[ ]" -v SRCH="$AAA" '$0 ~ SRCH && !OCC[$6]++ {CNT++ } END {print CNT+0}' "$i" it gives 5, and I have no idea how it arrives at that. Any ideas why that might be the case?
# 4  
As Scrutinizer posted, modifying the field separator is responsible, as it changes field numbering. $6 now assumes the values "C", "N", "O", "CA", "CB", whose count is 5.
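
A quick way to check this is to print $6 under both field separators. The sketch below uses the first two lines of the sample input, cut off after the sixth column; the variable names are only for illustration.

```shell
#!/bin/bash
# First two lines of the sample input, truncated after the sixth column.
lines='ATOM   2351  N   ALA B  10
ATOM   2352  CA  ALA B  10'

# Default splitting: a run of blanks is one separator, so $6 is the last column here.
default_f6=$(printf '%s\n' "$lines" | awk '{print $6}' | paste -sd' ' -)

# -F"[ ]": every single space separates fields, so empty fields appear
# and $6 lands on the third visible column instead.
single_f6=$(printf '%s\n' "$lines" | awk -F"[ ]" '{print $6}' | paste -sd' ' -)

echo "default FS, \$6:      $default_f6"
echo "single-space FS, \$6: $single_f6"
```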
# 5  
Understood, but when I added the field separator and tried $6 as well as the other fields up to $12, none of them gave me 8. Where might the problem be?
# 6  
That's because of the different lengths of $6 (1 or 2) resulting in a different FS count after it, so sometimes "ALA" shows up in $8, sometimes in $9, and the fields to follow as well.
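
As a small sketch of that effect, the loop below reports which field number holds "ALA" when every single space is a separator. It uses the same two truncated sample lines; the output format (value of $6, a colon, then the field number) is just a choice made for this example.

```shell
#!/bin/bash
# First two lines of the sample input, truncated after the sixth column.
lines='ATOM   2351  N   ALA B  10
ATOM   2352  CA  ALA B  10'

# With -F"[ ]", walk over all fields and report where "ALA" sits,
# labelled with the value of $6 on that line.
positions=$(printf '%s\n' "$lines" |
  awk -F"[ ]" '{for (n = 1; n <= NF; n++) if ($n == "ALA") print $6 ":" n}' |
  paste -sd' ' -)

echo "field number holding ALA, per line: $positions"
```

Because the field number shifts from line to line, no single $N can serve as a reliable key with this separator.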
# 7  
RudiC, thanks for confirming my suspicion that the differing field numbers come from the uneven spacing. Thanks!
