Number of matches and matched pattern(s) in awk


 
# 1  
Old 12-26-2015

input:
Code:
!@#$%2QW5QWERTAB$%^&*

The string above has no separators (as if FS="").
For clarity's sake, it can be rewritten with "|" as a separator, as follows:
Code:
!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*

I am only interested in the patterns containing capital letters (how many there are varies between records), i.e.:
2QW
5QWERT
A
B

Note that a pattern with more than one capital letter is preceded by a digit that equals the length of the pattern.

The output I am trying to obtain is:
Code:
String   # of A   #of B   # of longer patterns
!@#$%2QW5QWERTAB$%^&*   1   1   1:QW; 1:QWERT

What I tried so far:
Code:
awk 'BEGIN{OFS="   "; print "String   # of A   #of B   # of longer patterns"}
{
   string=$0
   
   # number of "A" (gsub returns the number of substitutions made)
   num_A=gsub(/A/,"A",string)
   
   # number of "B"
   num_B=gsub(/B/,"B",string)
   
   # extract long pattern #1
   num_pattern_1=0
   match(string, /[0-9]+/)
   length_pattern_1=substr(string, RSTART, RLENGTH)
   pattern_1=substr(string, RSTART+RLENGTH, length_pattern_1)

   # extract long pattern #2 (stuck here)
   # Is there a way to skip the first digit match?
   # If I use split(string, b, "[0-9]+") I could loop over the elements of
   # array b, but I would lose the pattern length.

   # count the occurrences of each pattern
   # Since I cannot use split, I do not see how to iterate the count
   # over the different motifs.

   print string, num_A, num_B, num_pattern_1":"pattern_1";"num_pattern_2":"pattern_2"; "num_pattern_X":"pattern_X
}'
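
On the question of skipping the first digit match: match() always reports the leftmost match, so one possible approach is to cut off the part of the string already handled and call match() again on the remainder. The sketch below is only an illustration of that idea, not a full solution. The function name long_patterns is made up, and it assumes every number is immediately followed by at least that many characters belonging to the pattern.

```shell
# Hypothetical helper: print each length-prefixed pattern on its own line.
long_patterns() {
  awk '{
    rest = $0
    # match() finds the FIRST digit run only, so drop what has been
    # consumed and search the remainder again.
    while (match(rest, /[0-9]+/)) {
      len = substr(rest, RSTART, RLENGTH) + 0       # "12" -> 12
      print substr(rest, RSTART + RLENGTH, len)     # the len characters after the number
      rest = substr(rest, RSTART + RLENGTH + len)   # skip the number and its pattern
    }
  }'
}

printf '%s\n' '!@#$%2QW5QWERTAB$%^&*' | long_patterns
# prints:
# QW
# QWERT
```

Because the length is read before the string is cut, it is not lost the way it would be with split(string, b, "[0-9]+").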

# 2  
Old 12-26-2015
Code:
$ 
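$ cat f36.awk
# Collect every pattern of each input line in pattern[1..ind].
BEGIN {ind=0; str=""}
{
     # An empty separator splits the line into single characters
     # (an extension supported by gawk; POSIX leaves it unspecified).
     n = split($0, a, "");
     i = 1;
     while (i <= n) {
         if (a[i] >= 1 && a[i] <= 9) {
             # A digit: keep it plus the a[i] characters that follow.
             str = a[i];
             for (j=i+1; j<=i+a[i]; j++) {
                 str = str""a[j];
             }
             ind++;
             pattern[ind] = str;
             str = "";
             i = i + a[i] + 1;
         } else if (a[i] >= "A" && a[i] <= "Z") {
             # A lone capital letter is a pattern by itself.
             ind++;
             pattern[ind] = a[i];
             i++;
         } else {
             # Anything else is skipped.
             i++;
         }
     }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
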
$ 
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [5QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ 
$ 
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [7QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [3PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [O]
$ 
$
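
A side note on portability: split($0, a, "") relies on the empty-separator behaviour that gawk provides, which POSIX leaves unspecified. If that matters, the same walk can probably be done with substr() and length() only. The following is a rough sketch under the same assumptions as the script above (single-digit lengths, capital letters only, digit kept in the pattern); the function name f36_portable is invented for the example.

```shell
# Hypothetical portable variant: read characters with substr() instead of split().
f36_portable() {
  LC_ALL=C awk '{
    n = length($0)
    i = 1
    while (i <= n) {
      c = substr($0, i, 1)               # character at position i
      if (c ~ /^[1-9]$/) {
        print substr($0, i, c + 1)       # the digit plus the c characters after it
        i += c + 1
      } else {
        if (c ~ /^[A-Z]$/) print c       # a lone capital letter
        i++
      }
    }
  }'
}

printf '%s\n' '!@#$%2QW5QWERTAB$%^&*' | f36_portable
# prints:
# 2QW
# 5QWERT
# A
# B
```

LC_ALL=C is set so that the range [A-Z] means exactly the 26 ASCII capitals regardless of the current locale.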

# 3  
Old 12-26-2015
Are the numbers specifying the length of a pattern limited to a single digit?
  1. Does the string 12XYCCCCCCCCCCD contain 2 patterns: 12XYCCCCCCCCCC and D (each occurring once)? Or does it contain the pattern 2XY occurring once, the pattern C occurring 10 times, and the pattern D occurring once?
  2. What happens if a digit is followed by fewer uppercase letters than are specified by that digit?
    In the string 3D@C, is there one pattern (3D@C) or two patterns (D and C)?
  3. Your code explicitly counts occurrences of A and B separately from counting patterns that might contain them. Is that what you want?
    Should the string 4AABB just report one occurrence of the pattern 4AABB? Or, should it report the pattern 4AABB occurring once, two occurrences of the pattern A, and two occurrences of the pattern B?
  4. Is the digit 1 special?
    Should the string 1XX be treated as one occurrence of the pattern 1X and one occurrence of the pattern X? Or, should it be treated as two occurrences of the pattern X?

Last edited by Don Cragun; 12-26-2015 at 10:45 PM.. Reason: Add 4th question.
# 4  
Old 12-26-2015
Addendum to Don's question # 2:

What happens if a digit is followed by a mix of uppercase and lowercase letters? Eg. in the string "3AbCD", is the pattern "3ACD" or something else?
# 5  
Old 12-26-2015
Sorry for the lack of clarity.

Quote:
Are the numbers specifying the length of a pattern limited to a single digit?
1. Does the string 12XYCCCCCCCCCCD contain 2 patterns: 12XYCCCCCCCCCC and D (each occurring once)? Or does it contain the pattern 2XY occurring once, the pattern C occurring 10 times, and the pattern D occurring once?
No. The number specifying the length of the pattern is always ≥2.
In the example Don Cragun mentioned ('12XYCCCCCCCCCCD'), there are 2 patterns: '12XYCCCCCCCCCC' and 'D'.
'12' should be read as the number 12, not as the single digits '1' and '2'.

Quote:
2. What happens if a digit is followed by fewer uppercase letters than are specified by that digit?
In the string 3D@C , is there one pattern ( 3D@C ) or two patterns ( D and C )?
This cannot happen. A number is always followed only by uppercase or lowercase letters (no symbols or other characters), and the run of letters is always at least as long as the number that precedes it.

Quote:
Your code explicitly counts occurrences of A and B separately from counting patterns that might contain them. Is that what you want?
Should the string 4AABB just report one occurrence of the pattern 4AABB ? Or, should it report the pattern 4AABB occurring once, two occurrences of the pattern A , and two occurrences of the pattern B ?
Correct. '4AABB' is one pattern only, 'AABB', as defined by the number '4'.

Quote:
Is the digit 1 special?
Should the string 1XX be treated as one occurrence of the pattern 1X and one occurrence of the pattern X ? Or, should it be treated as two occurrences of the pattern X ?
The number 1 never appears in the string. The only numbers present are always ≥2 (e.g. 2, 34, 2000...).

Quote:
What happens if a digit is followed by a mix of uppercase and lowercase letters? Eg. in the string "3AbCD", is the pattern "3ACD" or something else?
If '3AbCD' occurs, then we have 2 patterns: 'AbC' and 'D'.
A number X is always followed by letters that form an X-long pattern. Case doesn't matter as long as the characters are letters.
If a letter occurs directly after the X-long motif, it is considered a pattern by itself.

example 1:
Code:
!@#$%4AvDf2QWER

There are 4 patterns ('AvDf', 'QW', 'E' and 'R')

example 2:
Code:
3BHuI4RtYU2vGP

There are 5 patterns ('BHu', 'I', 'RtYU', 'vG' and 'P')

example 3:
Code:
$%$6ABcdEf)-2yg*%/LK@~~()

There are 4 patterns ('ABcdEf', 'yg', 'L' and 'K')

Last edited by beca123456; 12-26-2015 at 11:35 PM..
# 6  
Old 12-26-2015
Code:
$ 
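$ cat f36_v1.awk
# Collect the patterns of each input line in pattern[1..ind], without the
# length prefix. The prefix may have several digits; letters of either case
# outside a prefixed pattern count as one-letter patterns.
BEGIN {ind=0; str=""}
{
     n = split($0, a, "");   # one character per element (gawk extension)
     i = 1;
     while (i <= n) {
         if (a[i] >= 1 && a[i] <= 9) {
             # Gather all consecutive digits of the length prefix.
             while (1) {
                 str = str""a[i];
                 # Stop at the end of the line or at the first non-digit.
                 if (i >= n || a[i+1] !~ /^[0-9]$/) { break; }
                 else { i++; }
             }
             len = str + 0;
             str = "";
             # The pattern is the len characters after the prefix.
             for (j=i+1; j<=i+len; j++) {
                 str = str""a[j];
             }
             ind++;
             pattern[ind] = str;
             str = "";
             i += len + 1;
         } else if ((a[i] >= "a" && a[i] <= "z") || (a[i] >= "A" && a[i] <= "Z")) {
             # A letter outside a prefixed pattern is a pattern by itself.
             ind++;
             pattern[ind] = a[i];
             i++;
         } else {
             # Symbols are skipped.
             i++;
         }
     }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
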
$ 
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ 
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [n]
pattern[10] = [O]
$ 
$ echo "\$%\$6ABcdEf)-2yg*%/LK@~~()" | awk -f f36_v1.awk
pattern[1] = [ABcdEf]
pattern[2] = [yg]
pattern[3] = [L]
pattern[4] = [K]
$ 
$ echo "3BHuI4RtYU2vGP" | awk -f f36_v1.awk
pattern[1] = [BHu]
pattern[2] = [I]
pattern[3] = [RtYU]
pattern[4] = [vG]
pattern[5] = [P]
$ 
$ echo "\!@#$%4AvDf2QWER" | awk -f f36_v1.awk
pattern[1] = [AvDf]
pattern[2] = [QW]
pattern[3] = [E]
pattern[4] = [R]
$ 
$

# 7  
Old 12-27-2015
Thanks durden_tyler, it helps a lot!

Now I have to work on the format of the output as mentioned in my original post, i.e. counting the number of occurrences of each pattern, with the multiple-letter patterns together in one field separated by "; " and the single-letter patterns in their own fields.

example:
Code:
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&

We have:
Code:
pattern[1] = [ABC]
pattern[2] = [D]
pattern[3] = [E]
pattern[4] = [Fghi]
pattern[5] = [ABC]
pattern[6] = [D]

What I am trying to get is:
Code:
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi

The order of the multiple-letter patterns within their field doesn't matter.
The order of the single-letter patterns doesn't matter either.
But it would be useful to have the single-letter patterns before the multiple-letter patterns, as above.

For clarity I used "|" as the output separator, but I could change it to spaces as in my original post.

Last edited by beca123456; 12-27-2015 at 01:41 AM..
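
One way the counting and formatting could be approached, offered as a sketch and not as a tested solution for real data: walk the record with match(), count each pattern in an associative array, and remember the order in which single-letter and multiple-letter patterns first appear. The function name count_patterns and the array names are invented here. It assumes, as stated earlier in the thread, that every number is ≥2 and is followed by at least that many letters, so a one-letter key can never collide with a prefixed pattern.

```shell
# Hypothetical helper: append "count:pattern" fields to each input line.
count_patterns() {
  LC_ALL=C awk '{
    rest = $0
    ns = nm = 0                 # distinct single-letter / multiple-letter patterns
    split("", cnt)              # empty the counters for each new record
    # Find the next length prefix or the next lone letter.
    while (match(rest, /[0-9]+|[A-Za-z]/)) {
      tok = substr(rest, RSTART, RLENGTH)
      if (tok ~ /^[0-9]/) {                          # a length prefix
        len = tok + 0
        pat = substr(rest, RSTART + RLENGTH, len)
        rest = substr(rest, RSTART + RLENGTH + len)
        if (!(pat in cnt)) multi[++nm] = pat
      } else {                                       # a lone letter
        pat = tok
        rest = substr(rest, RSTART + RLENGTH)
        if (!(pat in cnt)) single[++ns] = pat
      }
      cnt[pat]++
    }
    out = $0
    # Single-letter patterns first, one field each.
    for (k = 1; k <= ns; k++) out = out "|" cnt[single[k]] ":" single[k]
    # Then all multiple-letter patterns in one field, joined by "; ".
    sep = "|"
    for (k = 1; k <= nm; k++) { out = out sep cnt[multi[k]] ":" multi[k]; sep = "; " }
    print out
  }'
}

printf '%s\n' '!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&' | count_patterns
# prints:
# !@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi
```

Patterns are listed in order of first appearance. A record with no single-letter patterns gets no single-letter fields, and a record with no prefixed patterns gets no final field; whether empty fields should be printed in those cases is left open.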