Number of matches and matched pattern(s) in awk



The string above contains no field separators (i.e. FS="").
For clarity's sake, the string could be rewritten with "|" as FS, as follows:

Here, I am only interested in the patterns containing capital letters (how many there are varies between records), i.e.:

Note that a pattern with more than one capital letter is preceded by a number equal to the length of the pattern.

The output I am trying to obtain is:
String   # of A   #of B   # of longer patterns
!@#$%2QW5QWERTABCD$%^&*   1   1   1:QW; 1:QWERT

What I tried so far:
awk 'BEGIN{OFS="   "; print "String   # of A   #of B   # of longer patterns"}
   # number of 'A'
   #number of 'B'
   # extract long pattern #1
   match(string, /[0-9]+/)
   length_pattern_1=substr(string, RSTART,RLENGTH)
   pattern_1=substr(string, RSTART+1, length_pattern_1)

   # extract long_pattern #2 (stuck here)
   Is there a way to skip the first digit match?  
   If I use 'split(string, b, "[0-9]+")' I could use a for loop through the different indexes of array b, but I will lose the pattern length.

   # count the number of same pattern
   Since I cannot use 'split' I don't see how I could iterate the count through the different motifs 

   print string, num_A, num_B, num_pattern_1":"pattern_1";"num_pattern_2":"pattern_2"; "num_pattern_X":"pattern-X

$ cat f36.awk
BEGIN {ind=0; str=""}
{
    # An empty separator splits the record into single characters
    # (a common extension, e.g. in gawk; POSIX leaves this unspecified).
    n = split($0, a, "");
    i = 1;
    while (i <= n) {
        if (a[i] >= 1 && a[i] <= 9) {
            # A digit: keep it together with the a[i] characters that follow.
            str = a[i];
            for (j=i+1; j<=i+a[i]; j++) {
                str = str""a[j];
            }
            ind++;
            pattern[ind] = str;
            str = "";
            i = i + a[i] + 1;
        } else if (a[i] >= "A" && a[i] <= "Z") {
            # A lone capital letter is a pattern by itself.
            ind++;
            pattern[ind] = a[i];
            i++;
        } else {
            i++;
        }
    }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [5QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36.awk
pattern[1] = [2QW]
pattern[2] = [7QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [3PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [O]

Are the numbers specifying the length of a pattern limited to a single digit?
  1. Does the string 12XYCCCCCCCCCCD contain 2 patterns: 12XYCCCCCCCCCC and D (each occurring once)? Or does it contain the pattern 2XY occurring once, the pattern C occurring 10 times, and the pattern D occurring once?
  2. What happens if a digit is followed by fewer uppercase letters than are specified by that digit?
    In the string 3D@C, is there one pattern (3D@C) or two patterns (D and C)?
  3. Your code explicitly counts occurrences of A and B separately from counting patterns that might contain them. Is that what you want?
    Should the string 4AABB just report one occurrence of the pattern 4AABB? Or, should it report the pattern 4AABB occurring once, two occurrences of the pattern A, and two occurrences of the pattern B?
  4. Is the digit 1 special?
    Should the string 1XX be treated as one occurrence of the pattern 1X and one occurrence of the pattern X? Or, should it be treated as two occurrences of the pattern X?

Addendum to Don's question # 2:

What happens if a digit is followed by a mix of uppercase and lowercase letters? E.g. in the string "3AbCD", is the pattern "3ACD" or something else?
Sorry for the lack of clarity.

Are the numbers specifying the length of a pattern limited to a single digit?
1. Does the string 12XYCCCCCCCCCCD contain 2 patterns: 12XYCCCCCCCCCC and D (each occurring once)? Or does it contain the pattern 2XY occurring once, the pattern C occurring 10 times, and the pattern D occurring once?
No. The number specifying the length of the pattern is always ≥2.
In the example Don Cragun mentioned ('12XYCCCCCCCCCCD'), there are 2 patterns: '12XYCCCCCCCCCC' and 'D'.
'12' should be read as the number 12, not as the single digits '1' and '2'.

2. What happens if a digit is followed by fewer uppercase letters than are specified by that digit?
In the string 3D@C , is there one pattern ( 3D@C ) or two patterns ( D and C )?
It cannot happen. A number is always followed by uppercase or lowercase letters only (no symbols or any other characters). The run of letters that follows a number is always at least as long as that number.

Your code explicitly counts occurrences of A and B separately from counting patterns that might contain them. Is that what you want?
Should the string 4AABB just report one occurrence of the pattern 4AABB ? Or, should it report the pattern 4AABB occurring once, two occurrences of the pattern A , and two occurrences of the pattern B ?
Correct. '4AABB' is one pattern only, 'AABB', as defined by the number '4'.

Is the digit 1 special?
Should the string 1XX be treated as one occurrence of the pattern 1X and one occurrence of the pattern X ? Or, should it be treated as two occurrences of the pattern X ?
The number '1' never occurs in the string. The only numbers present in the string are always ≥2 (e.g. 2, 34, 2000...).

What happens if a digit is followed by a mix of uppercase and lowercase letters? E.g. in the string "3AbCD", is the pattern "3ACD" or something else?
If '3AbCD' occurs, then we have 2 patterns: 'AbC' and 'D'.
A number X is always followed by letters that form an X-long pattern. The case doesn't matter as long as the characters are letters.
If a letter occurs directly after the X-long pattern, it is considered a pattern by itself.

example 1:

There are 4 patterns ('AvDf', 'QW', 'E' and 'R')

example 2:

There are 5 patterns ('BHu', 'I', 'RtYU', 'vG' and 'P')

example 3:

There are 4 patterns ('ABcdEf', 'yg', 'L' and 'K')
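
The rules above also suggest a way around the "skip the first digit match" problem from the original post: call match() in a loop and cut the consumed part off the front of the string after each hit. This is only a minimal sketch under the rules as stated in this thread (numbers are always ≥2 and are always followed by at least that many letters); the function name list_patterns is made up for illustration.

```shell
# Hypothetical helper: print every pattern of each input record, one per line.
list_patterns() {
    awk '{
        s = $0
        # Find the next number or letter, whichever comes first.
        while (match(s, /[0-9]+|[A-Za-z]/)) {
            tok = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)   # drop everything up to and including the match
            if (tok ~ /^[0-9]/) {             # a number: the next tok characters are the pattern
                len = tok + 0
                print substr(s, 1, len)
                s = substr(s, len + 1)
            } else {                          # a lone letter is a pattern by itself
                print tok
            }
        }
    }'
}

printf '%s\n' '3BHuI4RtYU2vGP' | list_patterns
# prints BHu, I, RtYU, vG and P, one per line
```

Because the number is matched with [0-9]+, a multi-digit length such as the 12 in 12XYCCCCCCCCCCD is read as one number, so that string yields XYCCCCCCCCCC and D.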

$ cat f36_v1.awk
BEGIN {ind=0; str=""}
{
    # An empty separator splits the record into single characters
    # (a common extension, e.g. in gawk; POSIX leaves this unspecified).
    n = split($0, a, "");
    i = 1;
    while (i <= n) {
        if (a[i] >= 1 && a[i] <= 9) {
            # Collect all consecutive digits so that "12" is read as the number 12.
            while (1) {
                str = str""a[i];
                # Stop at the end of the record or at the first non-digit.
                if (i >= n || a[i+1] !~ /^[0-9]$/) { break; }
                else { i++; }
            }
            len = str;
            str = "";
            # The next len characters form the pattern; the number itself is dropped.
            for (j=i+1; j<=i+len; j++) {
                str = str""a[j];
            }
            ind++;
            pattern[ind] = str;
            str = "";
            i += len + 1;
        } else if ((a[i] >= "a" && a[i] <= "z") || (a[i] >= "A" && a[i] <= "Z")) {
            # A lone letter (either case) is a pattern by itself.
            ind++;
            pattern[ind] = a[i];
            i++;
        } else {
            i++;
        }
    }
}
END {
    for (k=1; k<=ind; k++) { printf("pattern[%d] = [%s]\n", k, pattern[k]) }
}
$ echo "\!@#$%2QW5QWERTAB$%^&*" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERT]
pattern[3] = [A]
pattern[4] = [B]
$ echo "\!@#$%2QW7QWERTABXY3PQR$%^Z&*LMn#O" | awk -f f36_v1.awk
pattern[1] = [QW]
pattern[2] = [QWERTAB]
pattern[3] = [X]
pattern[4] = [Y]
pattern[5] = [PQR]
pattern[6] = [Z]
pattern[7] = [L]
pattern[8] = [M]
pattern[9] = [n]
pattern[10] = [O]
$ echo "\$%\$6ABcdEf)-2yg*%/LK@~~()" | awk -f f36_v1.awk
pattern[1] = [ABcdEf]
pattern[2] = [yg]
pattern[3] = [L]
pattern[4] = [K]
$ echo "3BHuI4RtYU2vGP" | awk -f f36_v1.awk
pattern[1] = [BHu]
pattern[2] = [I]
pattern[3] = [RtYU]
pattern[4] = [vG]
pattern[5] = [P]
$ echo "\!@#$%4AvDf2QWER" | awk -f f36_v1.awk
pattern[1] = [AvDf]
pattern[2] = [QW]
pattern[3] = [E]
pattern[4] = [R]

Thanks durden_tyler, it helps a lot!

Now I have to work on the output format described in my original post, i.e. counting the number of occurrences of each pattern as follows (multiple-letter patterns in the same field, separated by "; ", and single-letter patterns in one field):


We have:
pattern[1] = [ABC]
pattern[2] = [D]
pattern[3] = [E]
pattern[4] = [Fghi]
pattern[5] = [ABC]
pattern[6] = [D]

What I am trying to get is:
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi

The order of the multiple-letter patterns within the field doesn't matter.
The order of the single-letter patterns doesn't matter either.
But it would be useful to have the single-letter patterns before the multiple-letter patterns, as above.

For clarity I used "|" as the field separator, but I could change it to " " as in my original post.
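
One possible way to get that layout, offered only as a sketch: count each pattern in an array while scanning the record, and remember the order in which new patterns first appear so the output is stable. It follows the sample output line literally, so every single-letter count gets its own "|" field and all multiple-letter counts share the last field, joined by "; ". The function name count_patterns and the array names cnt, single and multi are made up for illustration, and the rules are the ones given earlier in the thread (numbers ≥2, always followed by enough letters).

```shell
# Hypothetical helper: append "count:pattern" fields to each input record.
count_patterns() {
    awk '{
        s = $0
        n1 = n2 = 0                 # distinct single-letter / multiple-letter patterns seen
        split("", cnt)              # portable way to empty the counts for each record
        while (match(s, /[0-9]+|[A-Za-z]/)) {
            tok = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
            if (tok ~ /^[0-9]/) {   # a number: the next tok characters are the pattern
                len = tok + 0
                tok = substr(s, 1, len)
                s = substr(s, len + 1)
                if (!(tok in cnt)) multi[++n2] = tok
            } else if (!(tok in cnt)) {
                single[++n1] = tok
            }
            cnt[tok]++
        }
        out = $0
        # Single-letter patterns first, one field each, in order of first appearance.
        for (i = 1; i <= n1; i++) out = out "|" cnt[single[i]] ":" single[i]
        # Multiple-letter patterns share one field, separated by "; ".
        field = ""
        for (i = 1; i <= n2; i++) field = field (i > 1 ? "; " : "") cnt[multi[i]] ":" multi[i]
        print out "|" field
    }'
}

printf '%s\n' '!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&' | count_patterns
# prints: !@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi
```

A record without multiple-letter patterns ends in an empty last field with this sketch. Switching to a space-separated layout would only mean replacing the "|" literals.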

