Number of matches and matched pattern(s) in awk


 
# 8  
Old 12-27-2015
Code:
cat beca123456.input
!@#$%2QW5QWERTAB$%^&*
!@#$%4AvDf2QWER
3BHuI4RtYU2vGP
$%$6ABcdEf)-2yg*%/LK@~~()

Code:
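awk '
{
    # Walk the line one character at a time.
    for(i=1;i<=length($0);i++){
        ch = substr($0, i, 1)
        if(ch ~ /[0-9]/){
            # A digit gives the length of the pattern that follows it
            # (single-digit lengths only).
            pat = substr($0, i+1, ch)
            multi[pat]++
            i += ch    # skip over the pattern just counted
        }
        else if(ch ~ /[a-zA-Z]/){
            # A letter outside any counted pattern is a single-character match.
            single[ch]++
        }
    }
}

{
    printf "%s", $0
    # for (s in array) visits the elements in an unspecified order.
    for (s in single){
       printf "|%d:%s", single[s], s
       delete single[s]
    }

    # Join the multi-character patterns with "; " after a leading "|".
    for (s in multi){
      m = (m == "" ? "|" : m "; ") multi[s] ":" s
      delete multi[s]
    }
    print m
    m = ""

}' beca123456.input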

Code:
!@#$%2QW5QWERTAB$%^&*|1:A|1:B|1:QWERT; 1:QW
!@#$%4AvDf2QWER|1:R|1:E|1:AvDf; 1:QW
3BHuI4RtYU2vGP|1:P|1:I|1:RtYU; 1:vG; 1:BHu
$%$6ABcdEf)-2yg*%/LK@~~()|1:K|1:L|1:yg; 1:ABcdEf

I am confused about what you want to do with the "|"; at this point I do not know whether you want it in the actual output or not. However, this matches your example:
Quote:
What I am trying to get is:

Code:
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&|2:D|1:E|2:ABC; 1:Fghi

# 9  
Old 12-27-2015
Here is an alternative approach that will work with any standards-conforming version of awk. (Note that the standards say the behavior is unspecified if FS, or the ERE used in split(), is an empty string.)
Code:
awk '
BEGIN {	printf("String   #_of_occurrences_of__pattern:pattern...\n")
}
{	printf("%s", left = $0)
	while(match(left, /[[:alnum:]]+/)) {
		# Throw away leading non-digit, non-alpha characters.
		if(RSTART > 1)
			left = substr(left, RSTART)
		if((num = left + 0) > 0) {
			# We have a string starting with a leading digit string.
			p = substr(left, len = length(num) + 1, num)
			left = substr(left, len + num)
			if(p in mcnt) {
				# We have seen this pattern before.
				mcnt[p]++
			} else {# We have not seen this pattern before.
				mcnt[mplist[++nmp] = p] = 1
			}
		} else {
			# We have a single alphabetic character string.
			p = substr(left, 1, 1)
			left = substr(left, 2)
			if(p in scnt) {
				# We have seen this pattern before.
				scnt[p]++
			} else {# We have not seen this pattern before.
				scnt[splist[++nsp] = p] = 1
			}
		}
	}
	# Print the results for this input line.
	# Print single character patterns.
	for(i = 1; i <= nsp; i++) {
		printf("   %d:%s", scnt[splist[i]], splist[i])
		delete scnt[splist[i]]
		delete splist[i]
	}
	# Print multiple character patterns.
	for(i = 1; i <= nmp; i++) {
		printf("   %d:%s", mcnt[mplist[i]], mplist[i])
		delete mcnt[mplist[i]]
		delete mplist[i]
	}
	print ""
	nmp = nsp = 0
}' file

If file contains:
Code:
!@#$%2QW5QWERTAB$%^&*
!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*
!@#$%2QW5QWERTAB$%^&*2QW5QWERTABAB
12ABCDEFGHIJKLMNABC
12ABCDEFGHIJKLMNABC#12ABCDEFGHIJKLMNDEF
~!@#$%^&*()_+
Aa@52ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#aA
AAAAAAAAAAAAAAAAAAAAAAAaaaaaabbbbbbAAAAAAAAAAAAAAAAAA
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&

it produces the output:
Code:
String   #_of_occurrences_of__pattern:pattern...
!@#$%2QW5QWERTAB$%^&*   1:A   1:B   1:QW   1:QWERT
!|@|#|$|%|2QW|5QWERT|A|B|$|%|^|&|*   1:A   1:B   1:QW   1:QWERT
!@#$%2QW5QWERTAB$%^&*2QW5QWERTABAB   3:A   3:B   2:QW   2:QWERT
12ABCDEFGHIJKLMNABC   1:M   1:N   1:A   1:B   1:C   1:ABCDEFGHIJKL
12ABCDEFGHIJKLMNABC#12ABCDEFGHIJKLMNDEF   2:M   2:N   1:A   1:B   1:C   1:D   1:E   1:F   2:ABCDEFGHIJKL
~!@#$%^&*()_+
Aa@52ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#aA   2:A   2:a   1:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
AAAAAAAAAAAAAAAAAAAAAAAaaaaaabbbbbbAAAAAAAAAAAAAAAAAA   41:A   6:a   6:b
!@#$%3ABC$%DE$%4Fghi3ABC^&*D$%^&   2:D   1:E   2:ABC   1:Fghi

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (I don't think nawk knows how to handle the character class expression [[:alnum:]].)
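
For an awk that lacks POSIX character classes, one possible workaround is to spell the class out as an explicit bracket expression. This is only a sketch and has not been tried on Solaris nawk here; it assumes the C locale, where [0-9A-Za-z] should cover the same characters as [[:alnum:]]. The function name first_alnum_run is made up for the illustration:

```shell
# Print the first run of letters and digits in a string, without [[:alnum:]].
first_alnum_run() {
    printf '%s\n' "$1" | awk '{
        # Explicit ranges stand in for the [[:alnum:]] character class.
        if (match($0, /[0-9A-Za-z]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

first_alnum_run '!@#$%2QW5QWERTAB$%^&*'    # prints 2QW5QWERTAB
```

The same substitution could be made in the match() call of the script above.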

PS: Note that this script uses a consistent field separator of three <space> characters instead of a mixture of pipe symbols and semicolons.

Last edited by Don Cragun; 12-27-2015 at 03:18 AM.. Reason: Add PS. & fix auto spell correct induced typo: s/album/alnum/
# 10  
Old 12-27-2015
Hi Don, it works fine with /usr/xpg4/bin/awk on Solaris 10; I just tested it. Indeed, nawk cannot handle POSIX character classes.

--
EDIT: misread Don's post as a request instead of a suggestion..

Last edited by Scrutinizer; 12-27-2015 at 03:07 AM..
# 11  
Old 12-27-2015
Different approach:
Code:
awk '
        {printf "%s", $0
         gsub (/[^A-Za-z0-9]/, "")
         n = split ($0, DIG, "[A-Za-z]*")
         m = split ($0, CHR, "[0-9]*")
         S = CHR[1]
         B = 1 + !(DIG[1])
         for (i=B; i<n; i++)    {IX = i - B + 2
                                 TMP = substr (CHR[IX], 1, DIG[i])
                                 PAT[TMP]++
                                 sub ("^" TMP, _, CHR[IX])
                                 S = S CHR[IX]
                                }
         for (i=split (S, T, ""); i>0; i--) SGL[T[i]]++
         for (p in PAT) printf "\t%d:%s", PAT[p], p
         for (s in SGL) printf "\t%d:%s", SGL[s], s
         printf "\n"
         S = ""
         delete PAT
         delete SGL
        }
' file
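
Note that this script calls split (S, T, "") with an empty separator, which is the case post #9 describes as unspecified by the standards. A portable way to visit each character is a substr() loop. The following is only a sketch of that idea, not a drop-in replacement for the script above; the function name count_chars and the arrays cnt and order are invented for the example:

```shell
# Count each character of a string using substr() instead of split(s, a, "").
count_chars() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            # Remember first-seen order, since for (c in cnt) order is unspecified.
            if (!(c in cnt))
                order[++k] = c
            cnt[c]++
        }
        for (j = 1; j <= k; j++)
            printf "%s%d:%s", (j > 1 ? " " : ""), cnt[order[j]], order[j]
        print ""
    }'
}

count_chars 'ABAB'    # prints 2:A 2:B
```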

# 12  
Old 12-27-2015
Thanks Don Cragun for the script and the explanations ! It works perfectly.

Thanks RudiC ! However, could you tell me what the following line means? I have never seen this expression before:
Code:
B = 1 + !(DIG[1])

Does it mean 'B = index of CHR starting after same DIG index'?
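
As far as the awk language goes, !x evaluates to 1 when x is an empty string or zero and to 0 otherwise, so the expression yields 2 when DIG[1] is empty and 1 when it holds a number. Presumably DIG[1] is empty when the cleaned-up line starts with a letter rather than a digit, which would shift the starting index by one. A small sketch of just the arithmetic (the function name b_offset and the variable first are hypothetical; -v is used because, like split() elements, its value is treated as a number when it looks like one):

```shell
# Show what 1 + !(x) evaluates to for an empty and a numeric-looking value.
b_offset() {
    awk -v first="$1" 'BEGIN { print 1 + !(first) }'
}

b_offset ''     # prints 2: empty string, so !(first) is 1
b_offset '3'    # prints 1: non-zero number, so !(first) is 0
```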

Thanks Aia ! The code is easier to understand, but it only works when the number has a single digit.
example:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*

returns:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   10:t   1:Q   1:W   1:2; 1:QWERT

instead of:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   1:QWtttttttttt   1:QWERT
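
The difference comes from reading the length as one character instead of as the whole run of digits. A minimal sketch of reading a multi-digit length with match() follows; it only handles the first number on the line, and the function name read_count is made up for the illustration:

```shell
# Print "<length>:<pattern>" for the first digit run found in a string.
read_count() {
    printf '%s\n' "$1" | awk '{
        # match() sets RSTART and RLENGTH to cover the whole run of digits.
        if (match($0, /[0-9]+/)) {
            n = substr($0, RSTART, RLENGTH) + 0
            print n ":" substr($0, RSTART + RLENGTH, n)
        }
    }'
}

read_count '!@#$%12QWtttttttttt5QWERTAB$%^&*'    # prints 12:QWtttttttttt
```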


Last edited by beca123456; 12-27-2015 at 10:54 PM.. Reason: My last request should have been posted independently of this thread. It has been removed (and solved anyway)
# 13  
Old 12-27-2015
Quote:
Originally Posted by beca123456
Thanks Aia ! The code is easier to understand, but it only works when the number has a single digit.
example:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*

returns:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   10:t   1:Q   1:W   1:2; 1:QWERT

instead of:
Code:
!@#$%12QWtttttttttt5QWERTAB$%^&*   1:A   1:B   1:QWtttttttttt   1:QWERT

Sorry, I missed that in your request.

Code:
awk '
{
    for(i=1;i<=length($0);i++){
        ch = substr($0, i, 1)
        if(ch ~ /[0-9]/){
            # Collect every consecutive digit so multi-digit lengths work.
            d = ""
            while(ch ~ /[0-9]/){
               d = d ch
               ch = substr($0, ++i, 1)
            }
            # i now points at the first character after the digits.
            pat = substr($0, i, d)
            multi[pat]++
            i += (d-1)
        }
        else if(ch ~ /[a-zA-Z]/){
            single[ch]++
        }
    }
}
{
    printf "%s", $0
    for (s in single){
       printf " %d:%s", single[s], s
       delete single[s]
    }

    for (s in multi){
      m = (m == "" ? " " : m "; ") multi[s] ":" s
      delete multi[s]
    }
    print m
    m = ""
}' beca123456.file


Last edited by Aia; 12-27-2015 at 05:15 PM.. Reason: Highlighted the changes
# 14  
Old 12-27-2015
Thanks Aia. The new version works great !

Thanks Don Cragun and RudiC as well!

Last edited by beca123456; 12-27-2015 at 10:53 PM..