Egrep patterns in a file and limit number of matches to print for each pattern match


 
# 8  
Old 06-10-2017
Quote:
Originally Posted by RudiC
How about
Code:
awk 'match ($0, PAT) && ++T[substr($0, RSTART, RLENGTH)]<4' PAT="abc|bcd|cde" file2
abc1
bcd1
abc2
abc3
bcd2
bcd3
cde1
cde2
cde3

This is very limited: it works in this contrived case only because each matched string is exactly one of the literal alternatives in the regex. The purpose of a regex, however, is to represent patterns, not exact literals. To give it a chance, each regex must be tracked on its own, and each line must be checked once per regex.

---------- Post updated at 02:23 PM ---------- Previous update was at 12:14 PM ----------

In case a visual aid is necessary:
example.file
Code:
abc1
Abc5
bcd1
abc2
abc3
abc4
Abc6
bcd2
bcd3
cde1
cde2
bcd4
cde3

You want three instances for each match. With the previously presented suggestion:
Code:
awk 'match ($0, PAT) && ++T[substr($0, RSTART, RLENGTH)]<4' PAT="[Aa]bc|bcd|cde" example.file | sort

Code:
abc1
abc2
abc3
Abc5
Abc6
bcd1
bcd2
bcd3
cde1
cde2
cde3
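
The five `[Aa]bc` lines come from how the counter is keyed: `T[]` is indexed by the matched text, so `abc` and `Abc` get separate counts even though a single pattern produced both. A minimal sketch that prints the key for each line (the function name `show_keys` is made up for illustration, the pattern is passed with `-v` for convenience, and `awk` is assumed to be any POSIX awk):

```shell
# Print each matching input line together with the counter key that
# substr($0, RSTART, RLENGTH) would produce for it.
show_keys() {
	awk -v pat="$1" 'match($0, pat) { print $0, "-> key", substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' abc1 Abc5 abc2 | show_keys '[Aa]bc'
```

This prints `abc1 -> key abc`, `Abc5 -> key Abc` and `abc2 -> key abc`: two different keys, hence two independent limits of three.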

Instead, counting matches per pattern gives a more realistic "three matches of a pattern":
Code:
awk '/[Aa]bc/ && a++ < 3; /bcd/ && b++ < 3; /cde/ && c++ <3' example.file | sort

Code:
abc1
abc2
Abc5
bcd1
bcd2
bcd3
cde1
cde2
cde3
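
One thing to watch with the semicolon-separated form: each pattern is its own rule, so a line that matches two patterns (say `abcde`) is printed once per matching rule. A possible variant, sketched here and not taken from the thread, sets a flag instead and prints each line at most once (the name `first3_each` is hypothetical):

```shell
first3_each() {
	awk '
	{ p = 0 }                        # reset the print flag for every line
	/[Aa]bc/ && a++ < 3 { p = 1 }    # one counter per pattern, as before
	/bcd/    && b++ < 3 { p = 1 }
	/cde/    && c++ < 3 { p = 1 }
	p                                # print the line once if any rule fired
	' "$@"
}

printf '%s\n' abcde abc1 abc2 abc3 | first3_each
```

Here `abcde` uses up one match for each of the three patterns but appears only once in the output, followed by `abc1` and `abc2`; `abc3` is dropped because `[Aa]bc` has already matched three lines.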

This User Gave Thanks to Aia For This Post:
# 9  
Old 06-10-2017
If you want to print lines from a file that match any of the EREs stored one per line in a file named ERES, but stop using an ERE after it has matched 3 (or whatever value you assign to the maxp variable) lines, you could try:
Code:
/usr/xpg4/bin/awk -v maxp=3 '
FNR == NR {
	eres[$0]
	neres++
	next
}
{	for(ere in eres)
		if($0 ~ ere) {
			#printf("L%d matched %s: %s\n", FNR, ere, $0)
			p = 1
			if(++eres[ere] == maxp) {
				neres--
				delete eres[ere]
				#printf("%s maxed out: left: %d\n", ere, neres)
			}
		}
}
p {	print
	p = 0
}
!neres {exit
}' ERES file.txt

If the file ERES contains:
Code:
abc
bcd
cde

and file.txt contains the sample data you provided in post #4, the above code produces the output:
Code:
abc1
bcd1
abc2
abc3
bcd2
bcd3
cde1
cde2
cde3

Note that if an input line is matched by more than one of your EREs, that line will only be printed once but each matching ERE's match count will be incremented. And, the output will be in the order of the input text file; i.e., output will not be grouped by matching ERE as in your sample output in post #4. Note also that this code will stop reading file.txt as soon as all given EREs have been matched maxp times; with large input files this could save a lot of I/O.
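
To make the first note concrete, here is the same logic wrapped in a shell function and run on a small made-up input. This is only a sketch: `limit_matches` and the temporary file names are hypothetical, and plain `awk` is assumed in place of `/usr/xpg4/bin/awk` (the Solaris path).

```shell
limit_matches() {	# usage: limit_matches ERE_file text_file [maxp]
	awk -v maxp="${3:-3}" '
	FNR == NR { eres[$0]; neres++; next }	# 1st file: load the EREs
	{	for(ere in eres)
			if($0 ~ ere) {
				p = 1
				if(++eres[ere] == maxp) {	# this ERE is used up
					neres--
					delete eres[ere]
				}
			}
	}
	p { print; p = 0 }
	!neres { exit }' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' abc bc > "$dir/ERES"
printf '%s\n' abc1 abc2 abc3 abc4 bc5 > "$dir/file.txt"
limit_matches "$dir/ERES" "$dir/file.txt"
rm -r "$dir"
```

The output is `abc1`, `abc2`, `abc3`. Each of those lines matches both `abc` and `bc`, so both counters reach 3 on the third line, the script exits, and `bc5` is never printed.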

Note that you can still use alternation in your EREs, but the limit on the number of times an ERE will be matched still applies. If the ERES file just contains your original ERE:
Code:
abc|bcd|cde

the output produced would just be:
Code:
abc1
bcd1
abc2

This User Gave Thanks to Don Cragun For This Post:
# 10  
Old 06-12-2017
Hi

It works fine. If you find time, please explain it. What about case insensitivity for the patterns? How can it be incorporated into this code?
# 11  
Old 06-13-2017
Here is a version of that awk script wrapped in a Korn shell script that adds command line options to specify the number of matches for each ERE, a simplified case insensitive ERE option, an option to name an alternative pathname to the file containing your EREs, and an option to print some debugging information while the awk script is running:
Code:
#!/bin/ksh
IAm=${0##*/}
# Set default parameter values.
ci=0		# Case insensitive ERE matching: 0=no, 1=yes.
debug=0		# Print debugging information while running awk script.
EREfile=ERES	# Pathname of file containing EREs to be processed.
maxp=3		# Maximum # of matches to be printed for each ERE.

# Function to print usage message and exit.
Usage() {
	printf 'Usage: %s: [-di] [-e file] [-m max]
	-d	Enable debugging statements in awk script.
	-e file	Pathname of file containing EREs to process (default: "ERES").
	-i	Perform case insensitive ERE matching.
	-m max	Maximum number of times to print matches for each ERE in the
		ERE file (default: 3).  Setting max to 0 allows unlimited
		matches.\n' "$IAm" >&2
	exit 2
}

# Parse command line options to override defaults.
while getopts de:im: opt
do	case "$opt" in
	(d)	debug=1;;
	(e)	EREfile="$OPTARG";;
	(i)	ci=1;;
	(m)	maxp="$OPTARG";;
	(?)	Usage;;
	esac
done
shift $(($OPTIND - 1))
if [ $# -gt 0 ]
then	Usage
fi

/usr/xpg4/bin/awk -v maxp="$maxp" -v caseinsensitive=$ci -v debug=$debug '
BEGIN {	# Before reading any lines from either input file, define the following
	# variables for later use in this script.
	fmt[0] = "[%s%s]"	# Use this format to add upper and lowercase
				# characters to the case insensitive ERE when
				# we are not in a bracket expression.
	fmt[1] = "%s%s"		# Use this format to add upper and lowercase
				# characters to the case insensitive ERE when
				# we are in a bracket expression.
}
FNR == NR {
	# This clause is performed only for lines read from the 1st input file
	# (where the number of lines read from this file (FNR) is equal to the
	# number of lines read from all input files (NR)).  The 1st input file
	# contains the EREs to be processed on this invocation.

	# Are we performing case insensitive searches?
	if(caseinsensitive) {
		# Yes.  Convert any alphabetic characters found outside a
		# bracket expression or inside the 1st level of brackets in a
		# bracket expression to include both the uppercase and the
		# lowercase versions of that character.  This makes several
		# assumptions that might or might not be true in your
		# environment:
		# 1  There are no square brackets in your ERE except the
		#    opening and closing "[" and "]" in a bracket expression or
		#    in a collating symbol, equivalence class, or character
		#    class expression; except for backslash escaped square
		#    brackets outside of a bracket expression.
		# 2  Alphabetic characters inside a collating symbol,
		#    equivalence class, or character class expression should
		#    not be modified.
		# 3  A character class expression (e.g. [[:lower:]]) should
		#    not be modified to also match characters in another case.
		# 4  Any character following a backslash outside of a bracket
		#    expression should not be modified.
		# 5  There are no range expressions in a bracket expression
		#    where either endpoint is an alphabetic character.
		for(i = 1; i <= length($0); i++) {
			if((c = substr($0, i, 1)) == "\\" && bc == 0)
				# We have a backslash outside of a bracket
				# expression.  Pass this character and the next
				# through unchanged.
				c = c substr($0, ++i, 1) 
			else if(c == "[")
				# We have an opening square bracket.  Increment
				# the bracketing count.
				bc++
			else if(c == "]")
				# We have a closing square bracket.  Decrement
				# the bracketing count.
				bc--
			else if(c ~ /[[:alpha:]]/ && bc < 2)
				# We have an alphabetic character that is not
				# in a collating symbol, equivalence class, or
				# character class expression.  If we are in a
				# bracket expression replace it by its uppercase
				# and lowercase versions.  If we are not in a
				# bracket expression replace it with a bracket
				# expression containing its uppercase and
				# lowercase versions.
				c = sprintf(fmt[bc==1], toupper(c), tolower(c))
			# Add the character(s) converted to the output ERE.
			ere = ere c
		}
		if(debug)
			printf("D:ERE \"%s\" replaced with \"%s\".\n", $0, ere)
		# Set the input string to the modified versions and clear the
		# modified version in preparation for the next input line.
		$0 = ere
		ere = ""
	}

	# Add the original or modified ERE to the array of active EREs and
	# increment the number of active EREs.
	eres[$0]
	neres++
	if(debug)
		printf("D:%d active EREs: Added \"%s\".\n", neres, $0)

	# Skip the remaining steps in this script and read the next line from an
	# input file.
	next
}
{	# This clause processes lines read from the 2nd input file.

	# For each remaining active ERE in the array...
	for(ere in eres)
		# see if the current input line matches this ERE...
		if($0 ~ ere) {
			# it does.
			if(debug)
				printf("D:Line %d matched %s: %s\n",
				    FNR, ere, $0)
			# Set the flag to print this line.
			p = 1
			# Check to see if this ERE has matched maxp lines...
			if(++eres[ere] == maxp) {
				# it has.  Remove this ERE from the active
				# array and decrement the number of active EREs.
				delete eres[ere]
				neres--
				if(debug)
					printf("D:%s max hit: EREs left: %d\n",
					    ere, neres)
			}
		}
}
p {	# Print lines that were matched by at least one active ERE and clear the
	# print flag for the next line.
	print
	p = 0
}
!neres {# If the number of remaining active EREs is zero, we are done; exit
	# instead of continuing to read the remaining lines from the 2nd file.
	exit
}' "$EREfile" file.txt

Although written and tested using a Korn shell, it should work with any POSIX conforming shell. The limitations on the case insensitive ERE processing are detailed in the comments in the awk script.
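
To get a feel for what the `-i` conversion produces, the loop can be lifted out and run on a few EREs by itself. The following is a sketch adapted from the script above (the function name `ci_ere` is made up, plain `awk` with POSIX character class support is assumed, and the same five limitations apply):

```shell
ci_ere() {	# read EREs on stdin, write the case insensitive versions
	awk '
	BEGIN {	fmt[0] = "[%s%s]"; fmt[1] = "%s%s" }
	{	ere = ""; bc = 0
		for(i = 1; i <= length($0); i++) {
			if((c = substr($0, i, 1)) == "\\" && bc == 0)
				c = c substr($0, ++i, 1)	# keep escaped char as is
			else if(c == "[")
				bc++
			else if(c == "]")
				bc--
			else if(c ~ /[[:alpha:]]/ && bc < 2)
				c = sprintf(fmt[bc == 1], toupper(c), tolower(c))
			ere = ere c
		}
		print ere
	}'
}

printf '%s\n' 'abc' 'b[cd]e' '[[:digit:]]x' 'a\.b' | ci_ere
```

The four input EREs come out as `[Aa][Bb][Cc]`, `[Bb][CcDd][Ee]`, `[[:digit:]][Xx]` and `[Aa]\.[Bb]`: letters outside brackets get their own bracket expression, letters inside a bracket expression are doubled in place, and the class name and the escaped dot are left alone.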

Does this help?
This User Gave Thanks to Don Cragun For This Post: