Egrep find word that occurs twice in a row

10-17-2017


Hi there, I am trying to figure out and understand what the syntax would be to egrep lines that have a word occur twice in a row. The two words should obviously have a space between them, and the match has to be case sensitive, which I believe grep is by default. The closest I have come is:

Code:

grep '/.*\|.*\|/'

although I am kind of unsure what exactly I am saying here. Any
help would be much appreciated. I am not looking for just an
answer but an understanding of what the answer is doing as well

cheers

spo


spo_2138


10-17-2017


Code:

egrep "pattern.*pattern" filename

Code:

awk '{n=gsub("pattern","&")}n>1' filename

Yoda


10-18-2017


Try using a regular expression back-reference, \<num> (for example \1), which matches again whatever the corresponding parenthesized subexpression matched.

Code:

grep '\(pattern\).*\1'

Word boundaries can also be useful, and your grep version may support them: they are \b, or \< and \>. Have a look at regular expressions.
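To make the back-reference idea concrete, here is a minimal sketch. The sample lines are invented, and the literal text "is" stands in for pattern. Note how the second line also matches, because the "is" inside "this" counts; that is the kind of match word boundaries are meant to rule out.

```shell
# Hypothetical sample input, invented for illustration.
sample='this is is a test
this is fine
no repeats here'

# \(is\) captures the text "is"; \1 must then match that same text again.
# .* allows anything in between, so the repeat need not be adjacent.
matches=$(printf '%s\n' "$sample" | grep '\(is\).*\1')
printf '%s\n' "$matches"
```

The third line is dropped because it contains no "is" at all.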


Scrutinizer


10-19-2017


Hi, I am still having trouble getting either of these to work, and I still do not quite understand. I need it to recognize only repeated whole words, and neither suggestion is working for me. I am using, and would like to keep using, ([a-z]+) as part of the pattern. Thanks for any and all help!

spo_2138


10-19-2017


According to the standards, extended regular expressions do not have back-references; only basic regular expressions do. Therefore, with a standards-conforming version of egrep (which the standards specify as grep -E, not egrep), it is almost impossible to find a variable string that appears twice on a line.

If you use grep instead of egrep (as Scrutinizer suggested in post #3), you can use it to print lines that have a string matching the basic regular expression (AKA BRE) pattern followed by a second occurrence of the same string.

The command:

Code:

grep '/.*|.*/'

and the command:

Code:

grep '/.*\|.*/'

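will both print lines that contain a / immediately followed by any string of 0 or more characters followed by a | followed by any string of 0 or more characters followed by a / (which does not seem to in any way match what you said you're looking for). This assumes a grep that treats \| in a BRE as a literal |; the standards leave \| undefined in BREs, and GNU grep treats it as alternation, so there the second command prints every line containing a /.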

If you're looking for a string of one or more lower-case alphabetic characters (in a locale where the underlying codeset is a superset of ASCII) immediately followed by a duplicate of that same string (with nothing between them), you could get that using the grep command:

Code:

grep '\([a-z]+\)\1'

and if you wanted to find two identical adjacent words separated by a single space, where the pair starts at the beginning of a line or immediately after a space and ends at a space or at the end of the line, that would be something like:

Code:

grep -e '^\([a-z\) \1$' -e '^\([a-z\) \1 ' -e ' \([a-z\) \1 ' -e ' \([a-z\) \1$'

As noted by Scrutinizer in post #7, the above BREs are incorrect. The corrected form (assuming there is a single space character between words) is:

Code:

grep -e '^\([a-z][a-z]*\) \1$' -e '^\([a-z][a-z]*\) \1 ' -e ' \([a-z][a-z]*\) \1 ' -e ' \([a-z][a-z]*\) \1$'

In the above command the first BRE looks for two identical lower-case words alone on a line, the 2nd BRE looks for two identical words at the start of a line followed by one or more other words, the 3rd BRE looks for two identical words following one or more other words and followed by one or more other words, and the last BRE looks for two identical lower-case words at the end of a line following one or more other words.
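As a rough check of the corrected command, the sketch below feeds it one invented line for each of the four cases plus two lines that should not match. The sample text and the variable names are assumptions made for illustration only.

```shell
# Hypothetical sample: lines 1-4 exercise the four BREs in order,
# lines 5-6 should be rejected.
sample='hello hello
hello hello world
say hello hello again
say hello hello
othello hello
hello world'

doubled=$(printf '%s\n' "$sample" | grep \
    -e '^\([a-z][a-z]*\) \1$' \
    -e '^\([a-z][a-z]*\) \1 ' \
    -e ' \([a-z][a-z]*\) \1 ' \
    -e ' \([a-z][a-z]*\) \1$')
printf '%s\n' "$doubled"
```

The line "othello hello" is not printed: the captured word has to start at the beginning of the line or right after a space, so the "hello" inside "othello" cannot be used as the first copy.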

Some versions of grep do not conform to the standards unless additional parameters are specified to force standards conformance. Without knowing what operating system you're using, we have no way of knowing if this problem might affect you.


Don Cragun


10-19-2017


Quote:

Originally Posted by Don Cragun

Code:

grep '/.*|.*/'

and the command:

Code:

grep '/.*\|.*/'

will both print lines that contain a / immediately followed by any string of 0 or more characters followed by a | followed by any string of 0 or more characters followed by a / (which does not seem to in any way match what you said you're looking for).

If you're looking for a string of one or more lower-case alphabetic characters (in a locale where the underlying codeset is a superset of ASCII) immediately followed by a duplicate of that same string (with nothing between them), you could get that using the grep command:

Code:

grep '\([a-z]+\)\1'

Code:

grep -e '^\([a-z\) \1$' -e '^\([a-z\) \1 ' -e ' \([a-z\) \1 ' -e ' \([a-z\) \1$'

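For a fixed string, you should be able to accomplish this by processing the file twice: replace the first occurrence on each line with a placeholder, keep only the lines that still contain the string, then turn the placeholder back into the original text.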

Code:

sed -e 's/abcde/xxxxx/' <inputfile|grep abcde |sed -e 's/xxxxx/abcde/'
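A small sketch of the same pipeline on invented input, reading from a variable instead of inputfile. It assumes the placeholder xxxxx never occurs in the real data; if it did, the final sed would rewrite the wrong text.

```shell
# Hypothetical sample input, invented for illustration.
sample='abcde once
abcde then abcde again
none here'

twice=$(printf '%s\n' "$sample" |
    sed -e 's/abcde/xxxxx/' |   # hide the first occurrence on each line
    grep abcde |                # keep lines that still contain a second one
    sed -e 's/xxxxx/abcde/')    # restore the hidden occurrence
printf '%s\n' "$twice"
```

Only the line with two occurrences survives. This works for a fixed string only; it does not find an arbitrary repeated word.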

jgt


10-19-2017


In addition to Don's suggestion:
In a BRE, grep does not treat + as a repetition operator, so you would need to use \{1,\} instead.
In the example, a closing square bracket and a repetition operator appear to be missing, so I think it would need to be modified like so:

Code:

grep -e '^\([a-z]\{1,\}\) \1$' -e '^\([a-z]\{1,\}\) \1 ' -e ' \([a-z]\{1,\}\) \1 ' -e ' \([a-z]\{1,\}\) \1$'

Here both the sub-pattern and its back-reference sit on word boundaries: at the beginning of the line followed by a space, at the end of the line preceded by a space, or between space characters.

But without word boundary operators, it gets more complicated when the words do not have to be adjacent:

Code:

grep -e '^\([a-z]\{1,\}\) \([^ ]* \)*\1$' -e '^\([a-z]\{1,\}\) \([^ ]* \)*\1 ' -e ' \([a-z]\{1,\}\) \([^ ]* \)*\1 ' -e ' \([a-z]\{1,\}\) \([^ ]* \)*\1$'

Another thing to note is that this only covers the case where words are delimited by spaces. There can also be commas, semicolons and other punctuation.
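The sketch below runs the non-adjacent version on a few invented lines to show both what it catches and where the space-only assumption falls short. The sample text and variable names are made up for illustration.

```shell
# Hypothetical sample input, invented for illustration.
sample='one two one
one two three
stone one
one, two, one'

repeated=$(printf '%s\n' "$sample" | grep \
    -e '^\([a-z]\{1,\}\) \([^ ]* \)*\1$' \
    -e '^\([a-z]\{1,\}\) \([^ ]* \)*\1 ' \
    -e ' \([a-z]\{1,\}\) \([^ ]* \)*\1 ' \
    -e ' \([a-z]\{1,\}\) \([^ ]* \)*\1$')
printf '%s\n' "$repeated"
```

Only "one two one" is printed. "stone one" is rejected as intended, but "one, two, one" is missed as well, because the commas mean the word is never directly followed by a space.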

--
If you have GNU or BSD grep (as opposed to standard grep) then you can use word boundaries as an extension to regex, so it can be simplified into something like this:

Code:

grep '\<\([a-z]\{1,\}\)\>.*\<\1\>'

They also support back-references in extended regular expressions, so you can do this:

Code:

grep -E '\<([a-z]+)\>.*\<\1\>'

Note that in general, instead of [a-z], it is preferable to use [[:lower:]] for lowercase, or [[:alpha:]], which matches both upper and lower case in all compliant code sets.
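As a small illustration of the two character classes (a sketch on invented lines, using the standard \{1,\} interval so it does not depend on extensions):

```shell
# Hypothetical sample input, invented for illustration.
sample='Hello Hello
hello hello
Hello hello'

# [[:lower:]] accepts only lower-case letters in the captured word.
lower_only=$(printf '%s\n' "$sample" | grep '^\([[:lower:]]\{1,\}\) \1$')
# [[:alpha:]] accepts either case, but \1 must still repeat the exact text.
any_case=$(printf '%s\n' "$sample" | grep '^\([[:alpha:]]\{1,\}\) \1$')
printf 'lower_only:\n%s\nany_case:\n%s\n' "$lower_only" "$any_case"
```

"Hello hello" matches neither pattern, because the back-reference compares the captured text exactly, including case.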


Scrutinizer
