Grep words with X doubles only

06-23-2013

Hi!
I'm trying to figure out how to find words with exactly X doubles (pairs of identical adjacent letters) and no more. I'm searching a dictionary file with one word per line. For instance, if you want to find words containing only one pair of double letters, you could do something like this:

Code:

egrep '(.)\1' wordlist.txt |egrep -v '(.)\1.*(.)\2'

That'll get rid of words with two, or more, doubles. But when you want to search for two or three sets of doubles, it gets a bit unwieldy.

Code:

egrep '(.)\1.*(.)\2' wordlist.txt |egrep -v '(.)\1.*(.)\2.*(.)\3'

And so on...
It seems to me that there must be a way to specify the maximum number of doubles in a single regex, but I cannot figure out how. I've found a number of pages online that talk about finding doubles, but none of them mention how to limit the count to the desired amount. I thought a negative lookahead containing a backreference might do it, but either I'm writing it wrong or it just doesn't work. I still get words with more than X doubles.

Code:

$ grep -P '(.)\1.*(?!\1)' twl |head
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark

$ grep -P '^(.)\1.*(?!(.)\1)$' twl |head
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark

And I tried a bunch of other stuff, but can't figure it out, so I'm turning to you all.

sudon't

06-23-2013

To find words that have at least one double but not more than 3:

Code:

sed -n -e '/\(.\)\1/!d; s//&/4; t' -e p

The number can be parameterized with a shell variable.
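A minimal sketch of that parameterization, assuming a POSIX shell; the function name `doubles_up_to` and the sample words are made up for illustration. The count passed to `s` is the maximum plus one, because a successful substitution is what marks a line as having too many doubles.

```shell
# Hypothetical wrapper: print words that have between 1 and max doubles.
# max defaults to 3; the word list is read from standard input.
doubles_up_to() {
    max=${1:-3}
    # Splice max+1 into the s command as its occurrence number.
    sed -n -e '/\(.\)\1/!d; s//&/'"$((max + 1))"'; t' -e p
}

# Sample run with a limit of 2 doubles.
printf '%s\n' cat book balloon bookkeeper aabbccdd | doubles_up_to 2
```

With a limit of 2 this should print book and balloon: cat has no double, while bookkeeper and aabbccdd have three and four.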

Regards,
Alister

06-24-2013

Quote:

Originally Posted by alister

To find words that have at least one double but not more than 3:

Code:

sed -n -e '/\(.\)\1/!d; s//&/4; t' -e p

The number can be parameterized with a shell variable.

Regards,
Alister

Hey Alister!
I don't really know sed very well, but let me see if I can figure this out. I think it's worth pointing out that I have no background with this stuff. I'm just learning on my own, in my spare time, such as it is.
In the first regex, it looks like you're finding any instance of doubles, then saying do not delete, presumably so the pattern gets passed to the second regex?
In the second regex, it looks like you're saying substitute "nothing" with the found pattern. I'm guessing the "4" is a quantifier? And the "t'"? I have no idea.
I feel I have a vague notion of what you're doing, but can't entirely parse the two regexes. But, here we are with two (three?) regexes again. Is it really not possible to do this in a single grep ERE, or PCRE?

sudon't

06-24-2013

/\(.\)\1/!d :
For every line that does not match the pattern, delete the pattern space. The ! negates the address (the pattern), not the d command. So for any line without at least one doubled character, the rest of the script is skipped and the next line from the input stream is loaded into the pattern space.

s//&/4:
For every line with at least one doubled character (the lines that survived the previous command), try to replace the 4th match of the last-used regular expression with the matched text itself. The empty // means "reuse the last regular expression", here the one from the first command, not "nothing". Note that it is the 4th match of the pattern, which can be any doubled character, not the 4th occurrence of the string that matched first.

t:
That's a branch command. It says that if a substitution has succeeded since the last input line was read, jump to the end of the script (since no label is given). So a line with 4 or more doubles skips the p command and, because of the -n option, is never printed.

p:
And, if the line manages to cross that last barrier, just print it. This way you are assured that the line has from 1 to 3 double pairs.
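The four steps above can be checked on a few sample words. This is only a sketch: the word list is made up, and the first command is run on its own (without -n) just to show what it filters out.

```shell
# Made-up sample: 0, 1, 3 and 4 doubles respectively.
words='cat
book
bookkeeper
aabbccdd'

# First command alone: '/\(.\)\1/!d' drops lines with no double (here "cat").
stage1=$(printf '%s\n' "$words" | sed '/\(.\)\1/!d')

# Full script: lines where s//&/4 succeeds take the 't' branch past 'p'.
full=$(printf '%s\n' "$words" | sed -n -e '/\(.\)\1/!d; s//&/4; t' -e p)

printf 'after first command:\n%s\n\nfull script:\n%s\n' "$stage1" "$full"
```

The first command should leave book, bookkeeper and aabbccdd; the full script should then drop aabbccdd as well, since it has a 4th double for the substitution to find.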

Oh, seems like a long time since I used sed in my scripts.
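On the single-regex question: the `grep -P` attempts earlier in the thread seem to fail because the negative lookahead is zero-width and sits after `.*`, which can run to the end of the line, where the lookahead trivially succeeds. A possible sketch that anchors the lookahead at the start of the line instead, assuming a grep built with -P (PCRE) support; the function name `exactly_doubles` is hypothetical.

```shell
# Hypothetical helper: print words from stdin with exactly n doubles.
# Assumes grep supports -P (PCRE), as in the earlier attempts.
exactly_doubles() {
    n=$1
    # The lookahead is checked at ^, so it rejects the whole line when
    # n+1 doubles can be found; the rest then requires n of them.
    # Group 1 lives inside the lookahead, group 2 in the main pattern.
    grep -P '^(?!(?:.*?(.)\1){'"$((n + 1))"'})(?:.*?(.)\2){'"$n"'}'
}

# Sample run: only "balloon" has exactly two doubles.
printf '%s\n' cat book balloon bookkeeper | exactly_doubles 2
```

This counts non-overlapping doubles, the same way the egrep pipelines in the first post do.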

------

@alister: Good one.

Last edited by elixir_sinari; 06-24-2013 at 02:12 PM..

elixir_sinari
