grep backreferencing question


 
# 1  
Old 10-19-2010
grep backreferencing question

Hello,
My input would be :
Code:
###Anything
   int b,c,a;
int    a,b,b;
###Anything
  int c,d,c;
int k,l;
###ANYTHING

Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.

The output for the above would be:
Code:
int a,b,b;
  int c,d,c;

I did grep '^[ ]*int[ ]*[a-z][a-z0-9]*\(,[a-z][a-z0-9]*\)\{0,\};$' to match all declarations, but I am not able to make the regex remember a variable and match it when it occurs later. My output just catches all the declaration statements.

Please help.


Thanks,
Prasanna

---------- Post updated at 12:55 PM ---------- Previous update was at 12:51 PM ----------

To add, I am only using grep to do this. I have done this before, but I don't remember how. I am sure it's possible with grep, with a little tweak to the regex and the backreferencing.

Last edited by Scott; 10-19-2010 at 03:31 PM.. Reason: Please use code tags
# 2  
Old 10-19-2010
The problem is that grep is always greedy. So I can make a regex like
Code:
echo "a,b,c,d,c,b,a,c" | egrep -o "([a-z]+)(,[a-z]+)*"

...and it will match the whole string, but when I start trying to use backreferences, the first ([a-z]+) will only ever match the very first variable: It will never skip past it and try other combinations when the backreference fails. There's no way to make grep non-greedy, either. Perl regexes support this though.
# 3  
Old 10-19-2010
Your grep is "too anchored" and your regex visualization is too wild. There is no back referencing in regex, just iteratively forward testing: '.*' means try remainder of pattern at every following byte.

A line containing the word int and later a semicolon should not have any variable-legal word repeated between them. Every variable name in C must start with a letter, the rest of the name can consist of letters, numbers and underscore characters. Commas are not variable-legal words, so you can ignore them -- classic excess information problem.

Deal with white spaces using \<\> or similar word boundary, so you avoid substrings but do not get tangled in the whole comma, space, tab thing. Some grep do not honor '\<\>' so you may need sed or '\b'.

Regex Tutorial - \b Word Boundaries

If you get desperate, add spaces by commas and semicolon so you can look for space or tab [ \t]. If you need to restore the original, sed has a hold space h/g command pair.

Code:
Narrative: grep for a line with the free standing word 'int', and
 later on that line for every C variable name as a free standing word somewhere,
  see if we have that same C variable name as a free standing word later anywhere,
 and yet later on that line a semicolon.

grep '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;'
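To see what the word boundaries buy you, here is a small sketch. It assumes GNU grep (which honors `\<` and `\>`) and a made-up two-line sample: in the first line `ab` merely contains `b`, and only the second line really repeats a name.

```shell
# Hypothetical sample input: "ab" contains "b" as a substring,
# but only the second line declares the same name twice.
sample=$(printf '%s\n' 'int ab,b;' 'int a,b,b;')

# The pattern from the post above: \< and \> force the captured name and
# its repeat (\1) to be whole words, so "b" inside "ab" does not count.
matches=$(printf '%s\n' "$sample" |
  grep '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;')

echo "$matches"
```

With GNU grep this should print only `int a,b,b;`.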

# 4  
Old 10-19-2010
Non-greedy match (i.e. *?) in Perl:

Code:
#!/usr/bin/perl

my $var="a,b,c,d,c";

if($var =~ /([a-z]+)(,[a-z]+)*?,\1/ )
{
        print "Match\n";
}



---------- Post updated at 11:30 AM ---------- Previous update was at 11:29 AM ----------

Quote:
Originally Posted by DGPickett
There is no back referencing in regex
There's definitely backreferencing in egrep.
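A quick way to check this for yourself, as a sketch only: it assumes GNU grep, where `-E` accepts `\1` as an extension (POSIX only specifies backreferences for basic regexes), and it uses single-letter names so substrings cannot confuse the result.

```shell
# The captured name must show up again after a comma for the line to match.
pattern='([a-z]+)(,[a-z]+)*,\1'

# "c" is repeated, so grep prints the line.
dup=$(echo "a,b,c,d,c" | grep -E "$pattern" || true)

# No name is repeated, so grep prints nothing (and exits non-zero).
nodup=$(echo "a,b,c,d" | grep -E "$pattern" || true)

echo "dup=[$dup] nodup=[$nodup]"
```

The first line matches even though the repeated name is not the first one in the list.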
# 5  
Old 10-19-2010
Code:
grep 'int .*\([^,][^,]*\),.*\1[,\;]' infile

Code:
int a,b,b;
int c,d,c;


Last edited by Scrutinizer; 10-19-2010 at 05:23 PM..
# 6  
Old 10-19-2010
Quote:
Originally Posted by Corona688
There's definitely backreferencing in egrep.
Oh, that \(\) \1 bit, nothing back about it, the first collects and the second applies. Who defined this silly term, Backus-Naur?

Well, when you slip into egrep/grep -E, the rules shift, which is one reason I use sed in complex egrep situations. It was a trustworthy pal until someone redefined regex '\<', arrogant little beasts!

---------- Post updated at 01:58 PM ---------- Previous update was at 01:54 PM ----------

Code:
$ sed '/\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;/!d'  <<!
###Anything
int b,c,a;
int a,b,b;
###Anything
int c,d,c;
int k,l;
###ANYTHING
 
Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.
 
!
int a,b,b;
int c,d,c;
$

---------- Post updated at 01:58 PM ---------- Previous update was at 01:58 PM ----------

GNU sed, older regex lib.

---------- Post updated at 02:01 PM ---------- Previous update was at 01:58 PM ----------

Would you like line numbers?

Code:
$ sed '
/\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;/!d
=
'  <<!
###Anything
int b,c,a;
int a,b,b;
###Anything
int c,d,c;
int k,l;
###ANYTHING
 
Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.
 
!
3
int a,b,b;
5
int c,d,c;
$
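If you would rather stay with grep, `grep -n` gives line numbers too, in `number:line` form rather than on a separate line. A minimal sketch, assuming GNU grep and the same sample declarations fed in with printf:

```shell
# Same pattern as the sed example; -n prefixes each match with its line number.
result=$(printf '%s\n' \
  '###Anything' \
  'int b,c,a;' \
  'int a,b,b;' \
  '###Anything' \
  'int c,d,c;' \
  'int k,l;' \
  '###ANYTHING' |
  grep -n '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;')

echo "$result"
```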

# 7  
Old 10-19-2010
grep backreferencing question

Quote:
Originally Posted by DGPickett
Your grep is "too anchored" and your regex visualization is too wild. There is no back referencing in regex, just iteratively forward testing: '.*' means try remainder of pattern at every following byte.

A line containing the word int and later a semicolon should not have any variable-legal word repeated between them. Every variable name in C must start with a letter, the rest of the name can consist of letters, numbers and underscore characters. Commas are not variable-legal words, so you can ignore them -- classic excess information problem.

Deal with white spaces using \<\> or similar word boundary, so you avoid substrings but do not get tangled in the whole comma, space, tab thing. Some grep do not honor '\<\>' so you may need sed or '\b'.

Regex Tutorial - \b Word Boundaries

If you get desperate, add spaces by commas and semicolon so you can look for space or tab [ \t]. If you need to restore the original, sed has a hold space h/g command pair.

Code:
Narrative: grep for a line with the free standing word 'int', and
 later on that line for every C variable name as a free standing word somewhere,
  see if we have that same C variable name as a free standing word later anywhere,
 and yet later on that line a semicolon.

grep '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;'


>>>> Thanks a lot. The only problem with the above is that it also matches illegal declarations, such as:

int a,b,,b; int a,b,b,;

---------- Post updated at 02:20 PM ---------- Previous update was at 02:19 PM ----------

Quote:
Originally Posted by Scrutinizer
Code:
grep -E 'int .*([^,]+),.*\1[,\;]' infile

Code:
int a,b,b;
int c,d,c;



---------- Post updated at 19:37 ---------- Previous update was at 19:32 ----------

Normal grep:
Code:
grep 'int .*\([^,][^,]*\),.*\1[,\;]' infile


>>> Thanks a lot. This one also matches illegal declarations, such as:

int a,b,,b; int a,b,b,;
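One possible way around the illegal-declaration problem, still using only grep, is to filter in two stages: first keep only well-formed declarations with an anchored pattern, then look for a repeated name among those. This is a sketch, not a tested answer from the thread. It assumes GNU grep, no spaces or tabs around the commas, and that names may also start with an underscore; the variables `id`, `decl` and `dup` are just illustrative shell variables.

```shell
# One C identifier (assumption: a leading underscore is allowed as well).
id='[a-zA-Z_][a-zA-Z0-9_]*'

# Stage 1: the whole line must be "int name,name,...;" and nothing else.
decl="^[ ]*int[ ][ ]*$id\(,$id\)*;\$"

# Stage 2: some whole-word name appears again later on the line.
dup="\<\($id\)\>.*\<\1\>"

result=$(printf '%s\n' \
  '   int b,c,a;' \
  'int    a,b,b;' \
  '  int c,d,c;' \
  'int k,l;' \
  'int a,b,,b;' \
  'int a,b,b,;' |
  grep "$decl" | grep "$dup")

echo "$result"
```

The two malformed lines are dropped by the first grep before the backreference is ever tried, so only `int    a,b,b;` and `  int c,d,c;` should come out, with their original spacing.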