sed pattern matching question

04-28-2011

I inherited a script that contains the following sed command:

Code:

sed -n -e '/^.*ABCD|/p' $fileName | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//' > ${fileName}.tmp

What I'm wondering is whether ABCD has a special pattern matching value in sed, such as a character class similar or identical to [A-Z].

I'm thinking they were intended to be literal values.

Thanks in advance!

Last edited by Franklin52; 04-29-2011 at 03:51 AM.. Reason: Please use code tags

topmhat

04-28-2011

No, they are literals.
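A quick way to check this is to feed sed a few lines that contain uppercase letters but not always the exact sequence ABCD. This is only a sketch: the sample strings are made up for illustration. In a basic regular expression, which is what sed uses by default, both ABCD and the unescaped | match only themselves.

```shell
# Made-up sample input: only the second line contains the exact text "ABCD|".
sample=$(printf '%s\n' 'WXYZ|first' 'xxABCD|second' 'ABDC|third')

# 'ABCD|' treated as a literal: selects only lines containing exactly ABCD followed by a pipe.
literal_out=$(printf '%s\n' "$sample" | sed -n '/ABCD|/p')

# A real character class, for contrast: any four uppercase letters followed by a pipe.
class_out=$(printf '%s\n' "$sample" | sed -n '/[A-Z][A-Z][A-Z][A-Z]|/p')

printf 'literal pattern selected:\n%s\n' "$literal_out"
printf 'character class selected:\n%s\n' "$class_out"
```

If ABCD behaved like [A-Z], both commands would select the same lines; here the literal pattern selects one line and the character class selects all three.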

From what I can tell the whole string (ABCD|/p) is a literal. The only regex elements are the wildcard ('.*'), the beginning-of-line anchor ('^'), and the end-of-line anchor ('$').

This basically does these 4 operations:

1. Match any character from beginning of line up to and including 'ABCD|/p' and only print the matching lines.

2. Take the output from the previous sed command, then match from the beginning of the line any characters up to and including the string literal 'ABCD|', and remove the matched string from the output.

3. Take the output from the previous sed command, then match the string literal '|ABCD' if it is at the end of the line and remove it.

4. Output the results to a file named ${fileName}.tmp

Example:

File (test.txt) Contents:

Code:

sldns;jnfd
dddghgdhgABCD|/p|test line 1
ABCD|/p|test line 2
This is|ABCD|/p|test line 3
abcdsnalfnalfsg

Command (output to screen and not a file):

Code:

sed -n -e '/^.*ABCD|/p' test.txt |sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//'

Results:

Code:

/p|test line 1
/p|test line 2
/p|test line 3

ddreggors

04-28-2011

Quote:

From what I can tell the whole string (ABCD|/p) is a literal.

Mmmmm... no.
The string 'ABCD|' is a literal; the rest, '/p', is the end of the regex ('/') and the print command ('p'). So the first sed command

Code:

sed -n -e '/^.*ABCD|/p'

is an instruction to print only lines containing 'ABCD|'. There is some redundancy there; the following would do the same:

Code:

sed -n -e '/ABCD|/p'
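The redundancy can be checked by running both forms over the same input and comparing the results. A minimal sketch, assuming made-up sample lines (the variable names are just for illustration):

```shell
# Made-up sample lines; two of them contain "ABCD|".
sample=$(printf '%s\n' 'no match here' 'leadingABCD|x' 'ABCD|y' 'abcd|z')

# Address with the leading ^.* as in the inherited script.
with_prefix=$(printf '%s\n' "$sample" | sed -n -e '/^.*ABCD|/p')

# Address without it.
without_prefix=$(printf '%s\n' "$sample" | sed -n -e '/ABCD|/p')

if [ "$with_prefix" = "$without_prefix" ]; then
    echo "same output"
else
    echo "outputs differ"
fi
```

Since ^.* can match zero or more characters at the start of the line, it adds nothing when the pattern is only used to select lines.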

The second sed removes everything from the beginning of the line up to and including ABCD|. Note that matching here is greedy, so if you have multiple instances of ABCD| on a line, the pattern is going to match the longest possible substring. E.g.:

Code:

$ echo "blah|ABCD|hhhh|fals.;and&+324ABCD|ooo|ABCD" | sed -e 's/^.*ABCD|//' 
ooo|ABCD

The third one removes a trailing '|ABCD'.

We could simplify this as:

Code:

sed -n '/ABCD|/{     # do the following on lines containing "ABCD|"
    s/^.*ABCD|//     # eat the longest substring from beginning to "ABCD|"
    s/|ABCD$//       # eat the last "|ABCD" just before end of line
    p                # and print it
}'
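To see that the single sed process behaves like the original three-stage pipeline, both can be run against the test.txt contents posted earlier in the thread. This is a sketch only: the temporary file and variable names are assumptions made for the comparison.

```shell
# Recreate the test.txt contents from earlier in the thread in a temporary file.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
sldns;jnfd
dddghgdhgABCD|/p|test line 1
ABCD|/p|test line 2
This is|ABCD|/p|test line 3
abcdsnalfnalfsg
EOF

# Original form: three sed processes chained with pipes.
pipeline_out=$(sed -n -e '/^.*ABCD|/p' "$tmpfile" | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//')

# Simplified form: one sed process doing the select, both substitutions, and the print.
single_out=$(sed -n '/ABCD|/{
s/^.*ABCD|//
s/|ABCD$//
p
}' "$tmpfile")

rm -f "$tmpfile"

if [ "$pipeline_out" = "$single_out" ]; then
    echo "identical output"
else
    echo "outputs differ"
fi
```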

Last edited by mirni; 04-28-2011 at 08:40 PM..

mirni

04-28-2011

oops yep I was wrong there.

I oddly missed that '/' was the delimiting character for the regex pattern, and therefore the p on the other side was actually a sed command.
I guess the '|/' threw me off...

Thanks for the correction.

ddreggors

04-28-2011

Thank you both. I too feel the ABCD is literal. I think mirni is correct about the overall behavior (as were you ddreggors with the exception of the print).

The twist in all of this is that this sed command has been used at the end of an extraction process, supposedly to remove lines that contain a leading ABCD followed by a pipe, and a trailing ABCD preceded by a pipe, and to print (p) everything else to a .tmp file. Then a mv command was used to overwrite the original file with the tmp file.

But when I run this command against a sample file created with the first 1000 lines of one of our prod files, the tmp file is empty each time. However when it runs in production the original file has the same byte count as the processed and renamed output file (as measured before the mv command is carried out). So print is working in prod.

The files are too big to diff, though I compared the first 100K rows of the before and after to each other, and then the last 100K rows, and came up with no differences each time. So in production it is writing every line to the tmp file, but when I run it, it writes no lines to the tmp file.

My original intent was to change this over to a Perl pattern match and replace in the hopes of speeding up the process, but I wanted to understand the sed statement first. Now it's looking like at best sed is doing nothing, given that the before and after files are identical. But I still need to figure out why my tmp file is empty when I use the same command from the command line, while in prod the tmp file is the same size as the original file when the command is run from a script.

topmhat

04-28-2011

You are not gonna get much speed-up, if any at all, by using Perl.
If you could post a sample input, we might be able to help you out.
Can you post output of:

Code:

grep -m3  'ABCD|' $fileName

Also, you have the -n switch there, so sed will not print unless explicitly instructed (the 'p' command), which means the output of this sed filter should be smaller than the original input.
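The effect of -n on a sample with no matching lines can be sketched as follows. The sample data is made up, and the variable names are just for illustration; the point is that when no line contains 'ABCD|', the pipeline produces nothing at all.

```shell
# Made-up stand-in for a sample file in which no line contains "ABCD|".
sample=$(printf '%s\n' 'alpha|beta' 'gamma|delta' 'ABCD without a pipe')

# Count the lines the first sed stage would select. grep -c prints the count
# and returns a non-zero status when the count is 0, hence the "|| true".
match_count=$(printf '%s\n' "$sample" | grep -c 'ABCD|' || true)

# With -n, only lines matched by the address are printed, so nothing survives here.
filtered=$(printf '%s\n' "$sample" | sed -n -e '/^.*ABCD|/p' | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//')

echo "lines containing ABCD|: $match_count"
if [ -z "$filtered" ]; then
    echo "pipeline output is empty"
fi
```

Running the same grep -c against the real sample file would show whether an empty .tmp file is simply the expected result for that data.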

Last edited by mirni; 04-28-2011 at 09:34 PM..

mirni

04-29-2011

Quote:

Originally Posted by mirni

You are not gonna get much speed-up, if any at all, by using Perl.
If you could post a sample input, we might be able to help you out.
Can you post output of:

Code:

grep -m3  'ABCD|' $fileName

Also, you have the -n switch there, so sed will not print unless explicitly instructed (the 'p' command), which means the output of this sed filter should be smaller than the original input.

I can't post a sample because the data is sensitive.

I have yet to actually find any records that begin with .*ABCD| or end with |ABCD.

The p command is in fact explicitly included because the command used is always:

Code:

sed -n -e '/^.*ABCD|/p' $fileName | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//' > ${fileName}.tmp

One other thing that is confusing me is that the pattern before the print matches the one after the print. The only difference seems to be that the first is being fed to print, while the second occurrence is being targeted for removal. I am not quite sure what the developer's intent was there.

As for sed itself, would it be safe to say that the result of this specific command is that any entire line which begins with any one or more characters followed by a literal ABCD and a | would be removed, and that any line ending with a pipe followed by a literal ABCD and end of line would also be removed?

Thank you again!
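Going by the explanations earlier in the thread, the command keeps lines rather than removing them: only lines containing 'ABCD|' are printed, and those lines lose everything up to and including the last 'ABCD|' plus any trailing '|ABCD'. A small sketch with made-up records (the "hdr" and "row" strings are purely illustrative):

```shell
# Made-up records covering the four cases of interest.
sample=$(printf '%s\n' \
    'hdrABCD|row 1' \
    'hdrABCD|row 2|ABCD' \
    'row 3|ABCD' \
    'row 4')

# The inherited pipeline, reading from the sample instead of $fileName.
result=$(printf '%s\n' "$sample" \
    | sed -n -e '/^.*ABCD|/p' \
    | sed -e 's/^.*ABCD|//' \
    | sed -e 's/|ABCD$//')

printf '%s\n' "$result"
```

The first two records come out as "row 1" and "row 2". The third is dropped even though it ends with '|ABCD', because it never contains 'ABCD|' and so the first stage does not print it; the fourth is dropped for the same reason.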

topmhat
