sed pattern matching question


 
# 1  
Old 04-28-2011

I inherited a script that contains the following sed command:
Code:
sed -n -e '/^.*ABCD|/p' $fileName | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//' > ${fileName}.tmp

What I'm wondering is whether ABCD has a special pattern matching value in sed, such as a character class similar or identical to [A-Z].

I'm thinking they were intended to be literal values.

Thanks in advance!

Last edited by Franklin52; 04-29-2011 at 03:51 AM.. Reason: Please use code tags
# 2  
Old 04-28-2011
No, they are literals.

From what I can tell the whole string (ABCD|/p) is a literal. The wildcard ('.*'), beginning of line ('^'), and end of line ('$') are the only regex metacharacters.

This basically does these 4 operations:

1. Match any character from beginning of line up to and including 'ABCD|/p' and only print the matching lines.

2. Take the output from the previous sed command, then match from the beginning of the line any characters up to and including the string literal 'ABCD|', and remove the matching string from the output.

3. Take the output from the previous sed command, then match the string literal '|ABCD' if it is at the end of the line, and remove it.

4. Output the results to a file named ${fileName}.tmp

Example:

File (test.txt) Contents:
Code:
sldns;jnfd
dddghgdhgABCD|/p|test line 1
ABCD|/p|test line 2
This is|ABCD|/p|test line 3
abcdsnalfnalfsg

Command (output to screen and not a file):
Code:
sed -n -e '/^.*ABCD|/p' test.txt |sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//'

Results:
Code:
/p|test line 1
/p|test line 2
/p|test line 3

# 3  
Old 04-28-2011
Quote:
From what I can tell the whole string (ABCD|/p) is a literal.
Mmmmm... no.
The string 'ABCD|' is a literal, the rest '/p' is the end of regex ('/') and print command ('p'). So the first sed command
Code:
sed -n -e '/^.*ABCD|/p'

is an instruction to print only lines containing 'ABCD|'. There is some redundancy there; the following would do the same:
Code:
sed -n -e '/ABCD|/p'

The second sed removes everything from the beginning of the line up to and including ABCD|. Note that matching here is greedy, so if you have multiple instances of ABCD| on a line, the pattern is gonna match the longest possible substring. E.g.:
Code:
$ echo "blah|ABCD|hhhh|fals.;and&+324ABCD|ooo|ABCD" | sed -e 's/^.*ABCD|//' 
ooo|ABCD

The third one removes a trailing '|ABCD'.

We could simplify this as:
Code:
sed -n '/ABCD|/{       # do the following on lines containing "ABCD|"
    s/^.*ABCD|//;      # eat the longest substring from beginning to "ABCD|"
    s/|ABCD$//;        # eat the last "|ABCD" just before end of line
    p; }'              # and print it
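
A quick way to check that the single-process version behaves like the original three-stage pipeline is to run both over the same input and compare the results. This is only a sketch: the sample lines below are made up for illustration, and the variable names (sample, pipeline_out, single_out) are not from the original script.

```shell
# Made-up sample data; the real file contents are unknown.
sample=$(mktemp)
printf '%s\n' \
    'no marker here' \
    'prefixABCD|payload one|ABCD' \
    'ABCD|payload two' \
    'x|ABCD|y|ABCD|payload three|ABCD' > "$sample"

# Original three-stage pipeline from the inherited script.
pipeline_out=$(sed -n -e '/^.*ABCD|/p' "$sample" | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//')

# Single-process version: filter, strip prefix, strip suffix, print.
single_out=$(sed -n '/ABCD|/{
s/^.*ABCD|//
s/|ABCD$//
p
}' "$sample")

if [ "$pipeline_out" = "$single_out" ]; then
    echo "identical output"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

On this sample both variants leave the three payload lines and drop the line without "ABCD|". That is not a proof for every input, but it is a cheap sanity check before swapping the command in a script.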


Last edited by mirni; 04-28-2011 at 08:40 PM..
# 4  
Old 04-28-2011
Oops, yep, I was wrong there.

I oddly missed that '/' was the delimiting character for the regex pattern and therefore the p on the other side was actually a sed command.
I guess the '|/' threw me off...

Thanks for the correction.
# 5  
Old 04-28-2011
Thank you both. I too feel the ABCD is literal. I think mirni is correct about the overall behavior (as were you ddreggors with the exception of the print).

The twist in all of this is that this sed command has been used at the end of an extraction process, supposedly to remove lines that contain a leading ABCD followed by a pipe and a trailing ABCD preceded by a pipe, and to print (p) everything else to a .tmp file. Then a mv command was used to overwrite the original file with the tmp file.

But when I run this command against a sample file created with the first 1000 lines of one of our prod files, the tmp file is empty each time. However when it runs in production the original file has the same byte count as the processed and renamed output file (as measured before the mv command is carried out). So print is working in prod.

The files are too big to diff though I compared the first 100K rows of the before and after to each other, and then the last 100K rows and came up with no differences each time. So in production it is writing every line to the tmp file, but when I run it it writes no lines to the tmp file.

My original intent was to change this over to a Perl pattern match and replace in the hopes of speeding up the process, but I wanted to understand the sed statement first. Now it's looking like at best sed is doing nothing, given that the before and after files are identical. But I still need to figure out why my tmp file is empty when I use the same command from the command line, while in prod the tmp file is the same size as the original file (when run from a script).
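
One way to narrow this down might be to count how many lines contain the literal "ABCD|" and to byte-compare the original with the tmp file using cmp, which stops at the first difference and so copes with files that are too large to diff comfortably. The sketch below runs on a made-up three-line stand-in file, not on the real data; matches and same are illustrative variable names.

```shell
# Stand-in for the real file; these rows are invented and contain no "ABCD|".
fileName=$(mktemp)
printf '%s\n' 'row 1|a|b' 'row 2|c|d' 'row 3|e|f' > "$fileName"

# Count lines containing the literal "ABCD|". In a basic regex an unescaped | is literal.
# grep exits non-zero when nothing matches, hence the "|| true".
matches=$(grep -c 'ABCD|' "$fileName" || true)
echo "lines containing ABCD|: $matches"

# Same command as in the script, with the file name quoted.
sed -n -e '/^.*ABCD|/p' "$fileName" | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//' > "${fileName}.tmp"

# cmp -s is silent and only sets the exit status: 0 means byte-identical.
if cmp -s "$fileName" "${fileName}.tmp"; then
    same=yes
else
    same=no
fi
echo "tmp identical to original: $same"
rm -f "$fileName" "${fileName}.tmp"
```

With zero matching lines the tmp file comes out empty, which is what the command-line test showed. An identical tmp file in prod would then suggest that something other than this exact command is producing it there.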
# 6  
Old 04-28-2011
You are not gonna get much speed-up, if any at all, by using Perl.
If you could post a sample input, we might be able to help you out.
Can you post output of:
Code:
grep -m3  'ABCD|' $fileName

Also, you have the -n switch there, so sed will not print anything unless explicitly instructed (the 'p' command). That means the output of this sed filter should be smaller than the original input.
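
To see the effect of -n in isolation, here is a small sketch on three made-up lines (the input and the variable names are only for illustration). With -n, only the lines hit by 'p' come out. Without it, every line is printed once by default and matching lines are printed a second time by 'p'.

```shell
# Three invented lines: two contain "ABCD|", one does not.
input=$(printf '%s\n' 'keep ABCD|one' 'drop me' 'keep ABCD|two')

# With -n: automatic printing is off, so only the explicit p prints.
with_n=$(printf '%s\n' "$input" | sed -n '/ABCD|/p')

# Without -n: every line is auto-printed, and matches are printed again by p.
without_n=$(printf '%s\n' "$input" | sed '/ABCD|/p')

# tr strips the padding some wc implementations add.
with_count=$(printf '%s\n' "$with_n" | wc -l | tr -d ' ')
without_count=$(printf '%s\n' "$without_n" | wc -l | tr -d ' ')

echo "with -n:    $with_count lines"
echo "without -n: $without_count lines"
```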

Last edited by mirni; 04-28-2011 at 09:34 PM..
# 7  
Old 04-29-2011
Quote:
Originally Posted by mirni
You are not gonna get much speed-up, if any at all, by using Perl.
If you could post a sample input, we might be able to help you out.
Can you post output of:
Code:
grep -m3  'ABCD|' $fileName

Also, you have the -n switch there, so sed will not print anything unless explicitly instructed (the 'p' command). That means the output of this sed filter should be smaller than the original input.
I can't post a sample because the data is sensitive.

I have yet to actually find any records that begin with .*ABCD| or end with |ABCD.

The p command is in fact explicitly included because the command used is always:
Code:
sed -n -e '/^.*ABCD|/p' $fileName | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//' > ${fileName}.tmp

One other thing that is confusing me is that the pattern before the print matches the one after the print. The only difference seems to be that the first is being fed to print, while the second occurrence is being targeted for removal. I am not quite sure what the developer's intent was there.

As for sed itself, would it be safe to say that the result of this specific command is that any entire line which begins with one or more characters followed by a literal ABCD and a | would be removed, and that any line ending with a pipe followed by a literal ABCD would be removed as well?

Thank you again!
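
A quick check on made-up records (none of this is the real data) suggests the opposite of removal: lines containing "ABCD|" are the only ones kept, with the leading part and the trailing "|ABCD" stripped, and every other line is dropped. The second command is an assumption about what a true "remove those lines" filter could look like, using sed's d command. It is not taken from the original script.

```shell
# Invented records: only the second one contains "ABCD|".
records=$(printf '%s\n' \
    'plain record|1|2' \
    'junkABCD|record|3|4|ABCD' \
    'another plain record')

# The command from the script: keeps only matching lines, stripped.
kept=$(printf '%s\n' "$records" |
    sed -n -e '/^.*ABCD|/p' | sed -e 's/^.*ABCD|//' | sed -e 's/|ABCD$//')

# Hypothetical alternative if the intent were to delete such lines instead.
removed=$(printf '%s\n' "$records" | sed -e '/ABCD|/d' -e '/|ABCD$/d')

echo "script command keeps:"
echo "$kept"
echo "delete-style filter keeps:"
echo "$removed"
```

On this input the script's command outputs only "record|3|4", while the delete-style filter outputs the two plain records.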