sed filtering lines by range fails 1-line-ranges


 
# 1  
Old 10-31-2012

The following is part of a larger project and sed is (right now) a given. I am working on a recursive Korn shell function to "peel off" XML tags from a larger text. Just for context i will show the complete function (not working right now) here:

Code:
function pGetXML
{
typeset chTag="$1"
typeset chOpt="$1"
typeset chLine=""

if [ "${chOpt#*/}" = "${chOpt}" ] ; then
     chOpt=""
else
     chOpt="${chOpt#*/}"
     chTag="${chTag%/*}"
fi

print -u2 - "inside pGetXML...."
print -u2 - "chTag=${chTag}"
print -u2 - "chOpt=${chOpt}"
print -u2 - "Args=$*\n"

if [ -n "$chTag" ] ; then
     shift
     sed -n '/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/,/<\/'"$chTag"'[^>]*>/p' |\
     pGetXML "$@"
else
     while read chLine ; do
          pStripTags "$chLine"
     done
fi

return 0
}

The function will be called like

Code:
pGetXML "arg1/type=opt1" "arg2/type=opt2" "Value"...
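As a side note on how each argument is taken apart: the function relies on the standard `${var#pattern}` and `${var%pattern}` parameter expansions. A minimal sketch of just that step follows; `split_arg` is a hypothetical helper name used only for illustration and is not part of the function above:

```shell
# Hypothetical helper: split "tag/option" the same way pGetXML does.
split_arg() {
    tag=$1
    opt=$1
    if [ "${opt#*/}" = "$opt" ]; then
        # no "/" in the argument, so there is no option
        opt=''
    else
        opt=${opt#*/}    # drop everything up to the first "/"
        tag=${tag%/*}    # drop the last "/" and what follows it
    fi
    printf 'tag=%s opt=%s\n' "$tag" "$opt"
}

split_arg "arg1/type=opt1"
split_arg "Value"
```

An argument without a slash simply yields an empty option, which is why the innermost "Value" tag ends up with two adjacent `[^>]*` groups in the generated pattern.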

and is intended to "peel off" layers of XML tags from a file organized like this:

Code:
<arg1 type=opt1>
     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>
     <othertag>
          <Value>foo bar</Value>
     </othertag>
</arg1>

The function should first print everything from "<arg1>" to "</arg1>" (the "option" is used because there could be other tags with the same name i am not interested in, like "<arg1 type=else>"), in the second instance filter from that only the lines "<arg2>...</arg2>" and in the third pass only the lines "<Value>...</Value>". The function "pStripTags" simply strips off the tags leaving the text inside.
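For the two outer layers the plain address range does what is described. A small sketch, assuming the sample text above is held in a shell variable (`sample` and `layer2` are names chosen here for illustration only) and using `printf` so that it runs in any POSIX shell:

```shell
sample='<arg1 type=opt1>
     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>
     <othertag>
          <Value>foo bar</Value>
     </othertag>
</arg1>'

# First pass keeps <arg1 ..opt1..> to </arg1>; the second pass keeps
# <arg2 ..opt2..> to </arg2> out of what the first pass printed.
layer2=$(printf '%s\n' "$sample" |
    sed -n '/<arg1[^>]*type=opt1[^>]*>/,/<\/arg1[^>]*>/p' |
    sed -n '/<arg2[^>]*type=opt2[^>]*>/,/<\/arg2[^>]*>/p')
printf '%s\n' "$layer2"
```

This works because the opening and closing tags of these layers sit on different lines.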

Well, this is what was intended and it kind of works, but in the last step "sed" fails to do as expected when the opening and closing tags of the range are on the same line. I am at this stage down to this portion of the text (this is verified):

Code:
     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>

and the sed command (verified with "set -xv") is this:

Code:
sed -n '/<Value[^>]*[^>]*>/,/<\/Value[^>]*>/p'

I would have expected it to print only line 2, but it doesn't. Instead it prints lines 2 and 3.
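The behaviour can be reproduced in isolation. A minimal sketch, assuming the three lines are fed in via `printf` (the variable names `input` and `result` are only for this illustration):

```shell
input='     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>'

# The range opens on line 2. The end pattern is first tried on line 3,
# which does not match, so the range stays open until end of input.
result=$(printf '%s\n' "$input" |
    sed -n '/<Value[^>]*[^>]*>/,/<\/Value[^>]*>/p')
printf '%s\n' "$result"
```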

The objective is to create a sed script that will fit into the recursive function. Any pointers will be welcome.

bakunin
# 2  
Old 10-31-2012
from the man page:
Quote:
Addresses
... and if addr2 is a regexp, it will not be tested against the line that addr1 matched.
Not sure how to circumvent this... will the </value> tag always be on the same line?
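The quoted rule is easy to check with a toy input where the same regexp serves as both addresses. This is just a sketch; the letters are arbitrary test data and `out` is a name picked for the example:

```shell
# Line 1 matches addr1. addr2 is the same regexp, but it is not tested
# against line 1, so the range only closes at the next "a" on line 3.
out=$(printf 'a\nb\na\nc\n' | sed -n '/a/,/a/p')
printf '%s\n' "$out"
```

The range spans lines 1 to 3 rather than ending on line 1, which is the same effect seen with the one-line `<Value>` element.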
# 3  
Old 10-31-2012
Thanks for the man page quote. Either i am blind or the AIX man page doesn't mention this detail. But this is at least an explanation.

Quote:
Originally Posted by RudiC
from the man page:
Not sure how to circumvent this... will the </value> tag always be on the same line?
No, this is what led me to trying my solution in the first place. As you can see from the example text, all but the innermost tags are on separate lines.

I will post the reworked script as soon as i have it ready. Thanks.

bakunin
# 4  
Old 10-31-2012
Hi bakunin, you may replace your sed script with this:
Code:
sed -n '
:strt
/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/{
/<\/'"$chTag"'[^>]*>/{
p
d
}
N
b strt
}'

With an address range, sed finds a line matching the first address and does not test the second address against that same line; the second address is only tried on subsequent lines. Hence the problem.
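To see the effect, here is a small sketch that wraps the script above in a shell function and feeds it both a one-line and a multi-line element. The function name `extract` and the fixed values for `chTag` and `chOpt` are assumptions made for the demonstration:

```shell
# Stand-in values for the variables the real function would set.
chTag='Value'
chOpt=''

# Hypothetical wrapper around the looping sed script.
extract() {
    sed -n '
:strt
/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/{
/<\/'"$chTag"'[^>]*>/{
p
d
}
N
b strt
}'
}

# Opening and closing tag on the same line.
one=$(printf '<a>\n<Value>x</Value>\n</a>\n' | extract)
# Opening and closing tag on different lines.
multi=$(printf '<Value>\nx\n</Value>\nrest\n' | extract)
printf '%s\n' "$one" "$multi"
```

Instead of a range, the script collects lines into the pattern space with `N` until the closing tag shows up, so both tags are tested against the same buffer.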
# 5  
Old 10-31-2012
Well, I came up with this:
Code:
sed -rn '/<Value[^>]*[^>]*>/{h;
                 /<\/Value[^>]*>/!b nxt;g;p;b end
         : nxt    {n; /<\/Value[^>]*>/!{H;b nxt}
                     /<\/Value[^>]*>/H;x;p;b end
                  }
         : end}
        '

which prints out one-liners as well as multi-liners between <Value> tags... give it a shot and report back.
# 6  
Old 10-31-2012
Many thanks for your helpful suggestions.

I modified the function a bit and noticed that i don't need the last step "pStripTags" if i modify the sed script to strip the tags immediately. Here is the revised function. I have added "tee -a <tracefile>" commands to trace the various steps of the recursion. For production they can safely be removed, as they only serve debugging purposes:

Code:
# ------------------------------------------------------------------------------
# pGetXML                        extract certain values from a layered XML code
# ------------------------------------------------------------------------------
# Author.....: bakunin, with help of various unix.com members
# last update: 2012 08 23    by: bakunin
# ------------------------------------------------------------------------------
# Revision Log:
#
# ------------------------------------------------------------------------------
# Usage:
#     pGetXML tag1[/option1] [tag2[/option2] ..]
#
#
#     Example:
#          cat file | pGetXML foo/opt1 bar/opt2
#          will search for a range of "<foo ...opt1..> ... </foo>" and in the
#          resulting stream search for a range of "<bar ..opt2..> ... </bar>".
#          The result will be reformatted to a single line and the enclosing
#          tags will be removed. This text:
#
#          <foo type=opt2>
#               <sometag>
#          </foo>
#          <foo type=opt1>
#               <bar>
#                    somevalue
#               </bar>
#               <bar type=opt2>searched_for</bar>
#          </foo>
#
#          will result only in "searched_for", because in the first foo-tag the
#          option doesn't match, the same goes for the first bar-tag.
#
# Prerequisites:
# - none
# ------------------------------------------------------------------------------
# Documentation:
# Extracts values from an XML file of nested tags presented at <stdin>.
# The given list of tags is searched recursively. Only the tag name has to
# be given, so
#
#             pGetXML foo
#
# will return the content of "<foo> .. </foo>". It is possible to refine tags
# by using "options", which will be searched for in the tag definition (see
#  below).
#
# Output goes to <stdout>.
#
#     Parameters: tag1[/opt1] [tag2[/opt2] ..tagN[/optN]] 
#     returns: void
# ------------------------------------------------------------------------------
# known bugs:
#
#     none
# ------------------------------------------------------------------------------
# ..........................(C) 2012 bakunin ..................................
# ------------------------------------------------------------------------------

function pGetXML
{
typeset chTag="$1"
typeset chOpt="$1"
typeset chLine=""

if [ "${chOpt#*/}" = "${chOpt}" ] ; then
     chOpt=""
else
     chOpt="${chOpt#*/}"
     chTag="${chTag%/*}"
fi

# DEBUG start
#      print -u2 - "inside pGetXML...."
#      print -u2 - "chTag=${chTag}"
#      print -u2 - "chOpt=${chOpt}"
#      print -u2 - "Args=$*\n"
# DEBUG end

if [ -n "$chTag" ] ; then
     shift
     sed -n '/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/ {
               :next
               /<\/'"$chTag"'[^>]*>/! {
                    N
                    b next
               }
             }
             /<\/'"$chTag"'[^>]*>/ {
               s/\n//g
               s/^.*<'"$chTag"'[^>]*'"$chOpt"'[^>]*>//
               s/<\/'"$chTag"'[^>]*>.*$//p
             }' |\
     tee -a xxx.$(date +'%H%M%N').out |\
     pGetXML "$@"
else
     tee -a xxx.last.out |\
     while read chLine ; do
          print - "$chLine"
     done
fi

return 0
}
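To try the revised sed stage outside of ksh, here is a rough sketch that wraps the same script in a small POSIX shell function. `peel` and the variable names are made up for this illustration, and the `tee` tracing is left out; the input follows the example from the header comment:

```shell
# Hypothetical wrapper: $1 is the tag name, $2 the (possibly empty) option.
peel() {
    tag=$1
    opt=$2
    sed -n '/<'"$tag"'[^>]*'"$opt"'[^>]*>/ {
        :next
        /<\/'"$tag"'[^>]*>/! {
            N
            b next
        }
    }
    /<\/'"$tag"'[^>]*>/ {
        s/\n//g
        s/^.*<'"$tag"'[^>]*'"$opt"'[^>]*>//
        s/<\/'"$tag"'[^>]*>.*$//p
    }'
}

xml='<foo type=opt2>
     <sometag>
</foo>
<foo type=opt1>
     <bar>
          somevalue
     </bar>
     <bar type=opt2>searched_for</bar>
</foo>'

# Two passes, as "pGetXML foo/opt1 bar/opt2" would chain them.
result=$(printf '%s\n' "$xml" | peel foo opt1 | peel bar opt2)
printf '%s\n' "$result"
```

One thing that shows up when the passes are run separately: the first pass also emits an empty line for the `</foo>` that closes the non-matching block, because the second sed block fires on any closing tag. Here the second pass discards it, but it may be worth guarding against if the last tag in the list can also occur with a non-matching option.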

bakunin