perl regexp: matching the parent xml tag


 
# 1  
Old 05-18-2010
perl regexp: matching the parent xml tag

Hi folks. I would like to remove the full parent (outer) xml tag from a file given a matching child (inner) tag, in a bash shell.

To be more specific, this is what I have so far:

Code:
$ cat myFile.xml
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>

$ perl -pe 'BEGIN {undef $/} s|\n<Outer>.+?<Inner>0</Inner>.+?</Outer>||sg' myFile.xml
<Sometag></Sometag>
<Outer>
    <Inner>2</Inner>
</Outer>

The goal is to remove all Outer tags that contain an Inner tag with value 0. However, the above command clearly doesn't do what I want it to. In particular, the first .+? seems to be greedy, and I don't understand why. Does anybody know how I can do it? I appreciate any help.

I'm not bound to perl, but AFAIK perl is the easiest choice for multi-line matching. Any working alternative (sed, awk?) is perfectly welcome.

Last edited by BatManWSL; 05-18-2010 at 10:32 AM..
# 2  
Old 05-18-2010
XSLT is generally a better mechanism for handling this sort of document transformation.

Assuming you first make your XML document well-formed by wrapping its contents in a single root element, the following stylesheet will do what you want.
Code:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
  <xsl:output method="xml" />

  <xsl:template match="Outer">
      <xsl:if test="not(Inner = 0)" >
         <xsl:copy-of select="." />
      </xsl:if>
  </xsl:template>

  <xsl:template match="node()|@*">
     <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
     </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The first template does all the heavy lifting. The second template is just an identity transformation.

Here is some sample output:
Code:
$ xsltproc example.xsl example.xml
<?xml version="1.0"?>
<root>
<Sometag/>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
</root>
$
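
The root element assumed above is not present in the file from the first post. A minimal sketch of one way to add it, assuming the file names myFile.xml and example.xml and a scratch directory (the wrapping step itself is not shown in the thread):

```shell
# Work in a scratch directory so that no existing files are overwritten.
cd "$(mktemp -d)"

# Recreate the sample file from the first post (it has no single root element).
cat > myFile.xml <<'EOF'
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
EOF

# Wrap the fragment in <root>...</root> so that an XML parser accepts it.
{ echo '<root>'; cat myFile.xml; echo '</root>'; } > example.xml

head -n 1 example.xml
tail -n 1 example.xml
```

The resulting example.xml is what a command such as the xsltproc call above would take as input.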

# 3  
Old 05-18-2010
Quote:
Originally Posted by BatManWSL
...
Code:
$ cat myFile.xml
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
 
$ perl -pe 'BEGIN {undef $/} s|\n<Outer>.+?<Inner>0</Inner>.+?</Outer>||sg' myFile.xml
<Sometag></Sometag>
<Outer>
    <Inner>2</Inner>
</Outer>

The goal is to remove all Outer tags that contain an Inner tag with value 0. However, the above command clearly doesn't do what I want it to. In particular, the first .+? seems to be greedy, and I don't understand why. Does anybody know how I can do it?...
I agree with fpmurphy on this mainly for two reasons:
(a) complexity of Perl regular expressions increases with that of your XML processing requirements, and
(b) a change in the XML structure could render the entire regex useless. That could be a *very* frustrating experience.

Nevertheless, I'd like to answer your questions.

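Firstly, all the standard quantifiers - *, +, ? and {m,n} - are greedy by default. Appending ? (as in your .+?) makes them non-greedy, but that only means "consume as little as possible from the current starting point". It does not move the starting point: the regex engine still returns the leftmost match it can find.

Secondly, your regex works exactly as expected.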

Code:
$
$
$ cat myfile.xml
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
$
$
$ perl -pe 'BEGIN {undef $/} s|\n<Outer>.+?<Inner>0</Inner>.+?</Outer>||sg' myfile.xml
<Sometag></Sometag>
<Outer>
    <Inner>2</Inner>
</Outer>
$
$

The match starts at the first "\n<Outer>" in the file, which opens the block containing "<Inner>1</Inner>", not the block you wanted to remove.

Note that the first ".+?" then matches everything between that first "<Outer>" and "<Inner>0</Inner>", and that includes the part of the string that has "</Outer>" in it.

You need to tell Perl to look ahead at each character it consumes after "<Outer>" and to refuse to go on if "</Outer>" starts at that position. The same applies to the text after "</Inner>": look ahead at each character, but stop if "<Outer>" starts there.

That's where the "negative lookahead" construct, (?!...), comes into the picture.

Your regex should have been:

Code:
$
$ cat myfile.xml
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
$
$
$ perl -pe 'BEGIN{undef $/} s|\n<Outer>((?!</Outer>).)*<Inner>0</Inner>((?!<Outer>).)*</Outer>||msg' myfile.xml
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
$
$
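
The first post also asked whether sed or awk could do the job. A minimal awk sketch, assuming every <Outer> and </Outer> tag sits on a line of its own and Outer elements are never nested (both hold for the sample file, but neither is guaranteed by the thread); the variable names inside and block are my own:

```shell
# Work in a scratch directory so that no existing files are overwritten.
cd "$(mktemp -d)"

cat > myFile.xml <<'EOF'
<Sometag></Sometag>
<Outer>
    <Inner>1</Inner>
</Outer>
<Outer>
    <stuff>alot</stuff>
    <Inner>0</Inner>
    <more>even</more>
</Outer>
<Outer>
    <Inner>2</Inner>
</Outer>
EOF

result=$(awk '
    /<Outer>/ { inside = 1; block = "" }   # start buffering a new Outer element
    inside {
        block = block $0 "\n"
        if ($0 ~ /<\/Outer>/) {             # element complete: keep it or drop it
            if (block !~ /<Inner>0<\/Inner>/) printf "%s", block
            inside = 0
        }
        next
    }
    { print }                               # lines outside any Outer pass through
' myFile.xml)
printf '%s\n' "$result"
```

Buffering one element at a time avoids the cross-element matching problem altogether, at the cost of the layout assumptions stated above.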

Having said that, if you want to explore XML processing with Perl, check out the XML::Twig module on CPAN or at xmltwig.com.

HTH,
tyler_durden

Last edited by durden_tyler; 05-18-2010 at 04:02 PM..
# 4  
Old 05-19-2010
Very nice, thank you for your replies!