· simerian · XML Extract


 
# 1  
Old 10-29-2003

The script that follows in this thread locates and extracts data from an XML stream in a variety of forms. With it you can pull out almost any subset of the XML, including the portions before and after a given point, so that new data can later be inserted into the "original" XML at any logical point.

The script sits at the end of a pipe and takes 3 positional parameters:
Code:
	SAXPrint xml_file | ExtractXML.ksh "genealogy" "extract" "delimiter"

Note that the XML must be in the same format as that output by SAXPrint. For this reason, the example above pipes the output of SAXPrint directly into the script. However, a suitably formatted file may simply be concatenated through the pipe, or the contents of a variable echoed through it. The latter is useful if a number of iterative calls need to be made in order to extract the desired content.
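The script posted further down reads its input one line at a time and keeps a depth count from the lines that open or close an element, so each tag is expected on its own line. A rough sketch of that bookkeeping follows. The function name show_depth is hypothetical and the patterns are simplified, so treat it as an illustration rather than the script's exact logic:

```shell
#!/bin/sh
# show_depth is a hypothetical helper, not part of ExtractXML.ksh.
# It prints each line of a one-tag-per-line XML stream with its nesting depth.
show_depth() {
    awk '
    /^[ \t]*<\?/       { next }        # skip the XML declaration
    /^[ \t]*<[A-Za-z]/ { depth++ }     # the line opens an element
    {
        line = $0
        sub(/^[ \t]*/, "", line)       # drop the leading indentation
        printf "%d %s\n", depth, line
    }
    /<\// || /\/>/     { depth-- }     # the line closes an element
    '
}

printf '%s\n' \
    '<?xml version="1.0"?>' \
    '<DOCUMENT DocumentID="1">' \
    '   <EVENT Key="0">' \
    '      <DATE>28102003</DATE>' \
    '   </EVENT>' \
    '</DOCUMENT>' | show_depth
```

If two elements share a line, or one tag is wrapped over several lines, a count like this drifts, which is presumably why the SAXPrint layout is required.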

The genealogy value defines the waypoints within the XML schema that must be traversed to reach the final destination. Along with element definitions, multiple attribute criteria may be supplied to enable selection of elements within multiple occurrences. This shall be explained in further detail along with the examples.
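To make the waypoint idea concrete, here is a minimal sketch of how a genealogy could be broken into tags and criteria. It mirrors what the BEGIN block of the script appears to do, but parse_genealogy is a hypothetical name and the output format is my own:

```shell
#!/bin/sh
# parse_genealogy is a hypothetical helper, not part of ExtractXML.ksh.
# $1 = genealogy, $2 = single-character delimiter (defaults to "/").
parse_genealogy() {
    awk -v g="$1" -v d="${2:-/}" 'BEGIN {
        # The genealogy is upper-cased, so it is not case-sensitive.
        n = split(toupper(g), part, d)
        for (i = 1; i <= n; i++) {
            dot = index(part[i], ".")
            if (dot > 0) {
                tag  = substr(part[i], 1, dot - 1)   # element name
                crit = substr(part[i], dot + 1)      # attribute criteria
            } else {
                tag  = part[i]
                crit = ""
            }
            printf "%d tag=%s criteria=%s\n", i, tag, crit
        }
    }'
}

parse_genealogy "document/event.key=1"
parse_genealogy "document:event.key=1.method=edit" ":"
```

The first call prints "1 tag=DOCUMENT criteria=" and "2 tag=EVENT criteria=KEY=1". The second call does the same with ":" as the delimiter and carries both criteria on the EVENT waypoint.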

The extract value defines the type of data extraction desired; the allowed values are as follows:
Code:
[caret]		^	Non-inclusive pre-read to the opening target element.
[minus]		-	Inclusive pre-read to the opening target element.
[plus]		+	Inclusive post-read from the opening target element.
[dollar]		$	Non-inclusive post-read from the opening target element.
[hash]		#	Inclusive extract from the opening to the closing target element.
[less]		<	Non-inclusive pre-read to the closing target element.
[greater]	>	Inclusive post-read from the closing target element.
["at"]		@	Extract whole opening target element.

[attribute]	The name of an attribute within the opening target element, the value of which is extracted.

Note that when extracting an attribute, only one attribute name may be supplied.
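The extract value is therefore either one of the eight single-character operators or an attribute name. A small sketch of that decision is shown below. classify_extract is a hypothetical name, and the fallback to "@" for an empty value follows what the script itself does:

```shell
#!/bin/sh
# classify_extract is a hypothetical helper, not part of ExtractXML.ksh.
# It reports whether an extract value is an operator or an attribute name.
classify_extract() {
    awk -v x="$1" 'BEGIN {
        ops = "^-$+@#<>"
        if (x == "") x = "@"     # an empty extract value falls back to "@"
        if (length(x) == 1 && index(ops, x) > 0)
            printf "operator %s\n", x
        else
            printf "attribute %s\n", toupper(x)
    }'
}

classify_extract '#'
classify_extract 'filesize'
```

This prints "operator #" and then "attribute FILESIZE". Remember to quote characters such as "$", "#", "<" and ">" on the command line, otherwise the shell will interpret them before the script sees them.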

The delimiter value defines the single character separating the waypoints within the genealogy definition.

Last edited by Simerian; 10-29-2003 at 07:38 AM..
# 2  
Old 10-29-2003
ExtractXML.ksh

Code:
#!/usr/bin/ksh
#
# COPYRIGHT (c) 2003 - SIMERIAN
# 
# e: info@simerian.com
# w: www.simerian.com
#
# DISCLAIMER
# The author of this product does not accept any responsibility for
# loss or damages resulting from the use of said product and makes no
# warranty or representation, either express or implied, including but
# not limited to, any implied warranty of merchantability or fitness for a
# particular purpose. This product is provided "AS IS", and you, its user,
# assume all risks when using it.
#
# DISTRIBUTION
# You may freely redistribute this product subject to the following conditions:
# 1) that the whole product is redistributed, AND,
# 2) that the product or any of its components are NOT altered, AND,
# 3) that no charge be made for any redistribution (ex. consumables & handling).
#
# Feedback is appreciated in order that products can be supported & improved.

#---v----1----v----2----v----3----v----4----v----5----v----6----v----7----v

	typeset -ru     pGenealogy=$1
	typeset -ru     pExtract=$2
	typeset -ru     pDelimiter=$3

	# Genealogy - Supplied as a delimited path name with each element
	# accompanied by an optional selection criteria.  The delimiter
	# character may be defined by the last option, the default is 
	# forward slash character "/":
	#
	# e.g. ELEMENT1/ELEMENT2.ATTRIBUTE=1/ELEMENT3
	#
	# The genealogy is not case-sensitive and double-quotes do not need to be
	# placed around the value setting.

	# Extract - Defines the type of data extract required of the routine:
	#
	# "-" Inclusive Pre-Read to the opening target element.
	# "^" Non-inclusive Pre-Read to the opening target element.
	# "$" Non-inclusive Post-Read from the opening target element.
	# "+" Inclusive Post-Read from the opening target element.
	# "@" Extract opening target element.
	# "#" Extract opening & closing target element and any child elements.
	# "<" Non-inclusive Pre-Read to the closing target element.
	# ">" Inclusive Post-Read from the closing target element.
	#
	# Example: Using INSIDE as the target element:
	#
	#             - ^                                 <
	#             | |                                 |
	#             | +-{ <OUTER ...>                   |
	#      +----> +---{    <INSIDE ...> }---+ <--- @  |
	#    # |                  <XXX ...> }-+ |         |
	#      |                  </XXX>      | | }-------+
	#      +---->          </INSIDE>      | | }-------+
	#                   </OUTER>          | |         |
	#                                     | |         |
	#                                     $ +         >

# Extract XML component.

	typeset         vCMD=""

	vCMD="${vCMD} -v pG=${pGenealogy}"
	vCMD="${vCMD} -v pX=${pExtract}"
	vCMD="${vCMD} -v pD=${pDelimiter:-/}"

	awk ${vCMD} '

BEGIN {

	pGenealogy=toupper(pG)
	pExtract=toupper(pX)
	pDelimiter=toupper(pD)

	cDEBUG=0

	cSPC=3    

	cPRE_NON="^"
	cPRE_INC="-"
	cPOST_NON="$"
	cPOST_INC="+"
	cGROUP="#"
	cPRE_CLOSE="<"
	cPOST_CLOSE=">"
	cTAG="@"

	cATTRIBUTE="."

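	vINDEX=cPRE_NON cPRE_INC cPOST_NON cPOST_INC cTAG cGROUP cPRE_CLOSE cPOST_CLOSE

	if (pExtract == "") pExtract=cTAG

	if (cDEBUG) { printf "# vINDEX <%s>\n",vINDEX }
	# A plain index() lookup is used rather than match(), because "^", "$"
	# and "+" are regular expression operators and the "\%s" escape is not
	# handled the same way by every awk.
	vPos=index(vINDEX,pExtract)
	if (cDEBUG) {
	   printf "~ Position <%s>\n",vPos
	   printf "~ Length <%s>\n",length(pExtract)
	}
	# Anything other than a single operator character is an attribute name.
	if (vPos == 0 || length(pExtract) != 1) {
	   vAttribute=pExtract
	   pExtract=cATTRIBUTE
	} 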

	if (cDEBUG) {
	   printf "# vAttribute <%s>\n",vAttribute
	   printf "# pExtract <%s>\n",pExtract
	}
	   
	cDEPTH=0
	cCRITERIA=1

	if (cDEBUG) {
	   printf "# Command Line Parameters:\n"
	   printf "~ Genealogy <%s>\n",pGenealogy
	   printf "~ Extract <%s>\n",pExtract
	   printf "~ Delimiter <%s>\n",pDelimiter
	}

	vGenealogyDepth=split(pGenealogy,vElement,pDelimiter)

	if (cDEBUG) {
	   printf "# Mapping:\n"
	}

	vTAGPrev=""
	for (d=1; d <= vGenealogyDepth; d++) {
	    x=index(vElement[d],".")
	    if (x > 0) {
	       vTAG=substr(vElement[d],1,x-1)
	       vCriteria=fnFormatCriteria(substr(vElement[d],x+1))
	    } else {
	       vTAG=vElement[d]
	       vCriteria=""
	    }

	    vIDX_Key[d]=vTAG
	    vIDX_TAG[vTAG,cDEPTH]=d
	    vIDX_TAG[vTAG,cCRITERIA]=vCriteria

	    if (cDEBUG) printf "~ %s. [%s](%s)",d,vTAG,vCriteria
	    if (d > 1) {
	       vIDX_Parent[vTAG]=vTAGPrev
	       vIDX_Child[vTAGPrev]=vTAG
	       if (cDEBUG) {
	          printf ", Parent[%s]",vIDX_Parent[vTAG]
	          printf ", Child[%s]=%s",vTAGPrev,vIDX_Child[vTAGPrev]
	       }
	    }
	    vTAGPrev=vTAG
	    if (cDEBUG) printf "\n"
	}

	vTAG=vIDX_Key[1]
	vTAGMatch=sprintf("<%s",vTAG)

	if (cDEBUG) {
	   printf "# Target Point:\n"
	   printf "~ Tag <%s>\n",vIDX_Key[vGenealogyDepth]
	   printf "~ Depth <%s>\n",vGenealogyDepth
	   printf "# Starting Point:\n"
	   printf "~ Tag <%s>\n",vIDX_Key[1]
	   printf "# Searching:\n"
	}

	if (cDEBUG) { printf "# Printing: " }
	if (pExtract == cPRE_NON || pExtract == cPRE_INC || pExtract == cPRE_CLOSE) {
	   fPrintXML=1
	   if (cDEBUG) { printf "ON\n" }
	} else {
	   fPrintXML=0
	   if (cDEBUG) { printf "OFF\n" }
	}
}

function fnFormatCriteria (pCriteria) {

	if (pCriteria == "") return ""

	vConditionMax=split(pCriteria,vCondition,".")
	
	pCriteria=""

	for (c=1; c <= vConditionMax; c++) {
	    if (index(vCondition[c],"=") != 0) {

	       sub("=\"","=",vCondition[c])
	       sub("=","=\"",vCondition[c])

	       sub("\"$","",vCondition[c])
	       sub("$","\"",vCondition[c])

	       pCriteria=sprintf("%s|%s",pCriteria,vCondition[c])
	    }
	}

	sub("^[|]","",pCriteria)

	return pCriteria
}

/^[[:space:]]*<\?/ {

	next
}

{
	vXMLCopy=$0
	sub("^[[:space:]]*","",vXMLCopy)
	sub("[[:space:]]*$","",vXMLCopy)
	vXMLMask=toupper(vXMLCopy)
}

/^[[:space:]]*<[[:alpha:]]+/ {

	vOpen=NR

	vDepthCur++
	vINDENT=sprintf("%*.*s",(vDepthCur-1)*cSPC,(vDepthCur-1)*cSPC," ")

	if (cDEBUG) printf "~ +[%02d] [%-60.60s]\n",vDepthCur,vINDENT vXMLCopy
}

(index(vXMLMask,vTAGMatch) == 1) && (vDepthCur == vIDX_TAG[vTAG,cDEPTH]) {

	fCriteriaMatch=0
	vCriteria=vIDX_TAG[vTAG,cCRITERIA]
	if (cDEBUG) {
	   printf "# !Matching! TAG [%s]",vTAG
	   printf ", Depth [%s]",vIDX_TAG[vTAG,cDEPTH]
	   printf ", Criteria [%s]\n",vIDX_TAG[vTAG,cCRITERIA]
	}
	if (vCriteria != "") {
	   vConditionMax=split(vCriteria,vCondition,"|")
	   c=0
	   while (c < vConditionMax && index(vXMLMask,vCondition[c+1])) { c++ }
	   if (c == vConditionMax) { fCriteriaMatch=1 } 
	}

	if (vDepthCur == vGenealogyDepth) {
	   if (vCriteria == "" || fCriteriaMatch) {

	      if (cDEBUG) printf "# !ACQUIRED! [%s](%s)\n",vTAG,vCriteria

	      if (pExtract == cPRE_NON) exit 0
	      if (pExtract == cPRE_INC) fDitherAM=1

	      if (pExtract == cPOST_NON) fDitherPM=1
	      if (pExtract == cPOST_INC) fPrintXML=1

	      if (pExtract == cTAG) {
		 sub(sprintf("</%s>",vTAG),"",vXMLCopy)
		 printf "%s\n",vXMLCopy
		 exit 0
	      }

	      if (pExtract == cGROUP) { vDepthOpen=vDepthCur; fPrintXML=1; fGrouping=1 }

	      if (pExtract == cPRE_CLOSE) { vDepthOpen=vDepthCur; fClosedAM=1 }
	      if (pExtract == cPOST_CLOSE) { vDepthOpen=vDepthCur; fClosedPM=1 }

	      if (pExtract == cATTRIBUTE) {
	         vRegExp=sprintf("%s=\"[^\"]*\"",vAttribute)
	         match(vXMLMask,vRegExp)
	         if (vAttribute != "" && RSTART > 0) {
	            vTypeset=substr(vXMLCopy,RSTART,RLENGTH)
	            match(vTypeset,"\"[^\"]*\"$")
	            vAttributeValue=substr(vTypeset,RSTART+1,RLENGTH-2)
	            printf "%s",vAttributeValue
		    exit 0
	         }
	      }
	   }
	} else {
	   if (vCriteria == "" || fCriteriaMatch) {
	      if (cDEBUG) printf "# !Waypoint! [%s](%s)\n",vTAG,vCriteria
	      vTAG=vIDX_Child[vTAG]
	      vTAGMatch=sprintf("<%s",vTAG)
	   }
	}
}

/<\// || /\/>/ {

	vClose=NR
	if (cDEBUG) {
	   if (vOpen == vClose) {
	      printf "~ -[%02d] [%-60.60s]\n",vDepthCur,vINDENT "."
	   } else {
	      printf "~ -[%02d] [%-60.60s]\n",vDepthCur,vINDENT vXMLCopy
	   }
	}

	if (fGrouping && vDepthCur == vDepthOpen) fDitherAM=1
	if (fClosedAM && vDepthCur == vDepthOpen) exit 0
	if (fClosedPM && vDepthCur == vDepthOpen) fPrintXML=1

	fReduceDepth=1
}

fPrintXML {

	printf "%s%s\n",vINDENT,vXMLCopy
}

fDitherAM || fDitherPM {

	if (fDitherAM) exit 0

	if (fDitherPM && fPrintXML == 0) fPrintXML=1
}

fReduceDepth {

	fReduceDepth=0
	vDepthCur--
	vINDENT=sprintf("%*.*s",(vDepthCur-1)*cSPC,(vDepthCur-1)*cSPC," ")
}
	' <&0 >&1

#---v----1----v----2----v----3----v----4----v----5----v----6----v----7----v

	exit 0


Last edited by Simerian; 10-29-2003 at 07:45 AM..
# 3  
Old 10-29-2003
Examples

Code:
<?xml version="1.0" encoding="LATIN1"?>
<DOCUMENT DocumentID="1">
	<EVENT Key="0" FileName="filename" FileSize="1" Method="create">
		<DATE>28102003</DATE>
		<TIME>110000</TIME>
	</EVENT>
	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">
		<DATE>29102003</DATE>
		<TIME>135600</TIME>
	</EVENT>
</DOCUMENT>

Extracting up to but not including <EVENT Key="1">: Use "document/event.key=1" "^"

<DOCUMENT DocumentID="1">
	<EVENT Key="0" FileName="filename" FileSize="1" Method="create">
		<DATE>28102003</DATE>
		<TIME>110000</TIME>
	</EVENT>

Extracting up to and including <EVENT Key="1">: Use "document/event.key=1" "-"

<DOCUMENT DocumentID="1">
	<EVENT Key="0" FileName="filename" FileSize="1" Method="create">
		<DATE>28102003</DATE>
		<TIME>110000</TIME>
	</EVENT>
	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">

Extracting from and including <EVENT Key="1">: Use "document/event.key=1" "+"

	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">
		<DATE>29102003</DATE>
		<TIME>135600</TIME>
	</EVENT>
</DOCUMENT>

Extracting from but not including <EVENT Key="1">: Use "document/event.key=1" "$"

		<DATE>29102003</DATE>
		<TIME>135600</TIME>
	</EVENT>
</DOCUMENT>

Extracting the whole of the <EVENT Key="1"> element group: Use "document/event.key=1" "#"

	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">
		<DATE>29102003</DATE>
		<TIME>135600</TIME>
	</EVENT>

Extracting up to the last insertion point of <EVENT Key="1">: Use "document/event.key=1" "<"

<DOCUMENT DocumentID="1">
	<EVENT Key="0" FileName="filename" FileSize="1" Method="create">
		<DATE>28102003</DATE>
		<TIME>110000</TIME>
	</EVENT>
	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">
		<DATE>29102003</DATE>
		<TIME>135600</TIME>

Extracting from the last insertion point of <EVENT Key="1">: Use "document/event.key=1" ">"

	</EVENT>
</DOCUMENT>

Extracting the opening element for <EVENT Key="1">: Use "document/event.key=1" "@"

	<EVENT Key="1" FileName="filename" FileSize="3" Method="edit">

Extracting a specific attribute from <EVENT Key="1">: Use "document/event.key=1" "filesize"

	3

Attribute criteria can be given on several waypoints, or several criteria on a single waypoint, using the forms:

	"element.attribute=value/element.attribute=value/..."
or	"element.attribute=value.attribute=value/..."
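As a rough illustration of the second form, the sketch below keeps only those lines whose attributes satisfy every condition in a dot-separated list. It follows the same idea as the script (upper-case everything, rebuild each condition as NAME="VALUE", then look for it in the tag text), but match_criteria is a hypothetical name and it assumes the values are supplied without quotes:

```shell
#!/bin/sh
# match_criteria is a hypothetical helper, not part of ExtractXML.ksh.
# $1 = dot-separated conditions, e.g. key=1.method=edit (values unquoted).
match_criteria() {
    awk -v crit="$1" '
    BEGIN {
        n = split(toupper(crit), cond, ".")
        m = 0
        for (i = 1; i <= n; i++) {
            eq = index(cond[i], "=")
            if (eq == 0) continue          # ignore anything without "="
            # NAME=VALUE becomes NAME="VALUE", as it appears inside a tag
            want[++m] = substr(cond[i], 1, eq) "\"" substr(cond[i], eq + 1) "\""
        }
    }
    {
        mask = toupper($0)
        for (i = 1; i <= m; i++)
            if (index(mask, want[i]) == 0) next   # one condition failed
        print                                     # every condition matched
    }'
}

printf '%s\n' \
    '<EVENT Key="0" Method="create">' \
    '<EVENT Key="1" Method="edit">' | match_criteria "key=1.method=edit"
```

Only the second EVENT line is printed. Because both sides are upper-cased before comparison, attribute values are matched without regard to case as well as the names.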

Some of the more astute amongst you will no doubt have bemoaned the fact that the script currently treats element and attribute names as case-insensitive, which is contrary to the XML standard (XML names are case-sensitive). This can be rectified with a little code editing, which I shall leave to the more adventurous of you.

HINT: Look for the use of toupper!
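For anyone taking up the challenge, the sketch below shows the effect of the upper-casing on the tag test. tag_matches and its "sensitive" switch are hypothetical and not part of the script. Note also that the ksh wrapper declares its parameters with typeset -u, so the case folding happens in the shell as well as in awk:

```shell
#!/bin/sh
# tag_matches is a hypothetical helper, not part of ExtractXML.ksh.
# $1 = tag name, $2 = trimmed XML line, $3 = "sensitive" to compare as-is.
tag_matches() {
    awk -v tag="$1" -v line="$2" -v mode="$3" 'BEGIN {
        if (mode != "sensitive") {
            # fold both sides to upper case before comparing
            tag  = toupper(tag)
            line = toupper(line)
        }
        # prefix test: the line must start with "<" followed by the tag
        if (index(line, "<" tag) == 1) print "match"
        else                           print "no match"
    }'
}

tag_matches "event" '<EVENT Key="1">' insensitive
tag_matches "event" '<EVENT Key="1">' sensitive
```

The first call prints "match" and the second "no match", which is the behaviour you would get after removing the case folding throughout.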