Parse XML For Values


 
# 1  
Old 10-23-2014

Hi All,
I want to parse XML and extract the values of its tags for further processing. The XML looks like this:
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<allinput>
<input A="2389906" B="install">
<C>111</C>
<D>222</D>
<E>333</E>
<F></F>
<G>444</G>
<H></H>
<I></I>
<J></J>
<K>C,D,E,G</K>
<L>C,D,E,G</L>
<M>555</M>
</input>
<input A="4732435" B="delete">
<C>999</C>
<D>792</D>
<E></E>
<F></F>
<G>990</G>
<H>942</H>
<I>992</I>
<J></J>
<K>C,D,G,H,I</K>
<L>C,D,G,H,I</L>
<M>804</M>
</input>
</allinput>

I want to extract the values of A to M for each group and do processing based on those values. There may be only one group, or there may be hundreds.

Can someone suggest a way forward?

Thanks!
# 2  
Old 10-23-2014
It's hard to help you when you post data that's so obviously different from what the real data will look like. Obscuring is one thing, but this is altered perhaps too far to be a useful test.

Once again, my awk generic XML parser:

Code:
BEGIN {
        FS=">"; #       OFS=">";
        RS="<"; #       ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
        TAG=TAGS
        # Get the previous opened tag if any
        sub(/%.*/, "", TAG);
}

### Example of how to use it ###
# TAG is the name of the last open-tag
# TAGS is a %-separated string of open tag names like INNER%MIDDLE%OUTERMOST%
# $2 is CDATA inside the current tag
# ARGS is an array of arguments for the current tag
# Tag names are all converted to uppercase.
#
# So, when processing <a> in  <html><a href="index.html">Yay!</a></html>
# it would have:
# TAG="A"
# ARGS["HREF"]="index.html"
# TAGS="A%HTML%"
# $2="Yay!"

### Prints info on all open-tags and their CDATA whenever inside an <INPUT> tag.
### Tags with no CDATA are ignored.
(TAGS ~ /(^|%)INPUT%/) && ($2 ~ /[^ \r\n\t]/) {
        print "Data for tag " TAG" of " TAGS
        for(X in ARGS) print "\t"TAG"["X"]="ARGS[X]
        print "\tCDATA="$2
}

### Your Code Here ####

Code:
$ awk -f allinput.awk allinput.xml

Data for tag C of C%INPUT%ALLINPUT%
        CDATA=111

Data for tag D of D%INPUT%ALLINPUT%
        CDATA=222

Data for tag E of E%INPUT%ALLINPUT%
        CDATA=333

Data for tag G of G%INPUT%ALLINPUT%
        CDATA=444

Data for tag K of K%INPUT%ALLINPUT%
        CDATA=C,D,E,G

Data for tag L of L%INPUT%ALLINPUT%
        CDATA=C,D,E,G

Data for tag M of M%INPUT%ALLINPUT%
        CDATA=555

Data for tag C of C%INPUT%ALLINPUT%
        CDATA=999

Data for tag D of D%INPUT%ALLINPUT%
        CDATA=792

Data for tag G of G%INPUT%ALLINPUT%
        CDATA=990

Data for tag H of H%INPUT%ALLINPUT%
        CDATA=942

Data for tag I of I%INPUT%ALLINPUT%
        CDATA=992

Data for tag K of K%INPUT%ALLINPUT%
        CDATA=C,D,G,H,I

Data for tag L of L%INPUT%ALLINPUT%
        CDATA=C,D,G,H,I

Data for tag M of M%INPUT%ALLINPUT%
        CDATA=804

$
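
The core trick in the script above is the pair of separators set in BEGIN: RS="<" starts a new record at every tag, and FS=">" splits each record into the tag text ($1) and the character data that follows it ($2). A minimal sketch of just that step, on a made-up one-line fragment; the function name split_tags is only for illustration and is not part of the script:

```shell
# Split XML the way the script does: one record per "<", with the tag
# text in $1 and whatever follows the ">" in $2.
split_tags() {
        awk 'BEGIN { RS="<"; FS=">" }
             NR > 1 { printf("tag=[%s] data=[%s]\n", $1, $2) }'
}

# NR > 1 skips the empty record that precedes the first "<".
printf '%s' '<input A="1"><C>111</C></input>' | split_tags
# prints:
# tag=[input A="1"] data=[]
# tag=[C] data=[111]
# tag=[/C] data=[]
# tag=[/input] data=[]
```

Everything else in the script (the TAGS stack, ARGS, the close-tag handling) is bookkeeping layered on top of these records.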


Last edited by Corona688; 10-23-2014 at 01:20 PM..
# 3  
Old 10-23-2014
Thanks! I will try it.

---------- Post updated at 12:24 PM ---------- Previous update was at 12:00 PM ----------

Looks very good!... Thanks a lot.

Just two things:
1. I also need the values of A and B.
2. How can I execute some shell commands after each group, using that group's values A to M?
# 4  
Old 10-23-2014
1) Easy enough, but what do you want to do with them?
2) Good, now we're going somewhere.

Getting the data out of awk, into the shell, is the question now. Imagine you made a loop in the shell.

Code:
while [reading xml file]
do
       # What variables do you need here, set to what, for each tag?
done

Tell me exactly how you need to use this data and I can help create a loop for you.

A little more detail on the nature of your data would be good as well. If it's not as pretty as your example -- tags and data full of newlines, etc -- that might need some mangling to fix.
# 5  
Old 10-23-2014
I want to perform database queries based on the values of A to M. The value of B decides the type of query (insert, update or delete), and the values of C to M are used to update the attributes of the database table.

I am sorry but I cannot expose the data fields.
# 6  
Old 10-23-2014
It tells me nothing about your customer credit card list or whatever to tell me that your XML might be messy and full of extra newlines which should be tossed before your script sees the data. You could at least have answered that.

I don't need the actual data. I do need to know what you want to do with it. You want to run shell commands on "something" -- well, what shell commands would you be running, based on your mockup data? Assume each tag is a single column; you can do the splitting yourself.

Is there any safe separator I can use, anything that's not found in A through M? Does it ever contain quotes or tabs?
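
The separator question can be answered without showing any data by testing the file for the candidate character. A minimal sketch, assuming a POSIX shell and grep; the function name tab_report is made up for illustration:

```shell
# Report whether standard input contains a literal tab character.
tab_report() {
        # $(printf '\t') expands to a single tab; grep -q only sets the exit status
        if grep -q "$(printf '\t')"
        then
                echo "tab found"
        else
                echo "no tab"
        fi
}

printf '<C>1\t2</C>\n' | tab_report     # prints "tab found"
printf '<C>12</C>\n'   | tab_report     # prints "no tab"
```

Against the real file this would be run as `tab_report < allinput.xml`. The same idea works for quotes by changing the pattern passed to grep.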

Last edited by Corona688; 10-23-2014 at 03:56 PM..
# 7  
Old 10-23-2014
The best I can do without more information:

Code:
$ cat allinput.awk

BEGIN {
        FS=">"; OFS="\t"
        RS="<";

        # INPUTA, as in tag "input" attribute "a".  They must be allcaps here.
        split("INPUTA INPUTB A B C D E F G H I J K L M", ORDER, " ");
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS

#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

### Example of how to use it ###
# TAG is the name of the last open-tag
# TAGS is a %-separated string of open tag names like INNER%MIDDLE%OUTERMOST%
# $2 is CDATA inside the current tag
# ARGS is an array of arguments for the current tag
#
# So, when processing <a> in  <html><a href="index.html">Yay!</a></html>
# it would have:
# TAG="A"
# ARGS["HREF"]="index.html"
# TAGS="A%HTML%"
# $2="Yay!"

# Handle <input> tag
(TAGS ~ /^INPUT%/) {    for(X in ARGS)  DATA[TAG X]=ARGS[X]     }

# Parse <tags> inside <input> so DATA[TAGNAME]=CONTENTS
(TAGS ~ /(^|%)INPUT%/) && ($2 ~ /[^ \r\n\t]/) && !/^\// {
        # Clean up tag contents
        sub(/^[ \r\n]+/, "", $2);
        sub(/[ \r\n]+$/, "", $2);
        DATA[TAG]=$2
}

# Handle </input>, printing and clearing collected data
toupper($1) == "/INPUT" {
        PFIX=""
        for(M=1; M in ORDER; M++)
        {
                # Convert blank fields into single spaces, since the shell will see
                # two tabs in a row as one field, skipping the blank one.
                if(DATA[ORDER[M]]=="") DATA[ORDER[M]]=" "
                printf("%s%s", PFIX, DATA[ORDER[M]]);
                PFIX=OFS;
        }

        printf("\n");

        for(X in DATA) delete DATA[X];
}

$ awk -f allinput.awk allinput.xml

2389906 install                 111     222     333             444                     C,D,E,G C,D,E,G 555
4732435 delete                  999     792                     990     942    992              C,D,G,H,I       C,D,G,H,I       804

$ awk -f allinput.awk allinput.xml |
while IFS=$'\t' read INPUTA INPUTB A B C D E F G H I J K L M
do
        # Convert all single-space fields into completely blank fields
        for X in INPUTA INPUTB A B C D E F G H I J K L M
        do
                [ "${!X}" = " " ] && read $X # Cheeky trick to set arbitrary variable contents
        done < /dev/null
        echo "doing something with $INPUTA $INPUTB $L $M"
done

doing something with 2389906 install C,D,E,G 555
doing something with 4732435 delete C,D,G,H,I 804

$
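
The single-space placeholder in the script above exists because tab counts as whitespace in IFS, so the shell's read treats a run of tabs as one separator and silently drops empty fields. A small demonstration, assuming a POSIX shell; show_fields is a made-up helper name:

```shell
# Read one tab-separated line into three variables and show what landed where.
show_fields() {
        IFS=$(printf '\t') read -r X Y Z
        echo "X=[$X] Y=[$Y] Z=[$Z]"
}

# Empty middle field: the two tabs collapse and "c" shifts left into Y.
printf 'a\t\tc\n' | show_fields         # prints X=[a] Y=[c] Z=[]

# A single space as placeholder keeps the fields aligned.
printf 'a\t \tc\n' | show_fields        # prints X=[a] Y=[ ] Z=[c]
```

This is why the awk side prints " " for blank values and the shell loop turns " " back into an empty string.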

The best I can do without better information. It won't work if your data contains tabs anywhere. Tag and attribute names are hardcoded in the split() list in BEGIN, in the INPUT patterns, and in the variable lists of the shell loop.
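
Going by the description in post #5, the body of the loop would branch on INPUTB to pick the kind of query. A rough sketch of that step only, with no database involved. The mapping of "install" to an insert is a guess, since only "install" and "delete" appear in the sample data, and query_type is a hypothetical helper name:

```shell
# Map the B attribute to a kind of query (the mapping is assumed, not known).
query_type() {
        case "$1" in
        install)        echo insert ;;
        delete)         echo delete ;;
        *)              echo unknown ;;  # anything not seen in the sample data
        esac
}

# Stand-in for the first two columns of the awk output shown above.
printf '2389906\tinstall\n4732435\tdelete\n' |
while IFS=$(printf '\t') read -r INPUTA INPUTB
do
        echo "$INPUTA -> $(query_type "$INPUTB")"
done
# prints:
# 2389906 -> insert
# 4732435 -> delete
```

In the real loop the echo would be replaced by whatever command runs the query, using the remaining variables C to M.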

Last edited by Corona688; 10-23-2014 at 04:20 PM..