Sponsored Content
Top Forums UNIX for Beginners Questions & Answers UNIX shell script required to read two columns from xml Post 303031976 by Corona688 on Friday 8th of March 2019 10:39:21 AM
Old 03-08-2019
Handling XML is not trivial, but we get asked for it a lot, and I have a script for the purpose:

Code:
# yanx.awk v0.0.9, Tyler Montbriand, 2019.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your web browser hates this"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DOD, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS=" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuipo")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# !?!?!
# function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rbefore(STR)   { return(substr(STR, 0, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        sub(/\/$/, "", STR);    # Self-closing tags, mumblegrumble
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}

Case in point: Congratulations, this specific XML found a bug in yanx, hence v 0.0.9.

Code:
$ awk -f yanx-0.0.9.awk -e 'TAG=="CONNECTOR" { print ARGS["FROMINSTANCE"], ARGS["FROMFIELD"] }' ORS="\n" OFS="\t" input.xml
EPIC    DID
EPIC    PROJECT_NAME
SQ_EPIC DID
SQ_EPIC PROJECT_NAME

$

These 2 Users Gave Thanks to Corona688 For This Post:
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

shell script required to convert rows to columns

Hi Friends, I have a log file as below siteid = HYD spc = 100 rset = RS_D_M siteid = DEL spc = 200 rset = RS_K_L siteid = DEL2 spc = 210 rset = RS_D_M Now I need a output like column wise as below. siteid SPC rset HYD 100 RS_D_M (2 Replies)
Discussion started by: suresh3566
2 Replies

2. UNIX for Advanced & Expert Users

Shell script failing to read large Xml record-urgent critical help

Hi All, I have shell script running on AIX 5.3 box. It has 7 to 8 "sed" commands piped(|) together. It has a an Xml file as its input which has many records internally. There are certain record which which have more than hundered tags.The script is taking a huge amount of time more than 1.5 hrs... (10 Replies)
Discussion started by: aixjadoo
10 Replies

3. Shell Programming and Scripting

Help required on basic Unix Bourne Shell Script

Howdy People :), I'm a newbie & its my first question here. I've started learning Unix Bourne Shell scripting recently and struggling already :p Can someone PLEASE help me with the following problem. Somehow my script is not working. Display an initial prompt of the form: Welcome to... (1 Reply)
Discussion started by: methopoth
1 Replies

4. Shell Programming and Scripting

How to send email using shell script in UNIX, Is any environment setup required in Mac OS X ?

Hi All, I am using Mac OS X (Leopard OS). I am very new to UNIX. My requirement is that, by running a shell script, I create a log file. So I have to send a mail having that log file attached. What I tried to do is, I simply tried to check,whether this direct command works or not. So I... (2 Replies)
Discussion started by: Afreen
2 Replies

5. Shell Programming and Scripting

Unix Script to read the XML file from Website

Hi Experts, I need a unix shell script which can copy the xml file from the below pasted website and paste in in my unix directory. http://www.westpac.co.nz/olcontent/olcontent.nsf/fx.xml Thanks in Advance... (8 Replies)
Discussion started by: phani333
8 Replies

6. UNIX for Advanced & Expert Users

Shell Script to read XML tags and the data within that tag

Hi unix Gurus, I am really new to Unix Scripting. Please help me to create a shell script which reads the xml file and from that i need to fetch a particular information. For example <SOURCE BUSINESSNAME ="" DATABASETYPE ="Teradata" DBDNAME ="DWPROD3" DESCRIPTION ="" NAME... (2 Replies)
Discussion started by: SmilePlease
2 Replies

7. Shell Programming and Scripting

Shell Script to read XML file

Hi unix Gurus, I am really new to Unix Scripting. Please help me to create a shell script which reads the xml file and from that i need to fetch a particular information. For example <SOURCE BUSINESSNAME ="" DATABASETYPE ="Teradata" DBDNAME ="DWPROD3" DESCRIPTION ="" NAME... (5 Replies)
Discussion started by: SmilePlease
5 Replies

8. Red Hat

How to read an xml file through shell script?

Hey , can we read an xml file and make changes in it through shell script. Thanks (4 Replies)
Discussion started by: ramsavi
4 Replies

9. Shell Programming and Scripting

UNIX Shell script to work with .xml file

Hi Team, Could you please help me on below query: I want to retrieve XML elements from one .xml file. This .xml file has commented tags as well. so i am planning to write Unix command/script which 1.will chekc for this .xml file 2. it will ignore the commented XML lines. i.e. XML tags between... (3 Replies)
Discussion started by: waiting4u
3 Replies

10. Shell Programming and Scripting

Read xml tags and then remove the tag using shell script

<Start> <Header> This is header section </Header> <Body> <Body_start> This is body section <a> <b> <c> <st>111</st> </c> <d> <st>blank</st> </d> </b> </a> </Body_start> <Body_section> This is body section (3 Replies)
Discussion started by: RJG
3 Replies
All times are GMT -4. The time now is 10:12 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy