General Purpose XML Processing


 
# 1  
Old 05-19-2016

I've been kicking this around for a while now, I might as well post it here.
[edit] v0.0.9, now properly supporting self-closing tags.
[edit] v0.0.8, an important quoting fix and a minor change which should handle special <? <!-- etc. tags without seizing up as often. Otherwise the code hasn't changed much.

Code:
# yanx.awk v0.0.9, Tyler Montbriand, 2019.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your Web Browser Hates This"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DTDs, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS="" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

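# substr() positions start at 1.  A start of 0 happens to work in gawk and
# busybox, but awks that follow POSIX to the letter (mawk, for one) then
# return one character less, so start at 1 to be safe everywhere.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match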
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        sub(/\/$/, "", STR);    # Self-closing tags, mumblegrumble
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}
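
For anyone wondering what "token-by-token" looks like in practice, here is a minimal sketch of just the record-splitting trick from the BEGIN block, without yanx.awk itself. The printf format and the `tokens` variable are my own choices for illustration; none of yanx's special variables (TAG, TAGS, etc.) exist here.

```shell
# Split on < for records and > for fields, the same way yanx's BEGIN block does.
# printf '%s' is used so the input has no trailing newline to end up in $2.
tokens=$(printf '%s' '<body><html a="b">Your Web Browser Hates This</html></body>' |
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR > 1 { printf "record %d: $1=[%s] $2=[%s]\n", NR, $1, $2 }')
printf '%s\n' "$tokens"
```

Record 1 is the empty text before the first <, which is why it is skipped. Each remaining record has the tag and its parameters in $1 and any text that follows the tag in $2.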


Last edited by Corona688; 03-08-2019 at 11:41 AM.. Reason: v0.0.8, quoting fix
# 2  
Old 05-22-2016
Hello Corona688 (one of the gems of this forum),

First of all, a big THANK YOU for writing this brilliant code (always a fan of yours). Could you please post an example, ideally a complex one, with an input file and the code to process it? I apologize for bothering you about this, but it would help us understand the code more clearly. I would be grateful if you could do so.

Thanks,
R. Singh
# 3  
Old 05-24-2016
Rather than write one from scratch I'll crib from lxdorney's thread last week. He had complicated HTML tables he needed to extract into CSV. yanx.awk grew a new feature to deal with this, the CTAG variable, which tells your program when tags close!

Input:
Code:
...

								<table border="1" cellpadding="3" cellspacing="0">
<tbody>
<tr>



</tr>
<tr>
<td nowrap="nowrap" valign="top"><p><strong>AA Number. 3-456</strong></p></td>
<td style="text-align: justify;" valign="top">
<p>The quick brown fox jumps over the lazy dog near the bank of the 
river. The quick brown fox jumps over the lazy dog near the bank of the 
river.<br><em>(Hello World May 20, 2016)</em>&nbsp;<br>
<a href="http://www.mydomain.com/folder/pdf/abcd.pdf">Summary</a>&nbsp;|&nbsp;<a href="http://www.mydomain.com/folder/pdf/abcfull.pdf">Full Story</a></p>
</td>
<td style="text-align: center;" valign="top"><p>May 18, 2016</p></td>
</tr>

With output like this:
Code:
col1,col2,col3,col4,col5,col6

A few problems.
  1. This is HTML, not XML, with the usual HTML problems: &entities; and |visual garbage|. Luckily no javascript or other <!-- special garbage -->.
  2. Tables are used visually, not logically. The columns he wants aren't exactly HTML columns but parts of them.
  3. Some columns need several separate CDATA segments strung together.
  4. Others need arguments, not CDATA.
  5. Some rows are completely blank.
  6. Columns must be reorganized to place links at the end.

Extracting tabular data usually means appending each bit of CDATA to DATA[COLUMN], doing COLUMN++ whenever you hit a TD, and setting COLUMN=0 whenever you hit a TR. I shoehorn that one funny date into a column by counting <EM> as a column too, but that still leaves the problem of moving all the links to the end.

I could do that in a matching loop, maybe, but instead I just use two arrays. Links are caught in the LINKS array, leaving DATA and the column counter alone. Finding a link in this data is as easy as looking for the HREF argument. Note that yanx only clears the ARGS array when it parses the next open tag, so HREF is still set on the records that follow (the closing </a>, for instance) -- if you don't delete it yourself, you could start counting duplicates.
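
The bare column-counting idea, stripped of the link handling and of yanx itself, can be sketched roughly like this. It is a simplified stand-in, not the real program: the `emit_row` function, the `col`, `data` and `tag` names and the tiny sample table are all made up for the illustration, and it does no CSV quoting at all.

```shell
rows=$(printf '%s' '<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>' |
awk '
BEGIN { RS = "<"; FS = ">" }            # same tokenizing trick yanx uses

# Print the collected row, if any, and reset for the next one.
function emit_row(   c, line) {
    if (!col) return
    for (c = 1; c <= col; c++) {
        line = line (c > 1 ? "," : "") data[c]
        delete data[c]
    }
    print line
    col = 0
}

NR == 1 { next }                                     # empty record before the first <
{ tag = toupper($1); sub(/[ \t\r\n].*/, "", tag) }   # drop any attributes
tag == "TR" || tag == "/TABLE" { emit_row() }        # row boundary
tag == "TD" { col++ }                                # next column
col && $2 != "" { data[col] = data[col] $2 }         # append text to it
')
printf '%s\n' "$rows"
```

Each row is only printed when the next TR or the closing TABLE tag turns up, which is the same timing the real program relies on.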

Printing CSV mostly means being paranoid about escaping. Every quote and comma in the text gets a backslash in front of it: \", \, and so on. awk is silly about escapes: in a gsub() replacement it takes "\\\\&" to mean a real, physical backslash followed by the matched text. I wonder if it's shorter to just use octal.
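
If whatever reads the CSV expects RFC 4180 style quoting instead of backslashes, the usual convention is to wrap the field in double quotes and double any double quote inside it; commas then need no escaping at all. A minimal sketch of that alternative follows. The `csvq` function is a hypothetical helper made up for this example, and it is not what the program in this post does.

```shell
line=$(awk '
# Quote one field the RFC 4180 way: wrap it in double quotes and
# double any double quote inside it.  Commas are safe once quoted.
function csvq(s) {
    gsub(/"/, "\"\"", s)
    return "\"" s "\""
}
BEGIN { print csvq("May 20, 2016") "," csvq("5\" disk") }
')
printf '%s\n' "$line"
```

Which style is right depends entirely on the consumer, so check what it accepts before picking one.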

Anyway, the program:

Code:
BEGIN {
# Each csv line will take several prints.  Blank ORS and OFS to handle processing ourself
ORS="" ; OFS=""
# These are indexes for the DATA and LINKS arrays it accumulates data in.
LINK=1; COL=0 }

# Prints OFS " S " then sets OFS=","
# In effect this prints a single column of CSV and sets it up to prepend
# a comma to the next column.  It quotes and escapes S for you.
function csv(S) {
        # Escape all quotes and commas before printing.
        # Might be overkill but better safe than sorry.
        gsub(/[,"]/, "\\\\&", S);
        printf("%s\"%s\"", OFS, S); OFS=","
}

# This prints a full line of collected data.
# It tells when by looking for a <TR> tag or a </TABLE> tag.
# It also checks that COL is nonzero to avoid printing blank rows.
(TAG=="TR" || CTAG=="TABLE") && COL {
        OFS="" # Begin new line of CSV

        # Print current row if any, one column at a time.
        # csv() will intelligently escape and wrap stuff for us.
        # Delete contents of arrays so duplicates aren't printed next time.
        for(C=1; C<=COL; C++) { csv(DATA[C]);   delete DATA[C]; }
        for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
        printf("\n"); # csv does not print newline, we do
        COL=0; LINK=1 # Reset indexes for arrays
}

# Count columns in table.  This moves the index of DATA[] whenever it
# encounters a <TD> tag or an <EM> tag.
TAG=="TD" || TAG=="EM" { COL++ }

# Clean up HTML garbage.  All whitespace, &nbsp;, and | are converted to 
# a single space each.
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }

# Collect attachments when found by looking for HREF arguments
# whenever we're nested inside a TABLE tag, however deep.
TAGS ~ /TABLE/ && ARGS["HREF"] {
        # Delete everything up to and including last slash in URL.
        # This leaves nothing but the filename.
        sub(/.*[/]/, "", ARGS["HREF"]);

        # They go into LINKS, not DATA, as we must print them last.
        LINKS[LINK++]=ARGS["HREF"];
        # ARGS does not clear itself, delete to prevent duplicates
        delete ARGS["HREF"];
        # Skip to next tag, prevents below code from appending CDATA to DATA
        next 
}

# When we find CDATA that isn't blank, and are inside a TD tag anywhere,
# append CDATA i.e. column two, to DATA[COL].
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }

Use it like this:

Code:
$ awk -f yanx.awk -f html.awk input.html

"AA Number. 3-456","The quick brown fox jumps over the lazy dog etc...","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog etc...","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"

$


Last edited by Corona688; 05-24-2016 at 01:49 PM..
# 4  
Old 10-24-2018
0.0.8 posted, with an important fix for quoting and a small but important change for <? ?> <!-- etc tags. Should digest HTML somewhat better now.
# 5  
Old 11-24-2018
Thanks for the information. Can you please tell me how I can create an XML sitemap for my website? What is the main difference between HTML and XML?
# 6  
Old 11-24-2018
Quote:
Originally Posted by rahuldaso
What is the main difference between HTML and XML?
XML takes the principle of using markup - "tags" - to denote special parts of a text and puts it into a standardised format: tags are written "<tagname>", every tag has to be closed, and so on. (There are other ways of marking up text, like:

Code:
This is normal text [italic=on]but[italic=off] the last word was different.

which is also using the markup-principle but is not XML.)

HTML is a close relative of XML (both descend from SGML), with a fixed set of tags that can occur and a fixed order in which some of them have to appear:

Code:
<body>....</body>
<head>....</head>

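would be an error in HTML because the head-tag has to come before the body-tag. In XML this would be perfectly OK because there is no such rule. In fact, XML itself has no list of allowed tags and no prescribed structure for them like HTML has; it only demands that the markup is well-formed: every tag closed, tags properly nested, attribute values quoted.

So, in short: HTML has a fixed vocabulary and a forgiving syntax, while XML has a free vocabulary and a strict syntax. HTML written to XML's rules (XHTML) is valid XML too, but ordinary HTML with unclosed tags like <br> is not, and most XML is not valid HTML.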

I hope this helps.

bakunin
# 7  
Old 03-08-2019
Bugfix for v0.0.9, self-closing tags no longer dump their ending / inside the argument text.