General Purpose XML Processing

#1  05-19-2016, Corona688

I've been kicking this around for a while now, so I might as well post it here.


Code:
# yanx.awk v0.0.6, Tyler Montbriand, 2016.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your Web Browser Hates This"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DTDs, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS="" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT, Q, RMID) { # Q and RMID are locals
        while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        TAG=""
        CTAG=toupper($1)
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}
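
For anyone wondering what the RS/FS trick in the BEGIN block actually does to the input, here is a minimal standalone sketch. It does not load yanx.awk; tokenize is a made-up helper name, and the output format is only for illustration.

```shell
# Standalone sketch of the tokenizing trick; yanx.awk itself is not loaded.
# RS="<" starts a new record at every tag, FS=">" splits the tag from its text.
tokenize() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR == 1 { next }    # whatever precedes the first "<" is empty, skip it
         { printf("tag=[%s] text=[%s]\n", $1, $2) }'
}

printf '%s' '<body><html a="b">Your Web Browser Hates This</html></body>' | tokenize
# tag=[body] text=[]
# tag=[html a="b"] text=[Your Web Browser Hates This]
# tag=[/html] text=[]
# tag=[/body] text=[]
```

Everything else in the script (TAG, TAGS, ARGS and so on) is bookkeeping layered on top of these two fields.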


Last edited by Corona688; 06-06-2016 at 03:14 PM.. Reason: v0.0.6, clears TAG on close-tag
#2  05-22-2016, RavinderSingh13
Hello Corona688 (one of the gems of this forum),

First of all, a big THANK YOU for writing this brilliant code (always a fan of yours). Could you please post an example, ideally a complex one, with an Input_file and the code to go with it? I apologize for bothering you about this, but it would help us understand the code more clearly. I would be grateful if you could do so.

Thanks,
R. Singh
#3  05-24-2016, Corona688
Rather than write one from scratch I'll crib from lxdorney's thread last week. He had complicated HTML tables he needed to extract into CSV. yanx.awk grew a new feature to deal with this, the CTAG variable, which tells your program when tags close!

Input:

Code:
...

								<table border="1" cellpadding="3" cellspacing="0">
<tbody>
<tr>



</tr>
<tr>
<td nowrap="nowrap" valign="top"><p><strong>AA Number. 3-456</strong></p></td>
<td style="text-align: justify;" valign="top">
<p>The quick brown fox jumps over the lazy dog near the bank of the 
river. The quick brown fox jumps over the lazy dog near the bank of the 
river.<br><em>(Hello World May 20, 2016)</em>&nbsp;<br>
<a href="http://www.mydomain.com/folder/pdf/abcd.pdf">Summary</a>&nbsp;|&nbsp;<a href="http://www.mydomain.com/folder/pdf/abcfull.pdf">Full Story</a></p>
</td>
<td style="text-align: center;" valign="top"><p>May 18, 2016</p></td>
</tr>

With output like this:

Code:
col1,col2,col3,col4,col5,col6

A few problems.
  1. This is HTML, not XML, with the usual HTML problems: &entities; and |visual garbage|. Luckily there is no javascript or other <!-- special garbage -->.
  2. Tables are used visually, not logically. The columns he wants aren't exactly HTML columns but parts of them.
  3. Some columns need several separate CDATA segments strung together.
  4. Others need arguments, not CDATA.
  5. Some rows are completely blank.
  6. Columns must be reorganized to place links at the end.

Extracting tabular data usually means appending each bit of CDATA to DATA[COLUMN], doing COLUMN++ whenever you hit a TD, and doing COLUMN=0 whenever you hit a TR. I shoehorn that one funny date into a column by counting <EM> as a column too, but that still leaves the problem of moving all the links to the end.
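
The counting idea on its own, as a rough standalone sketch. It does not use yanx.awk, so the tag name is derived by hand, and table_rows, COL and the "|" separator are just names picked for the demo. It is also sloppier than the real thing: it appends any non-blank text seen after the first TD of a row rather than checking that we are still inside a TD.

```shell
# Standalone sketch of the column-counting idea; yanx.awk is not loaded,
# so the tag name is worked out by hand from the first field.
table_rows() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR == 1 { next }
         { tag = toupper($1); sub(/[ \t\r\n].*/, "", tag) }   # drop attributes
         tag == "TR" { COL = 0 }                  # new row: back to column zero
         tag == "TD" { COL++ }                    # new cell: next column
         COL && $2 !~ /^[ \t\r\n]*$/ { DATA[COL] = DATA[COL] $2 }
         tag == "/TR" {                           # row finished: print and clear
             for (c = 1; c <= COL; c++) {
                 printf("%s%s", (c > 1 ? "|" : ""), DATA[c])
                 delete DATA[c]
             }
             if (COL) printf("\n")                # blank rows print nothing
             COL = 0
         }'
}

printf '%s' '<tr><td>a</td><td><b>b1</b> b2</td></tr><tr></tr><tr><td>c</td></tr>' | table_rows
# a|b1 b2
# c
```

Note how the two CDATA pieces inside the second cell get strung together into one column, and how the empty row is skipped because COL never left zero.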

I could do that in a matching loop, maybe, but instead I just have two arrays. Links are caught in the LINKS array, leaving DATA and COLUMN alone. Finding a link in this data is as easy as looking for the HREF argument. Note that the ARGS array accumulates stuff: yanx only clears it when the next tag opens, so if you don't delete its contents yourself, you could start counting duplicates.
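
The two-array idea in miniature, as a hedged sketch. Since yanx.awk's ARGS is not available in a standalone snippet, the href is pulled out with match() instead, and only lowercase double-quoted href="..." is recognized; links_last, TEXT, LINKS and LINK are illustrative names.

```shell
# Standalone sketch: ordinary text goes into TEXT, link filenames go into
# LINKS, and the links are printed last no matter where they appeared.
links_last() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         NR == 1 { next }
         match($1, /href="[^"]*"/) {
             url = substr($1, RSTART + 6, RLENGTH - 7)   # text between the quotes
             sub(/.*\//, "", url)                        # keep only the filename
             LINKS[++LINK] = url
             next                                        # skip the link text itself
         }
         $2 != "" { TEXT = TEXT $2 }
         END {
             printf("%s", TEXT)
             for (i = 1; i <= LINK; i++) printf(",%s", LINKS[i])
             printf("\n")
         }'
}

printf '%s' '<p>Report<a href="http://www.mydomain.com/folder/pdf/abcd.pdf">Summary</a> done</p>' | links_last
# Report done,abcd.pdf
```

The next after storing the link is what keeps "Summary" out of the text column.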

Printing CSV mostly means being paranoid about escaping. Every quote and comma in the text must become \" or \, respectively. awk is silly about escapes and requires \\\\ in a gsub() replacement string to mean a real, physical backslash. I wonder if it's shorter to just use octal.
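
The escaping step by itself, as a small standalone sketch (csv_field is a made-up name). The four backslashes shrink to two when awk reads the string literal, and gsub() then reads those two as one literal backslash followed by the matched text.

```shell
# Standalone sketch of the escaping step: put a backslash in front of every
# comma and double quote, then wrap the whole field in double quotes.
csv_field() {
    awk '{ gsub(/[,"]/, "\\\\&"); printf("\"%s\"\n", $0) }'
}

printf '%s\n' 'May 18, 2016' | csv_field
# "May 18\, 2016"
```

Whether backslash escapes are understood depends on whatever reads the CSV afterwards; many readers expect embedded quotes to be doubled instead, so check the consumer before relying on this style.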

Anyway, the program:



Code:
BEGIN {
# Each csv line will take several prints.  Blank ORS and OFS to handle processing ourselves
ORS="" ; OFS=""
# These are indexes for the DATA and LINKS arrays it accumulates data in.
LINK=1; COL=0 }

# Prints OFS " S " then sets OFS=","
# In effect this prints a single column of CSV and sets it up to prepend
# a comma to the next column.  It quotes and escapes S for you.
function csv(S) {
        # Escape all quotes and commas before printing.
        # Might be overkill but better safe than sorry.
        gsub(/[,"]/, "\\\\&", S);
        printf("%s\"%s\"", OFS, S); OFS=","
}

# This prints a full line of collected data.
# It tells when by looking for a <TR> tag or a </TABLE> tag.
# It also checks that COL is nonzero to avoid printing blank rows.
(TAG=="TR" || CTAG=="TABLE") && COL {
        OFS="" # Begin new line of CSV

        # Print current row if any, one column at a time.
        # csv() will intelligently escape and wrap stuff for us.
        # Delete contents of arrays so duplicates aren't printed next time.
        for(C=1; C<=COL; C++) { csv(DATA[C]);   delete DATA[C]; }
        for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
        printf("\n"); # csv does not print newline, we do
        COL=0; LINK=1 # Reset indexes for arrays
}

# Count columns in table.  This moves the index of DATA[] whenever it
# encounters a <TD> tag or an <EM> tag.
TAG=="TD" || TAG=="EM" { COL++ }

# Clean up HTML garbage.  All whitespace, &nbsp;, and | are converted to 
# a single space each.
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }

# Collect attachments when found by looking for HREF arguments
# whenever we're nested inside a TABLE tag, however deep.
TAGS ~ /TABLE/ && ARGS["HREF"] {
        # Delete everything up to and including last slash in URL.
        # This leaves nothing but the filename.
        sub(/.*\//, "", ARGS["HREF"]);

        # They go into LINKS, not DATA, as we must print them last.
        LINKS[LINK++]=ARGS["HREF"];
        # ARGS does not clear itself, delete to prevent duplicates
        delete ARGS["HREF"];
        # Skip to next tag, prevents below code from appending CDATA to DATA
        next 
}

# When we find CDATA that isn't blank, and are inside a TD tag anywhere,
# append CDATA i.e. column two, to DATA[COL].
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }

Use it like



Code:
$ awk -f yanx.awk -f html.awk input.html

"AA Number. 3-456","The quick brown fox jumps over the lazy dog etc...","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog etc...","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"

$


Last edited by Corona688; 05-24-2016 at 12:49 PM..