"General Purpose XML Processing"

Rather than write one from scratch I'll crib from lxdorney's thread last week. He had complicated HTML tables he needed to extract into CSV. yanx.awk grew a new feature to deal with this, the CTAG variable, which tells your program when tags close!


								<table border="1" cellpadding="3" cellspacing="0">

<td nowrap="nowrap" valign="top"><p><strong>AA Number. 3-456</strong></p></td>
<td style="text-align: justify;" valign="top">
<p>The quick brown fox jumps over the lazy dog near the bank of the 
river. The quick brown fox jumps over the lazy dog near the bank of the 
river.<br><em>(Hello World May 20, 2016)</em>&nbsp;<br>
<a href="http://www.mydomain.com/folder/pdf/abcd.pdf">Summary</a>&nbsp;|&nbsp;<a href="http://www.mydomain.com/folder/pdf/abcfull.pdf">Full Story</a></p>
<td style="text-align: center;" valign="top"><p>May 18, 2016</p></td>

The goal is CSV output, one line per table row, like the sample at the bottom of this post.

A few problems.
  1. This is HTML, not XML, with the usual HTML problems: &entities; and |visual garbage|. Luckily no javascript or other <!-- special garbage -->.
  2. Tables are used visually, not logically. The columns he wants aren't exactly HTML columns but parts of them.
  3. Some columns need several separate CDATA segments strung together.
  4. Others need tag arguments, not CDATA.
  5. Some rows are completely blank.
  6. Columns must be reorganized to place links at the end.

Extracting tabular data usually means appending each bit of CDATA into DATA[COLUMN], doing COLUMN++ whenever you hit a TD, and doing COLUMN=0 whenever you hit a TR. I shoehorn that one funny date into a column by counting <EM> as a column too, but that still leaves the problem of moving all the links to the end.
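
Here's that counting idea as a rough sketch, with yanx.awk left out of the picture. I'm assuming a made-up token stream with one tag per line (tag name first, then its text). emit_row(), the ';' separator and the sample rows are all invented for illustration; the real program further down works on yanx.awk's TAG and $2 instead.

```shell
columns_demo() {
  printf '%s\n' 'TR' 'TD AA Number. 3-456' 'TD The quick brown fox' \
                'EM (Hello World)' 'TD May 18' 'TR' 'TD BB Number. 7-890' |
  awk '
  # Hypothetical input: one token per line, tag name first, then its text.

  # Print the collected row (if any), then reset the column index.
  function emit_row(   c, sep) {
      if (!COL) return
      sep = ""
      for (c = 1; c <= COL; c++) {
          printf("%s%s", sep, DATA[c])
          delete DATA[c]
          sep = ";"
      }
      printf("\n")
      COL = 0
  }

  # Remember the tag, then strip it so $0 holds only the text.
  { tag = $1; sub(/^[^ ]+ ?/, "") }

  tag == "TR"                { emit_row() }   # new row: flush the old one
  tag == "TD" || tag == "EM" { COL++ }        # EM counts as a column too
  $0 != ""                   { DATA[COL] = DATA[COL] $0 }

  END { emit_row() }                          # flush the last row
  '
}
columns_demo
```

Run as-is it should print two lines, the first with four ';'-separated columns and the second with one.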

I could do that in a matching loop, maybe, but instead I just have two arrays. Links are caught in the LINKS array, leaving DATA and COL alone. Finding a link in this data is as easy as looking for the HREF argument. Note that the ARGS array accumulates stuff -- if you don't delete its contents yourself, you could start counting duplicates.
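
A minimal sketch of the two-array trick, again on an invented input format rather than real yanx.awk tokens: lines starting with HREF stand in for a tag carrying an HREF argument, and lines starting with TEXT stand in for ordinary CDATA. NDATA and NLINKS are just names I picked for the counters.

```shell
links_demo() {
  printf '%s\n' 'TEXT AA Number' \
                'HREF http://www.mydomain.com/folder/pdf/abcd.pdf' \
                'TEXT May 18' \
                'HREF http://www.mydomain.com/folder/pdf/abcfull.pdf' |
  awk '
  # Links go into their own array so they can be printed last.
  $1 == "HREF" {
      url = $2
      sub(/.*\//, "", url)        # drop everything up to the last slash
      LINKS[++NLINKS] = url
      next                        # keep link lines out of DATA
  }

  # Everything else is ordinary column data.
  { sub(/^TEXT /, ""); DATA[++NDATA] = $0 }

  END {
      for (i = 1; i <= NDATA; i++)  print DATA[i]
      for (i = 1; i <= NLINKS; i++) print LINKS[i]
  }'
}
links_demo
```

The links arrive interleaved with the text but come out after it, which is all the second array is for.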

Printing CSV mostly means being paranoid about escaping. Every quote and comma in the text gets a backslash stuck in front of it, so " becomes \" and , becomes \,. awk is silly about escapes and requires \\\\ in a quoted gsub() replacement string to mean a real, physical backslash. I wonder if it's shorter to just use octal.
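
To see the escaping on its own, here is a small standalone sketch that wraps the same csv() function used in the program below and feeds it two made-up strings. The csv_demo wrapper and the sample strings are mine, and it assumes an awk that follows the POSIX gsub() rules, where "\\\\&" means a backslash followed by the matched text.

```shell
csv_demo() {
  awk '
  # Quote S, backslash-escape quotes and commas, and prepend OFS.
  function csv(S) {
      gsub(/[,"]/, "\\\\&", S)
      printf("%s\"%s\"", OFS, S); OFS=","
  }
  BEGIN {
      ORS = ""; OFS = ""          # first column gets no leading comma
      csv("May 20, 2016")
      csv("say \"hi\"")
      printf("\n")
  }'
}
csv_demo
```

With such an awk this should print "May 20\, 2016","say \"hi\"" on one line.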

Anyway, the program:

BEGIN {
        # Each CSV line will take several prints.  Blank ORS and OFS so we handle separators ourselves.
        ORS="" ; OFS=""
        # These are indexes for the DATA and LINKS arrays it accumulates data in.
        LINK=1; COL=0
}

# Prints OFS " S " then sets OFS=","
# In effect this prints a single column of CSV and sets it up to prepend
# a comma to the next column.  It quotes and escapes S for you.
function csv(S) {
        # Escape all quotes and commas before printing.
        # Might be overkill but better safe than sorry.
        gsub(/[,"]/, "\\\\&", S);
        printf("%s\"%s\"", OFS, S); OFS=","
}

# This prints a full line of collected data.
# It tells when by looking for a <TR> tag or a </TABLE> tag.
# It also checks that COL is nonzero to avoid printing blank rows.
(TAG=="TR" || CTAG=="TABLE") && COL {
        OFS="" # Begin new line of CSV

        # Print current row if any, one column at a time.
        # csv() will intelligently escape and wrap stuff for us.
        # Delete contents of arrays so duplicates aren't printed next time.
        for(C=1; C<=COL; C++) { csv(DATA[C]);   delete DATA[C]; }
        for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
        printf("\n"); # csv does not print newline, we do
        COL=0; LINK=1 # Reset indexes for arrays
}

# Count columns in table.  This moves the index of DATA[] whenever it
# encounters a <TD> tag or an <EM> tag.
TAG=="TD" || TAG=="EM" { COL++ }

# Clean up HTML garbage.  All whitespace, &nbsp;, and | are converted to 
# a single space each.
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }

# Collect attachments when found by looking for HREF arguments
# whenever we're nested inside a TABLE tag, however deep.
TAGS ~ /(^|%)TABLE%/ && ("HREF" in ARGS) {
        # Delete everything up to and including last slash in URL.
        # This leaves nothing but the filename.
        sub(/.*[/]/, "", ARGS["HREF"]);

        # They go into LINKS, not DATA, as we must print them last.
        LINKS[LINK++] = ARGS["HREF"];
        # ARGS does not clear itself, delete to prevent duplicates
        delete ARGS["HREF"];
        # Skip to next tag, prevents below code from appending CDATA to DATA
        next
}

# When we find CDATA that isn't blank, and are inside a TD tag anywhere,
# append CDATA i.e. column two, to DATA[COL].
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }
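
The cleanup rule is easy to try by itself. This sketch runs the same regex against a made-up line instead of yanx.awk's $2 (cleanup_demo and the sample text are assumptions for illustration). Note "a single space each" really does mean each: three pieces of garbage in a row leave three spaces behind.

```shell
cleanup_demo() {
  printf 'Summary&nbsp;|&nbsp;Full   Story\n' |
  awk '
  # Each |, each &nbsp;, and each run of whitespace becomes one space.
  { gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " "); print }'
}
cleanup_demo
```

This should print Summary and Full separated by three spaces, then a single space before Story.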

Use it like

$ awk -f yanx.awk -f html.awk input.html

"AA Number. 3-456","The quick brown fox jumps over the lazy dog etc...","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog etc...","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"


