"General Purpose XML Processing"

Post #302973957 by Corona688 on Tuesday 24th of May 2016 12:10:29 PM

Rather than write one from scratch I'll crib from lxdorney's thread last week. He had complicated HTML tables he needed to extract into CSV. yanx.awk grew a new feature to deal with this, the CTAG variable, which tells your program when tags close!

Input:
Code:
...

<table border="1" cellpadding="3" cellspacing="0">
<tbody>
<tr>



</tr>
<tr>
<td nowrap="nowrap" valign="top"><p><strong>AA Number. 3-456</strong></p></td>
<td style="text-align: justify;" valign="top">
<p>The quick brown fox jumps over the lazy dog near the bank of the 
river. The quick brown fox jumps over the lazy dog near the bank of the 
river.<br><em>(Hello World May 20, 2016)</em>&nbsp;<br>
<a href="http://www.mydomain.com/folder/pdf/abcd.pdf">Summary</a>&nbsp;|&nbsp;<a href="http://www.mydomain.com/folder/pdf/abcfull.pdf">Full Story</a></p>
</td>
<td style="text-align: center;" valign="top"><p>May 18, 2016</p></td>
</tr>

With output like this:
Code:
col1,col2,col3,col4,col5,col6

A few problems.
  1. This is HTML, not XML, with the usual HTML problems: &entities; and |visual garbage|. Luckily no javascript or other <!-- special garbage -->.
  2. Tables are used visually, not logically. The columns he wants aren't exactly HTML columns but parts of them.
  3. Some columns need several separate CDATA segments strung together.
  4. Others need arguments, not CDATA.
  5. Some rows are completely blank.
  6. Columns must be reorganized to place links at the end.

Extracting tabular data usually means appending each bit of CDATA into DATA[COLUMN], doing COLUMN++ whenever you hit a TD, and doing COLUMN=0 whenever you hit a TR. I shoehorn that one funny date into a column of its own by counting <EM> as a column too, but that still leaves the problem of moving all the links to the end.
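
For anyone who wants to try the counting scheme without yanx.awk, here is a minimal sketch. It is not yanx.awk: it assumes that splitting records on "<" and fields on ">" is enough to put the tag in $1 and the text after it in $2, which only holds for tidy input. The names extract_cells, emit_row, col, intd and data are made up for the example, and it joins columns with ";" instead of producing real CSV.

```shell
# Hypothetical stand-in for the TD/TR counting idea; not part of yanx.awk.
extract_cells() {
    awk '
    BEGIN { RS = "<"; FS = ">" }        # assumption: $1 = tag, $2 = text after it

    # Print the collected row, if any, then reset the column counter.
    function emit_row(   c) {
        if (!col) return                # skip completely blank rows
        for (c = 1; c <= col; c++) {
            printf("%s%s", (c > 1 ? ";" : ""), data[c])
            delete data[c]              # do not leak cells into the next row
        }
        printf("\n")
        col = 0
    }

    # Reduce "td nowrap=..." to the bare upper-case tag name.
    { tag = toupper($1); sub(/[ \t\r\n].*/, "", tag) }

    tag == "TR" || tag == "/TABLE" { emit_row() }
    tag == "TD"  { col++; intd = 1 }
    tag == "/TD" { intd = 0 }

    # Append non-blank text to the current column while inside a TD.
    intd && $2 !~ /^[ \t\r\n]*$/ { data[col] = data[col] $2 }

    END { emit_row() }
    '
}
```

Feed it HTML on standard input, e.g. extract_cells < input.html. Nested tags such as <p> inside a cell simply keep appending to the same column, and the empty <tr></tr> row prints nothing.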

I could do that in a matching loop, maybe, but instead I just have two arrays. Links are caught in the LINKS array (indexed by LINK), leaving DATA and COL alone. Finding a link in this data is as easy as looking for the HREF argument. Note that the ARGS array accumulates stuff -- if you don't delete its contents yourself, you could start counting duplicates.
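
The two-array idea can be sketched on its own like this. Again this is a rough illustration and not how yanx.awk parses arguments: it assumes a lower-case, double-quoted href="..." inside an <a> tag, pulls it out with match() and substr(), and the names links_last, links, nlinks and text are invented for the example.

```shell
# Hypothetical sketch: keep links in their own array and print them last.
links_last() {
    awk '
    BEGIN { RS = "<"; FS = ">" }        # assumption: $1 = tag, $2 = text after it

    # An <a ...> tag with an href: store the filename, skip the link text.
    toupper($1) ~ /^A[ \t\r\n]/ && match($1, /href="[^"]*"/) {
        url = substr($1, RSTART + 6, RLENGTH - 7)   # value between the quotes
        sub(/.*\//, "", url)                        # drop everything up to the last slash
        links[++nlinks] = url
        next                                        # keeps the link text out of text
    }

    # Any other non-blank text is ordinary data.
    $2 !~ /^[ \t\r\n]*$/ { text = text $2 }

    END {
        printf("%s", text)
        for (i = 1; i <= nlinks; i++) printf(",%s", links[i])
        printf("\n")
    }'
}
```

Because the links live in their own array, the order they appear in the HTML does not matter; they always come out after the ordinary text.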

Printing CSV mostly means being paranoid about escaping. Every quote and comma in the text must be backslash-escaped, becoming \" and \, respectively. awk is silly about escapes and requires \\\\ in a quoted gsub() replacement string to mean one real, physical backslash. I wonder if it's shorter to just use octal.
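
To see the escaping in isolation, here is a small sketch of both spellings. The function names csv_escape and csv_escape_octal are made up for the example; the gsub() call in the first one is the same one the program uses, and the second assumes your awk turns the octal escape \134 into a backslash when it reads the string, which is standard behaviour.

```shell
# Hypothetical helpers: wrap each input line in quotes, escaping " and , with a backslash.
csv_escape() {
    awk '{
        s = $0
        gsub(/[,"]/, "\\\\&", s)        # "\\\\&" = one literal backslash, then the match
        print "\"" s "\""
    }'
}

# Same thing with octal: \134 is a backslash, so "\134\134&" is the same string.
csv_escape_octal() {
    awk '{
        s = $0
        gsub(/[,"]/, "\134\134&", s)
        print "\"" s "\""
    }'
}
```

For what it's worth, the octal version is not shorter, just differently ugly.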

Anyway, the program:

Code:
BEGIN {
# Each csv line will take several prints.  Blank ORS and OFS to handle separators ourselves
ORS="" ; OFS=""
# These are indexes for the DATA and LINKS arrays it accumulates data in.
LINK=1; COL=0 }

# Prints OFS " S " then sets OFS=","
# In effect this prints a single column of CSV and sets it up to prepend
# a comma to the next column.  It quotes and escapes S for you.
function csv(S) {
        # Escape all quotes and commas before printing.
        # Might be overkill but better safe than sorry.
        gsub(/[,"]/, "\\\\&", S);
        printf("%s\"%s\"", OFS, S); OFS=","
}

# This prints a full line of collected data.
# It tells when by looking for a <TR> tag or a </TABLE> tag.
# It also checks that COL is nonzero to avoid printing blank rows.
(TAG=="TR" || CTAG=="TABLE") && COL {
        OFS="" # Begin new line of CSV

        # Print current row if any, one column at a time.
        # csv() will intelligently escape and wrap stuff for us.
        # Delete contents of arrays so duplicates aren't printed next time.
        for(C=1; C<=COL; C++) { csv(DATA[C]);   delete DATA[C]; }
        for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
        printf("\n"); # csv does not print newline, we do
        COL=0; LINK=1 # Reset indexes for arrays
}

# Count columns in table.  This moves the index of DATA[] whenever it
# encounters a <TD> tag or an <EM> tag.
TAG=="TD" || TAG=="EM" { COL++ }

# Clean up HTML garbage.  All whitespace, &nbsp;, and | are converted to 
# a single space each.
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }

# Collect attachments when found by looking for HREF arguments
# whenever we're nested inside a TABLE tag, however deep.
TAGS ~ /TABLE/ && ARGS["HREF"] {
        # Delete everything up to and including last slash in URL.
        # This leaves nothing but the filename.
        sub(/.*[/]/, "", ARGS["HREF"]);

        # They go into LINKS, not DATA, as we must print them last.
        LINKS[LINK++]=ARGS["HREF"];
        # ARGS does not clear itself, delete to prevent duplicates
        delete ARGS["HREF"];
        # Skip to next tag, prevents below code from appending CDATA to DATA
        next 
}

# When we find CDATA that isn't blank, and are inside a TD tag anywhere,
# append CDATA i.e. column two, to DATA[COL].
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }

Use it like

Code:
$ awk -f yanx.awk -f html.awk input.html

"AA Number. 3-456","The quick brown fox jumps over the lazy dog etc...","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog etc...","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"

$


Last edited by Corona688; 05-24-2016 at 01:49 PM..