Rather than write one from scratch I'll crib from lxdorney's thread last week. He had complicated HTML tables he needed to extract into CSV. yanx.awk grew a new feature to deal with this, the CTAG variable, which tells your program when tags close!
Input:
With output like this:
A few problems.
This is HTML, not XML, with the usual HTML problems: &entities; and |visual garbage|. Luckily no JavaScript or other <!-- special garbage -->.
Tables are used visually, not logically. The columns he wants aren't exactly HTML columns but parts of them.
Some need several separate CDATA segments strung together.
Others need arguments, not CDATA.
Some rows are completely blank.
Columns must be reorganized to place links at the end.
Extracting tabular data usually means appending each bit of CDATA into DATA[COLUMN], doing COLUMN++ whenever you hit a TD, and doing COLUMN=0 whenever you hit a TR. I shoehorn that one funny date into a column by counting <EM> as a column too, but that still leaves the problem of moving all the links to the end.
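To make the counting scheme concrete, here is a minimal sketch of it. It does not use yanx.awk: a crude RS="<" / FS=">" splitter stands in for it, so it will choke on anything yanx.awk handles properly (a ">" inside text, comments, and so on). DATA and COLUMN are the names from the paragraph above; table_to_rows, in_row, and the "|" separator are made up for the illustration.

```shell
# Sketch only: a stand-in tag splitter, not yanx.awk itself.
# table_to_rows is a hypothetical helper name.
table_to_rows() {
    awk '
    BEGIN { RS = "<"; FS = ">" }   # one record per tag: $1 = tag, $2 = text after it
    {
        split($1, word, /[ \t\r\n]+/)
        tag = toupper(word[1])                 # tag name without its arguments

        if (tag == "TR") {                     # new row: start over
            for (i in DATA) delete DATA[i]
            COLUMN = 0; in_row = 1
        } else if (tag == "/TR" && in_row) {   # row closed: print unless blank
            line = ""; blank = 1
            for (i = 1; i <= COLUMN; i++) {
                if (DATA[i] != "") blank = 0
                line = line (i > 1 ? "|" : "") DATA[i]
            }
            if (!blank) print line
            in_row = 0
        } else if (in_row && (tag == "TD" || tag == "EM")) {
            COLUMN++                           # <EM> counts as a column too
        }

        # Append the text that follows this tag to the current column.
        text = $2
        gsub(/[ \t\r\n]+/, " ", text)
        sub(/^ /, "", text); sub(/ $/, "", text)
        if (in_row && COLUMN > 0 && text != "")
            DATA[COLUMN] = (DATA[COLUMN] == "" ? "" : DATA[COLUMN] " ") text
    }'
}
```

Feed it HTML on stdin, e.g. `table_to_rows < page.html` (page.html being whatever file you have). Rows whose cells are all empty are skipped, which covers the completely blank rows.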
I could do that in a matching loop, maybe, but instead I just have two arrays. Links are caught in the LINK array, leaving DATA and COLUMN alone. Finding a link in this data is as easy as looking for the HREF argument. Note that the ARGS array accumulates stuff -- if you don't delete its contents yourself, you could start counting duplicates.
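A rough sketch of the two-array idea, again with a stand-in splitter instead of yanx.awk. The ARGS array here is filled by a simple match() loop that only understands name="value" pairs; that parser, the NLINK counter, and the row_links_last name are assumptions for the example, not how yanx.awk does it.

```shell
# Sketch only: hypothetical helper, crude attribute parsing.
row_links_last() {
    awk '
    BEGIN { RS = "<"; FS = ">" }   # $1 = tag and its arguments, $2 = text after it
    {
        split($1, word, /[ \t\r\n]+/)
        tag = toupper(word[1])

        # Empty ARGS before each tag. Without this, the HREF from <a ...>
        # would still be there at </a> and the link would be caught twice.
        for (k in ARGS) delete ARGS[k]
        rest = $1
        while (match(rest, /[A-Za-z_:-]+="[^"]*"/)) {
            pair = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            eq   = index(pair, "=")
            ARGS[toupper(substr(pair, 1, eq - 1))] = substr(pair, eq + 2, length(pair) - eq - 2)
        }

        if (tag == "TR") {
            for (k in DATA) delete DATA[k]
            for (k in LINK) delete LINK[k]
            COLUMN = 0; NLINK = 0
        } else if (tag == "TD") {
            COLUMN++
        } else if (tag == "/TR") {
            line = ""
            for (i = 1; i <= COLUMN; i++) line = line (i > 1 ? "|" : "") DATA[i]
            for (i = 1; i <= NLINK;  i++) line = line "|" LINK[i]   # links go last
            print line
            COLUMN = 0
        }

        # A link goes into LINK only; DATA and COLUMN are left alone.
        if ("HREF" in ARGS) LINK[++NLINK] = ARGS["HREF"]

        text = $2
        gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", text)
        if (COLUMN > 0 && text != "") DATA[COLUMN] = DATA[COLUMN] text
    }'
}
```

Because the links never touch COLUMN, the visible text stays in its original column order and the URLs are simply tacked on when the row closes.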
Printing CSV mostly means being paranoid about escaping. Every quote and comma in the text must be escaped with a backslash (\', etc.). awk is silly about escapes and requires \\\\ to mean a real, physical backslash in a quoted string. I wonder if it's shorter to just use octal.
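One way to dodge the backslash counting entirely is to skip gsub and build the escaped string a character at a time. This is only a sketch of that alternative, not the approach the program takes; csv_escape is a made-up name, and escaping the backslash itself is my assumption about what "etc." should cover. Whether the program reading the CSV accepts backslash escapes at all is also an assumption.

```shell
# Sketch only: hypothetical helper, one input line in, one escaped line out.
csv_escape() {
    awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            # "\\" in an ordinary awk string is a single backslash;
            # no replacement-string rules are involved here.
            if (c == "\"" || c == "," || c == "\\") out = out "\\"
            out = out c
        }
        print out
    }'
}
```

It is slower than a single gsub on long lines, but the only escape sequences in it are the plain string-literal ones, so there is nothing to miscount.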
Anyway, the program:
Use it like
Last edited by Corona688; 05-24-2016 at 01:49 PM..