Here is a workaround for handling arbitrary XML in shell, my yanx.awk library:
Code:
BEGIN {
    FS=">"; OFS=">";
    RS="<"; ORS="<"
}
# After match("qwertyuiop", /rty/)
#     rbefore("qwertyuiop") is "qwe",
#     rmid("qwertyuiop")    is "r"
#     rall("qwertyuiop")    is "rty"
#     rafter("qwertyuiop")  is "uiop"
function rbefore(STR) { return(substr(STR, 1, RSTART-1)); }      # before match
function rmid(STR)    { return(substr(STR, RSTART, 1)); }        # first char of match
function rall(STR)    { return(substr(STR, RSTART, RLENGTH)); }  # entire match
function rafter(STR)  { return(substr(STR, RSTART+RLENGTH)); }   # after match
function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
    if(OUT)
    {
        if(PFIX) PFIX=PFIX":"
        split(OUT, TA, SUBSEP);
        A[toupper(PFIX) toupper(TA[1])]=TA[2];
    }
    return("");
}
# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
    # \r is included so the carriage-return branch below can actually match
    while(STR && match(STR, /([ \r\n\t]+)|[\x27\x22=]/))
    {
        OUT = OUT rbefore(STR);
        RMID=rmid(STR);
        if((RMID == "'") || (RMID == "\"")) # Quote characters
        {
            if(!Q) Q=RMID; # Begin quote section
            else if(Q == RMID) Q=""; # End quote section
            else OUT = OUT RMID; # Quoted quote
        } else if(RMID == "=") {
            if(Q) OUT=OUT RMID; else OUT=OUT SUBSEP;
        } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
            if(Q) OUT = OUT rall(STR); # Literal quoted whitespace
            else OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
        }
        STR=rafter(STR); # Strip off the text we've processed already.
    }
    aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}
{ SPEC=0 ; TAG="" }
NR==1 { # The first "line" is blank when RS=<
    if(ORS == RS) print;
    next
}
/^[!?]/ { SPEC=1 } # XML specification junk
# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
    CTAG=""
    TAG=substr(toupper($1), RSTART, RLENGTH);
    if((!SPEC) && !($1 ~ /\/$/))
    {
        TAGS=TAG "%" TAGS;
        DEP++;
        LTAGS=TAGS
    }
    for(X in ARGS) delete ARGS[X];
    qsplit(rafter($1), ARGS, "", "", "");
}
# Handle close-tags
(!SPEC) && /^[\/]/ {
    sub(/^\//, "", $1);
    LTAGS=TAGS
    CTAG=toupper($1)
    # sub("^.*" toupper($1) "%", "", TAGS);
    sub("^" toupper($1) "%", "", TAGS);
    $1="/"$1
    DEP=split(TAGS, TA, "%")-1;
    if(DEP < 0) DEP=0;
}
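A minimal sketch of the record model the library relies on, assuming any POSIX awk and a made-up one-line input (the sample markup is only for illustration). With RS="<" and FS=">", every record starts at a tag: $1 holds the tag name plus its attributes, and $2 holds the text that follows the tag.

```shell
out=$(printf '<p class="x">Hello <b>world</b></p>\n' | awk '
BEGIN { RS = "<"; FS = ">" }   # same separators yanx.awk sets up
NR > 1 {                       # record 1 is the empty text before the first "<"
    sub(/\n$/, "", $2)         # drop the trailing newline, for display only
    printf("tag=[%s] text=[%s]\n", $1, $2)
}')
printf '%s\n' "$out"
```

Each output line is one record: an open or close tag in $1 and whatever text came after it in $2. That is why the rules in html.awk can test TAG and read $2 directly.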
And here is how you use it, html.awk:
Code:
BEGIN { ORS="" ; OFS="" ; LINK=1; COL=0 }
# Print a column of data in CSV format
function csv(S) {
    gsub(/[,"]/, "\\\\&", S); printf("%s\"%s\"", OFS, S); OFS=","
}
# When a table row starts, or the entire table ends, print row
(TAG=="TR" || CTAG=="TABLE") && COL {
    OFS=""
    # Print current row if any
    for(C=1; C<=COL; C++) { csv(DATA[C]); delete DATA[C]; }
    for(C=1; C in LINKS; C++) { csv(LINKS[C]); delete LINKS[C]; }
    printf("\n");
    COL=0; LINK=1 # Reset indexes for arrays
}
# Count columns in the table. Count EM as a column too, to separate the date from the comment
TAG=="TD" || TAG=="EM" { COL++ }
# Clean up HTML garbage
{ gsub(/([|])|([ \r\n\t]+)|(&nbsp;)/, " ", $2); }
# Collect attachments when found
TAGS ~ /TABLE/ && ARGS["HREF"] {
    sub(/.*[/]/, "", ARGS["HREF"]);
    LINKS[LINK++]=ARGS["HREF"];
    delete ARGS["HREF"];
    next # Skip to next tag, we don't want the link title
}
# Append text to the current row and col
TAGS ~ /(^|%)TD%/ && !($2 ~ /^[ \r\n\t]+$/) {DATA[COL] = DATA[COL] $2 }
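As a small illustration of the csv() helper on its own, here is a sketch that copies the function and feeds it two hand-made values (the values are sample data, not parsed from any file). It shows the two tricks involved: commas and double quotes get a backslash in front of them, and OFS starts out empty and becomes "," after the first column, so separators appear only between columns.

```shell
out=$(awk '
# Same escaping and separator trick as csv() in html.awk
function csv(S) {
    gsub(/[,"]/, "\\\\&", S); printf("%s\"%s\"", OFS, S); OFS=","
}
BEGIN {
    OFS = ""                      # no separator before the first column
    csv("May 18, 2016"); csv("abcd.pdf")
    printf("\n")
}')
printf '%s\n' "$out"
```

Resetting OFS to "" at the start of each row, as the TR/TABLE rule does, is what keeps every output line from starting with a comma.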
And here is how you run it:
Code:
$ awk -f yanx.awk -f html.awk input.html
"AA Number. 3-456","The quick brown fox jumps over the lazy dog near the bank of the river. The quick brown fox jumps over the lazy dog near the bank of the river.","(Hello World May 20\, 2016)","May 18\, 2016","abcd.pdf","abcfull.pdf"
"BB Number. 7-890","The quick brown fox jumps over the lazy dog near the bank of the river1.The quick brown fox jumps over the lazy dog near the bank of the river2.","(Lord of the rings May 30\, 2016)","May 28\, 2016","efghi.pdf","efghifull.pdf","efghisum.pdf"
$