Hi!
I have a bunch of HTML files that I want to convert to CSV files. Every page has a table in it, and I need to parse each row into a CSV record.
With awk and sed, I managed to put every table row on a separate line. So my file looks like this:
HTML Code:
<TR> .... </TR>
<TR> .... </TR>
...
One line looks like this:
HTML Code:
<TR><A NAME="1,1"><TD CLASS="small" WIDTH="30" ALIGN="right" VALIGN="top">1,1</TD><TD WIDTH="380" ALIGN="left" VALIGN="top">
<FONT COLOR="black">Here is a text part</FONT></TD>
<TD BGCOLOR="green" WIDTH="1px"></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD>
<TD CLASS="small" ALIGN="left" VALIGN="top">
<A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=121&v=2&SID=...'>Textlink1</A>; <A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=146&v=6-8&SID=...'>Textlink2</A></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD><TD BGCOLOR="white" WIDTH="1px"></TD><TD CLASS="small" ALIGN="left" VALIGN="top"></TD></TR>
I need this information from each row:
<A NAME="1,1">
Here is a text part
1,1,19,121,2
1,1,19,146,6-8
So the fields are:
name(1),name(2),between font tags,atarget1,atarget2...atargetN
That is, each CSV record should look like:
NUMBER,NUMBER,TEXTPART,LINK1,LINK2,...,LINKN
where LINKi is like:
from(1),from(2),b,ch,v
There can be zero or more links per row. I don't know the maximum.
Can you help me extract this information? I can find these parts with regexps, but I don't know how to capture them into variables and how to do it for every line. The number of links is unknown, but that's fine, I can parse the CSV afterwards.
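To make the target format concrete, here is a rough sketch in Python of the per-row mapping I have in mind. It assumes each row is already on a single line, as described above. The names row_to_record and param are made up for illustration, and the regular expressions only assume the attribute layout of the sample row, so a real HTML parser would probably be more robust.

```python
import csv
import re
import sys
from html import unescape

# Patterns based only on the sample row; other pages may differ.
NAME_RE = re.compile(r'<A\s+NAME="(\d+),(\d+)"', re.I)
FONT_RE = re.compile(r'<FONT[^>]*>(.*?)</FONT>', re.I | re.S)
HREF_RE = re.compile(r"HREF='([^']*)'", re.I)


def param(url, key):
    """Return the value of one query parameter, or '' if it is absent."""
    m = re.search(r'(?:^|[?&])' + re.escape(key) + r'=([^&]*)', url)
    return m.group(1) if m else ''


def row_to_record(row):
    """Turn one <TR>...</TR> line into a list of CSV fields, or None."""
    name = NAME_RE.search(row)
    if name is None:          # not a data row (no <A NAME="x,y">)
        return None
    text = FONT_RE.search(row)
    fields = [name.group(1), name.group(2),
              unescape(text.group(1)).strip() if text else '']
    # Zero or more links: each adds from(1), from(2), b, ch, v.
    for href in HREF_RE.findall(row):
        fields += param(href, 'from').split(',')
        fields += [param(href, 'b'), param(href, 'ch'), param(href, 'v')]
    return fields


# Shortened copy of the sample row (empty spacer cells left out).
SAMPLE = (
    '<TR><A NAME="1,1"><TD CLASS="small" WIDTH="30">1,1</TD>'
    '<TD WIDTH="380"><FONT COLOR="black">Here is a text part</FONT></TD>'
    '<TD CLASS="small">'
    "<A TARGET='index' CLASS='small' "
    "HREF='target.php?newtab=1&from=1,1&b=19&ch=121&v=2&SID=...'>Textlink1</A>; "
    "<A TARGET='index' CLASS='small' "
    "HREF='target.php?newtab=1&from=1,1&b=19&ch=146&v=6-8&SID=...'>Textlink2</A>"
    '</TD></TR>'
)

# csv.writer quotes the text field if it ever contains a comma.
csv.writer(sys.stdout).writerow(row_to_record(SAMPLE))
# prints: 1,1,Here is a text part,1,1,19,121,2,1,1,19,146,6-8
```

For a whole file, the same function could be called once per input line, skipping the lines where it returns None.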
Thx,
Andras