Shell Programming and Scripting, post 302368847 by senszey, Thursday 5th of November 2009, 07:48:30 PM
Parse HTML tag parameters and text

Hi!

I have a bunch of HTML files that I want to convert to CSV files. Every page has a table in it, and I need to turn each table row into one CSV record.

With awk and sed, I managed to put each table row on its own line. So my file looks like this:

HTML Code:
<TR> .... </TR>
<TR> .... </TR>
...
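
For reference, the row-per-line step could look roughly like the sketch below. The function name join_rows is made up for illustration, and the sketch assumes uppercase <TR> and </TR> tags as in the sample, with no nested tables. It concatenates the input lines and prints one line each time a closing </TR> is seen.

```shell
# join_rows: hypothetical helper, reads HTML on stdin and prints one
# <TR>...</TR> row per output line. Assumes uppercase tags, no nested tables.
join_rows() {
    awk '
    {
        buf = buf $0                          # append this line, newline dropped
        while ((i = index(buf, "</TR>")) > 0) {
            row = substr(buf, 1, i + 4)       # everything up to and including </TR>
            start = index(row, "<TR")         # skip anything before the opening tag
            if (start > 0) print substr(row, start)
            buf = substr(buf, i + 5)          # keep the remainder for the next row
        }
    }'
}
```

It would be called as join_rows < page.html > rows.txt, where both file names are placeholders.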
One line looks like this:
HTML Code:
<TR><A NAME="1,1"><TD CLASS="small" WIDTH="30" ALIGN="right" VALIGN="top">1,1</TD><TD WIDTH="380" ALIGN="left" VALIGN="top">
<FONT COLOR="black">Here is a text part</FONT></TD>
    <TD BGCOLOR="green" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD BGCOLOR="white" WIDTH="1px"></TD>
    <TD CLASS="small" ALIGN="left" VALIGN="top">
    <A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=121&v=2&SID=...'>Textlink1</A>; <A TARGET='index' CLASS='small' HREF='target.php?newtab=1&from=1,1&b=19&ch=146&v=6-8&SID=...'>Textlink2</A></TD>
<TD BGCOLOR="white" WIDTH="1px"></TD><TD BGCOLOR="white" WIDTH="1px"></TD><TD CLASS="small" ALIGN="left" VALIGN="top"></TD></TR>
I need this information:
<A NAME="1,1">
Here is a text part
1,1,19,121,2
1,1,19,146,6-8


name(1),name(2),between font tags,atarget1,atarget2...atargetN
NUMBER,NUMBER,TEXTPART,LINK1,LINK2,...,LINKN
where LINKi is like:
from(1),from(2),b,ch,v

There can be zero or more links per row. I don't know the maximum.

Can you help me extract this information? I can find these parts with regular expressions, but I don't know how to store them in variables or how to do it for every line. The number of links is unknown, but that's fine; I can parse the CSV afterwards.

Thx,


Andras
 
