09-20-2005
How do I extract only the text from an HTML file, without the HTML tags?
I have an HTML file called myfile.html. If I simply run "cat myfile.html" in UNIX, it shows all the HTML tags, like <a href=r/26><img src="http://www>. But I want to extract only the text part.
The same problem happens with the "type" command in MS-DOS.
I know you can do it by opening the file in Internet Explorer, choosing "save as text", and then opening the result. But I need to do this from UNIX, as I have thousands of HTML files and no time to convert them to text files one by one. I went through many books but can't find a way. I would really appreciate your help.
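One possible approach, sketched here in Python using only the standard library: parse each file with html.parser, keep the text nodes, and write a .txt file next to every .html file. The names TextExtractor, html_to_text and convert_directory are made up for this sketch, and it assumes the files are UTF-8 (undecodable bytes are replaced) and that the contents of <script> and <style> should be dropped. Treat it as a starting point, not a full HTML-to-text renderer.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace on each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def convert_directory(directory):
    """Write X.txt next to every X.html in directory; return how many were converted."""
    count = 0
    for path in sorted(Path(directory).glob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        path.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

Calling convert_directory(".") from the directory that holds the files would convert all of them in one pass. Using a parser rather than a regular expression that deletes <...> is a deliberate choice: tags can span several lines and attribute values can themselves contain ">", which pattern-based stripping tends to get wrong.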
LEARN ABOUT PHP
tidy_get_html
TIDY_GET_HTML(3) 1 TIDY_GET_HTML(3)
tidy::html - Returns a tidyNode object starting from the <html> tag of the tidy parse tree
SYNOPSIS
Object oriented style
tidyNode tidy::html (void )
Procedural style
tidyNode tidy_get_html (tidy $object)
DESCRIPTION
Returns a tidyNode object starting from the <html> tag of the tidy parse tree.
PARAMETERS
o $object
- The Tidy object.
RETURN VALUES
Returns the tidyNode object.
EXAMPLES
Example #1
tidy_get_html(3) example
<?php
$html = '
<html>
<head>
<title>test</title>
</head>
<body>
<p>paragraph</p>
</body>
</html>';
$tidy = tidy_parse_string($html);
$html = $tidy->html();
echo $html->value;
?>
The above example will output:
<html>
<head>
<title>test</title>
</head>
<body>
<p>paragraph</p>
</body>
</html>
NOTES
Note
This function is only available with Zend Engine 2 (PHP >= 5.0.0).
SEE ALSO
tidy.body(3), tidy.head(3).
PHP Documentation Group TIDY_GET_HTML(3)