Full Discussion: Removing HTML tags
Forum: UNIX for Advanced & Expert Users. Post 302568241 by click on Wednesday 26th of October 2011 06:55:26 PM
Try html2text (a.k.a. html2txt, "THE ASCIINATOR"). On Linux it should be available in your distribution's packages, and it is also in the FreeBSD ports. Another option is:

Code:
lynx -dump
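
A minimal sketch of how the lynx route might be wrapped for use in a pipeline. The post only names `lynx -dump`; the `-stdin` and `-nolist` options and the helper name `html_to_text` are additions made here for illustration, so check them against the lynx man page on your system.

```shell
#!/bin/sh
# html_to_text is a hypothetical helper name, not part of lynx.
# It reads HTML on standard input and writes rendered text to standard output.
html_to_text() {
    # -dump   : render the document, print it to stdout and exit
    # -stdin  : take the document from standard input
    # -nolist : leave out the numbered list of links at the end
    lynx -dump -stdin -nolist
}
```

It could then be used as `html_to_text < page.html > page.txt`, where page.html is whatever file you saved.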

 

html2text(1)						      General Commands Manual						      html2text(1)

NAME
       html2text - an advanced HTML-to-text converter

SYNOPSIS
       html2text -help
       html2text -version
       html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ]
                 [ -rcfile path ] [ -style ( compact | pretty ) ] [ -width width ]
                 [ -o output-file ] [ -nobs ] [ -ascii | -utf8 ] [ -nometa ]
                 [ input-url ... ]

DESCRIPTION
       html2text reads HTML documents from the input-urls, formats each of them into a stream of plain text characters, and writes the result to standard output (or into output-file, if the -o command line option is used). If no input-urls are specified on the command line, html2text reads from standard input. A dash as the input-url is an alternate way to specify standard input.

       html2text understands all HTML 3.2 constructs, but can render only part of them due to the limitations of the text output format. However, the program attempts to provide good substitutes for the elements it cannot render. html2text parses HTML 4 input, too, but not always as successfully as other HTML processors. It also accepts syntactically incorrect input, and attempts to interpret it "reasonably".

       The way html2text formats the HTML documents is controlled by formatting properties read from an RC file. html2text attempts to read $HOME/.html2textrc (or the file specified by the -rcfile command line option); if that file cannot be read, html2text attempts to read /etc/html2textrc. If no RC file can be read (or if the RC file does not override all formatting properties), then "reasonable" defaults are assumed. The RC file format is described in the html2textrc(5) manual page.

       The (open)SUSE version of html2text can also do input and output recoding. html2text tries to fetch the encoding from the HTML document. If the encoding is not specified, you can use the -ascii and -utf8 options. Output is converted to the user's locale charset (LC_CTYPE).

OPTIONS
       -nometa
              By default, the (open)SUSE version of html2text uses the 'meta http-equiv' tag for input recoding. This option cancels this behavior.

       -ascii
              By default, when -nometa is supplied, html2text uses ISO 8859-1 for the input. Specifying this option, plain ASCII is used instead. To find out how non-ASCII characters are rendered, refer to the file "ascii.substitutes".

       -utf8
              By default, when -nometa is supplied, html2text uses ISO 8859-1 for the input. Specifying this option, UTF-8 is used instead (both for input and output). This option implies -nobs.

       -check
              This option is for diagnostic purposes: The HTML document is only parsed and not processed otherwise. In this mode of operation, html2text will report on parse errors and scan errors, which it does not in other modes of operation. Note that parse and scan errors are not fatal for html2text, but may cause mis-interpretation of the HTML code and/or portions of the document being swallowed.

       -debug-parser
              Let html2text report on the tokens being shifted, rules being applied, etc., while scanning the HTML document. This option is for diagnostic purposes.

       -debug-scanner
              Let html2text report on each lexical token scanned, while scanning the HTML document. This option is for diagnostic purposes.

       -help
              Print command line summary and exit.

       -nobs
              By default, the original html2text renders underlined letters with sequences like "underscore-backspace-character" and boldface letters like "character-backspace-character". Because of issues with UTF-8, the (open)SUSE version of html2text doesn't produce backspaces, so this option really does nothing.

       -o output-file
              Write the output to output-file instead of standard output. A dash as the output-file is an alternate way to specify the standard output.

       -rcfile path
              Attempt to read the file specified in path as RC file.

       -style ( compact | pretty )
              Style pretty changes some of the default values of the formatting parameters documented in html2textrc(5). To find out which and how the formatting parameter defaults are changed, check the file "pretty.style". If this option is omitted, style compact is assumed as default.

       -unparse
              This option is for diagnostic purposes: Instead of formatting the parsed document, generate HTML code that is guaranteed to be syntactically correct. If html2text has problems parsing a syntactically incorrect HTML document, this option may help you to understand what html2text thinks that the original HTML code means.

       -version
              Print program version and exit.

       -width width
              By default, html2text formats the HTML documents for a screen width of 79 characters. If redirecting the output into a file, or if your terminal has a width other than 80 characters, or if you just want to get an idea how html2text deals with large tables and different terminal widths, you may want to specify a different width.

FILES
       /etc/html2textrc
              System wide parser configuration file.

       $HOME/.html2textrc
              Personal parser configuration file, overrides the system wide values.

CONFORMING TO
       HTML 3.2 (HTML 3.2 Reference Specification - http://www.w3.org/TR/REC-html32)

RESTRICTIONS
       The (open)SUSE version of html2text has no HTTP support. Use html2text through pipes with curl or wget instead.

       html2text was written to convert HTML 3.2 documents. When using it with HTML 4 or even XHTML 1 documents, some constructs present only in these HTML versions might not be rendered.

AUTHOR
       html2text was written up to version 1.2.2 by Arno Unkrig <arno@unkrig.de> for GMRS Software GmbH, Unterschleissheim.

       Current maintainer and primary download location is: Martin Bayer <mail@mbayer.de> http://www.mbayer.de/html2text/files.shtml

       This man page was modified for Debian by Eugene V. Lyubimkin <jackyf.devel@gmail.com>

       This man page was modified for (open)SUSE by Klaus Singvogel <klaus@singvogel.net>

SEE ALSO
       html2textrc(5), less(1), more(1)

2008-09-20                                                          html2text(1)
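
The RESTRICTIONS section above suggests feeding html2text through a pipe from curl or wget. A rough sketch of what that might look like follows. The helper name `fetch_as_text` and the curl flags are assumptions made for illustration; the `-utf8` and `-width` options are the ones documented in the man page, and html2text reads standard input when no input-url is given.

```shell
#!/bin/sh
# fetch_as_text is a hypothetical helper name, not part of html2text.
# Usage: fetch_as_text URL [WIDTH]
fetch_as_text() {
    url=$1
    width=${2:-79}   # 79 is the default width stated in the man page
    # curl: -f fail on HTTP errors, -s no progress meter,
    #       -S still print errors, -L follow redirects
    # html2text gets no input-url, so it reads the page from standard input
    curl -fsSL "$url" | html2text -utf8 -width "$width"
}
```

Called as `fetch_as_text URL 100`, this would fetch the page at URL and format it for a 100-column display. Adding `-o output-file` to the html2text call would write to a file instead of standard output.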