Remove html tags with bash Post: 302198091

10 More Discussions You Might Find Interesting

1. Linux

How to remove only html tags inside a file?

Hi All, I have following example file i want to remove all html tags only, Input File: <html> <head> <title>Software Solutions Inc., </title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> </head> <body bgcolor=white leftmargin="0" topmargin="0"...

2. Shell Programming and Scripting

How to use sed to remove html tags including text between them

How to use sed to remove html tags including text between them? Example: User <b> rolvak </b> is stupid. It does not using <b>OOP</b>! and should output: User is stupid. It does not using ! Thank you..

3. Shell Programming and Scripting

HTML code remove

Hello, I have one file which has been inserted intermittently with HTML web page. I would like to remove all text between "<html xmlns="http://www.w3.org/1999/xhtml">" and </html> tags. Can any one please suggest me sed regular expression for it. Thanks

4. Shell Programming and Scripting

remove html tags,consecutive duplicate lines

I need help with a script that will remove all HTML tags from an HTML document and remove any consecutive duplicate lines, and save it as a text document. The user should have the option of including the name of an html file as an argument for the script, but if none is provided, then the script...

5. Shell Programming and Scripting

BASH parsing for html tags

Hello can anyone help me parse this line. <tr><td>United States of America</td><td>Dollar</td><td>43.309</td></tr><tr><td>Japan</td><td>Yen</td><td>0.5579</td></tr> the line above did not break. so i would like to have a result like this United States of America Dollar 43.309 Japan...

6. Shell Programming and Scripting

Parsing HTML, get text between 2 HTML tags

Hi there, I'm quite new to the forum and shell scripting. I want to filter out the "166.0 points". The results, that i found in google / the forum search didn't helped me :( <a href="/user/test" class="headitem menu" style="color:rgb(83,186,224);">test</a><a href="/points" class="headitem...

7. Shell Programming and Scripting

Remove html tags with particular string inside the tags

Could someone, please provide a solution to the following: I would like to remove some tags from the "head" of multiple html documents across the web site. They look like <link rel="alternate" type="application/rss+xml" title="Business and Investment in the Philippines"...

8. Shell Programming and Scripting

Removing all except couple of html tags from html file

I tried to find elegant (or at least simple) way to remove all but couple of html tags from html file, but all examples I found dealt with removing all the tags. The logic of the script would be: - if there is <li> or <ul> on the line, do nothing (=write same line to output) - if there is:...

9. Shell Programming and Scripting

How to remove the values inside the html tags?

Hi, I have a txt file which contain this: <a href="linux">Linux</a> <a href="unix">Unix</a> <a href="oracle">Oracle</a> <a href="perl">Perl</a> I'm trying to extract the text in between these anchor tag and ignoring everything else using grep. I managed to ignore the tags but unable to...

10. Shell Programming and Scripting

How to remove multiline HTML tags from a file?

I am trying to remove a multiline HTML tag and its contents from a few HTML files following the same basic pattern. So far using regex and sed have been unsuccessful. The HTML has a basic structure like this (with the normal HTML stuff around it): <div id="div1"> <div class="div2"> <other...
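
Most of the threads listed above come down to the same task: drop everything between "<" and ">" and keep the text. Here is a minimal sketch of that with sed. The function name strip_tags is made up for illustration and does not come from any of the threads. It assumes each tag opens and closes on the same line and that no attribute value contains a literal ">", so it will not handle the multiline case from thread 10, nor comments or script blocks.

```shell
#!/usr/bin/env bash

# strip_tags: hypothetical helper name used only for this sketch.
# Reads HTML on stdin and writes the text with tags removed to stdout.
# Assumption: a tag never spans more than one line.
strip_tags() {
    # <[^>]*> matches from a "<" up to the next ">" on the same line;
    # the g flag removes every such match on the line.
    sed 's/<[^>]*>//g'
}

# Usage with one of the anchor lines quoted in thread 9:
printf '%s\n' '<a href="linux">Linux</a>' | strip_tags
# prints: Linux
```

For anything less regular than this, a real HTML parser is the safer choice, since a regular expression cannot tell a tag apart from a ">" inside an attribute or a comment.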

LEARN ABOUT DEBIAN

html::linkextractor

LinkExtractor(3pm)					User Contributed Perl Documentation					LinkExtractor(3pm)

NAME

       HTML::LinkExtractor - Extract links from an HTML document

DESCRIPTION

       HTML::LinkExtractor is used for extracting links from HTML.  It is very similar to HTML::LinkExtor, except that besides getting the URL,
       you also get the link-text.

       Example ( please run the examples ):

	   use HTML::LinkExtractor;
	   use Data::Dumper;

	   my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
	   my $LX = new HTML::LinkExtractor();

	   $LX->parse($input);

	   print Dumper($LX->links);
	   __END__
	   # the above example will yield
	   $VAR1 = [
		     {
		       '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
		       'href' => bless(do{(my $o = 'http://perl.com/')}, 'URI::http'),
		       'tag' => 'a'
		     }
		   ];

       "HTML::LinkExtractor" will also correctly extract nested link-type tags.

SYNOPSIS

	   ## the demo
	   perl LinkExtractor.pm
	   perl LinkExtractor.pm file.html otherfile.html

	   ## or if the module is installed, but you don't know where

	   perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
	   perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

	   ## or

	   use HTML::LinkExtractor;
	   use LWP::Simple qw( get );

	   my $base = 'http://search.cpan.org';
	   my $html = get($base.'/recent');
	   my $LX = new HTML::LinkExtractor();

	   $LX->parse($html);

	   print qq{<base href="$base">\n};

	   for my $Link( @{ $LX->links } ) {
	   ## new modules are linked  by /author/NAME/Dist
	       if( $$Link{href}=~ m{^/author/\w+} ) {
		   print $$Link{_TEXT}."\n";
	       }
	   }

	   undef $LX;
	   __END__

	   ## or

	   use HTML::LinkExtractor;
	   use Data::Dumper;

	   my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
	   my $LX = new HTML::LinkExtractor(
	       sub {
		   print Data::Dumper::Dumper(@_);
	       },
	       'http://perlFox.org/',
	   );

	   $LX->parse($input);
	   $LX->strip(1);
	   $LX->parse($input);
	   __END__

	   #### Calculate the total size of a web-page
	   #### adds up the sizes of all the images and stylesheets and stuff

	   use strict;
	   use LWP::Simple;    # exports get() and head()
	   use HTML::LinkExtractor;

	   my $url  = shift || 'http://www.google.com';
	   my $html = get($url);
	   my $Total = length $html;

	   print "initial size $Total\n";

	   my $LX = new HTML::LinkExtractor(
	       sub {
		   my( $X, $tag ) = @_;

		   unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {

		       print "$$tag{tag}\n";

		       for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
			   if( exists $$tag{$urlAttr} ) {
			       my $size = (head( $$tag{$urlAttr} ))[1];
			       $Total += $size if $size;
			       print "adding $size\n" if $size;
			   }
		       }
		   }
	       },
	       $url,
	       0
	   );

	   $LX->parse($html);

	   print "The total size of \n$url\n is $Total bytes\n";
	   __END__

METHODS

   "$LX->new([\&callback, [$baseUrl, [1]]])"
       Accepts 3 arguments, all of which are optional.	If for example you want to pass a $baseUrl, but don't want to have a callback invoked,
       just put "undef" in place of a subref.

       This is the only class method.

       1.  a callback ( a sub reference, as in "sub{}", or "\&sub") which is to be called each time a new LINK is encountered ( for
	   @HTML::LinkExtractor::TAGS_IN_NEED this means
	    after the closing tag is encountered )

	   The callback receives an object reference($LX) and a link hashref.

       2.  a base URL ( URI->new, so it's up to you to make sure it's valid ) which is used to convert all relative URI's to absolute ones.

	       $ALinkP{href} = URI->new_abs( $ALink{href}, $base );

       3.  A "boolean" (just stick with 1).  See the example in "DESCRIPTION".	Normally, you'd get back _TEXT that looks like

	       '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',

	   If you turn this option on, you'll get the following instead

	       '_TEXT' => ' I am a LINK!!! ',

	   The private utility function "_stripHTML" does this by using HTML::TokeParser's method get_trimmed_text.

	   You can turn this feature on and off by using "$LX->strip(undef || 0 || 1)"

   "$LX->parse( $filename || *FILEHANDLE || $FileContent )"
       Each time you call "parse", you should pass it a $filename, a *FILEHANDLE, or a "$FileContent".

       Each time you call "parse" a new "HTML::TokeParser" object is created and stored in "$this->{_tp}".

       You shouldn't need to mess with the TokeParser object.

   "$LX->links()"
       Only after you call "parse" will this method return anything.  This method returns a reference to an ArrayOfHashes, which basically looks
       like (Data::Dumper output)

	   $VAR1 = [ { tag => 'img', src => 'image.png' }, ];

       Please note that if you provide a callback this array will be empty.

   "$LX->strip( [ 0 || 1 ])"
       If you pass in "undef" (or nothing), returns the state of the option.  Passing in a true or false value sets the option.

       If you wanna know what the option does see "$LX->new([\&callback, [$baseUrl, [1]]])"

WHAT'S A LINK-type tag
       Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be link-type-tag.

       Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI's (the links!!)

       Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like "<a href="#"> TEST
       </a>"

   How can that be?!?!
       I took a look at %HTML::Tagset::linkElements and the following URL's

	   http://www.blooberry.com/indexdot/html/tagindex/all.htm

	   http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
	   http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
	   http://www.blooberry.com/indexdot/html/tagpages/a/area.htm

	   http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
	   http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm

	   http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
	   http://www.blooberry.com/indexdot/html/tagpages/d/div.htm

	   http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
	   http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm

	   http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm

	   http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
	   http://www.blooberry.com/indexdot/html/tagpages/l/link.htm

	   http://www.blooberry.com/indexdot/html/tagpages/o/object.htm

	   http://www.blooberry.com/indexdot/html/tagpages/q/q.htm

	   http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
	   http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

	   And the special cases

	   <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
	   http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
	   '!doctype'  is really a process instruction, but is still listed
	   in %TAGS with 'url' as the attribute

	   and

	   <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
	   http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
	   If there is a valid url, 'url' is set as the attribute.
	   The meta tag has no 'attributes' listed in %TAGS.

SEE ALSO

       HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.

AUTHOR

       D.H (PodMaster)

       Please use http://rt.cpan.org/ to report bugs.

       Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.

LICENSE

       Copyright (c) 2003, 2004 by D.H. (PodMaster).  All rights reserved.

       This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.  The LICENSE file contains the
       full text of the license.

perl v5.10.1							    2005-01-07							LinkExtractor(3pm)
