Multiline html tag parse shell script Post: 303040151

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

How do I extract text only from html file without HTML tag

I have a html file called myfile. If I simply put "cat myfile.html" in UNIX, it shows all the html tags like <a href=r/26><img src="http://www>. But I want to extract only text part. Same problem happens in "type" command in MS-DOS. I know you can do it by opening it in Internet Explorer,...

2. Shell Programming and Scripting

how to use html tag in shell scripting

Hi friends, I have a small question: how can we use HTML tags in shell scripting? Code: echo "<html>" echo "<body>" echo " welcome to peace world " echo "</body>" echo "</html>" The output is displayed like this: <html> <body> welcome to peace world </body> </html>

3. UNIX for Advanced & Expert Users

shell script to parse html file

hi all, i have a html file something similar to this. <tr class="evenrow"> <td class="data">added</td><td class="data">xyz@abc.com</td> <td class="data">filename.sql</td><td class="modifications-data">08/25/2009 07:58:40</td><td class="data">Added TK prof script</td> </tr> <tr...

4. Shell Programming and Scripting

Parse HTML tag parameters and text

Hi! I have a bunch of HTML files, which I want to parse to CSV files. Every page has a table in it, and I need to parse each row into a csv record. With awk and sed, I managed to put every table row in separate lines. So my file looks like this: <TR> .... </TR> <TR> .... </TR> ...One...

5. Shell Programming and Scripting

Script to delete HTML tag

Guys, I have a little script that I got off the internet and that I use in Squid to block ads. I used that script with Linux, but now I have moved my servers to FreeBSD. I have a steep learning curve there, but it is fun. Back to the script issue: the script used to work with Linux but...

6. Shell Programming and Scripting

awk Script to parse a XML tag

I have an XML tag like this: <property name="agent" value="/var/tmp/root/eclipse" /> Is there a way, using awk, that I can get the value from the above tag? So the output should be: /var/tmp/root/eclipse Help will be appreciated. Regards, Adi

7. Shell Programming and Scripting

Search for a html tag and print the entire tag

I want to print from <fruits> to </fruits> tag which have <fruit> as mango. Also i want both <fruits> and </fruits> in output. Please help eg. <fruits> <fruit id="111">mango<fruit> . another 20 lines . </fruits>

8. Shell Programming and Scripting

Using shell command need to parse multiple nested tag value of a XML file

I have this XML file - <gp> <mms>1110012</mms> <tg>988</tg> <mm>LongTime</mm> <lv> <lkid>StartEle=ONE, Desti = Motion</lkid> <kk>12</kk> </lv> <lv> <lkid>StartEle=ONE, Source = Velocity</lkid> <kk>2</kk> </lv> <lv> ...

9. Shell Programming and Scripting

XML Parse between to tag with upper tag

Hi Guys Here is my Input : <?xml version="1.0" encoding="UTF-8"?> <xn:MeContext id="01736"> <xn:VsDataContainer id="01736"> <xn:attributes> <xn:vsDataType>vsDataMeContext</xn:vsDataType> ...

10. Shell Programming and Scripting

How to remove html tag which has multiple lines in SHELL?

I want to clean a html file. I try to remove the script part in the html and remove the rest of tags and empty lines. The code I try to use is the following: sed '/<script/,/<\/script>/d' webpage.html | sed -e 's/<*>//g' | sed '/^\s*$/d' > output.txt However, in this method, I can not...
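
A rough sketch of one way to approach the multi-line case, not a definitive answer. As quoted, 's/<*>//g' matches zero or more '<' characters followed by '>', so it does not remove whole tags; the sketch below uses '<[^>]*>' instead and keeps appending input lines while an unclosed '<' remains. The function name strip_html is made up for illustration, and the loop assumes every '<' in the file starts a tag (a literal '<' in the text would pull in following lines).

```shell
#!/bin/sh
# strip_html: hypothetical helper, reads HTML on stdin, writes text on stdout.
strip_html() {
    # 1. Delete every line from one containing "<script" through the
    #    next one containing "</script>".
    # 2. Remove complete tags; if a "<" is still left, the tag continues
    #    on the next line, so append that line (N) and try again.
    # 3. Delete lines that are empty or contain only whitespace.
    sed -e '/<script/,/<\/script>/d' |
    sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}' |
    sed -e '/^[[:space:]]*$/d'
}

# Example: a <p> tag whose attributes spill onto a second line.
printf '%s\n' '<p' '  class="intro">Hello <b>world</b></p>' | strip_html
# prints: Hello world
```

For a file, the call would be: strip_html < webpage.html > output.txt. For anything beyond simple pages, a real HTML parser is usually the safer choice.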

LinkExtractor(3pm)					User Contributed Perl Documentation					LinkExtractor(3pm)

NAME

       HTML::LinkExtractor - Extract links from an HTML document

DESCRIPTION

       HTML::LinkExtractor is used for extracting links from HTML.  It is very similar to HTML::LinkExtor, except that besides getting the URL,
       you also get the link-text.

       Example ( please run the examples ):

	   use HTML::LinkExtractor;
	   use Data::Dumper;

	   my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
	   my $LX = new HTML::LinkExtractor();

	   $LX->parse(\$input);

	   print Dumper($LX->links);
	   __END__
	   # the above example will yield
	   $VAR1 = [
		     {
		       '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
		       'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
		       'tag' => 'a'
		     }
		   ];

       "HTML::LinkExtractor" will also correctly extract nested link-type tags.

SYNOPSIS

	   ## the demo
	   perl LinkExtractor.pm
	   perl LinkExtractor.pm file.html othefile.html

	   ## or if the module is installed, but you don't know where

	   perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
	   perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

	   ## or

	   use HTML::LinkExtractor;
	   use LWP::Simple qw( get );

	   my $base = 'http://search.cpan.org';
	   my $html = get($base.'/recent');
	   my $LX = new HTML::LinkExtractor();

	   $LX->parse(\$html);

	   print qq{<base href="$base">\n};

	   for my $Link( @{ $LX->links } ) {
	   ## new modules are linked  by /author/NAME/Dist
	       if( $$Link{href}=~ m{^/author/\w+} ) {
		   print $$Link{_TEXT}."\n";
	       }
	   }

	   undef $LX;
	   __END__

	   ## or

	   use HTML::LinkExtractor;
	   use Data::Dumper;

	   my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
	   my $LX = new HTML::LinkExtractor(
	       sub {
		   print Data::Dumper::Dumper(@_);
	       },
	       'http://perlFox.org/',
	   );

	   $LX->parse(\$input);
	   $LX->strip(1);
	   $LX->parse(\$input);
	   __END__

	   #### Calculate the total size of a web page:
	   #### add up the sizes of all the images, stylesheets and so on

	   use strict;
	   use LWP::Simple;   # exports get() and head()
	   use HTML::LinkExtractor;

	   my $url  = shift || 'http://www.google.com';
	   my $html = get($url);
	   my $Total = length $html;

	   print "initial size $Total\n";

	   my $LX = new HTML::LinkExtractor(
	       sub {
		   my( $X, $tag ) = @_;

		   # skip the tags listed in @TAGS_IN_NEED (e.g. <a>)
		   unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {

		       print "$$tag{tag}\n";

		       for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
			   if( exists $$tag{$urlAttr} ) {
			       # head() in list context: element 1 is the document length
			       my $size = (head( $$tag{$urlAttr} ))[1];
			       $Total += $size if $size;
			       print "adding $size\n" if $size;
			   }
		       }
		   }
	       },
	       $url,
	       0
	   );

	   $LX->parse(\$html);

	   print "The total size of \n$url\n is $Total bytes\n";
	   __END__

METHODS

   "$LX->new([\&callback, [$baseUrl, [1]]])"
       Accepts 3 arguments, all of which are optional.	If, for example, you want to pass a $baseUrl but don't want to have a callback invoked,
       just put "undef" in place of a subref.

       This is the only class method.

       1.  a callback ( a sub reference, as in "sub{}", or "\&sub") which is to be called each time a new LINK is encountered ( for
	   @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered )

	   The callback receives an object reference($LX) and a link hashref.

       2.  and a base URL ( URI->new, so it's up to you to make sure it's valid ) which is used to convert all relative URIs to absolute ones.

	       $ALinkP{href} = URI->new_abs( $ALink{href}, $base );

       3.  A "boolean" (just stick with 1).  See the example in "DESCRIPTION".	Normally, you'd get back _TEXT that looks like

	       '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',

	   If you turn this option on, you'll get the following instead

	       '_TEXT' => ' I am a LINK!!! ',

	   The private utility function "_stripHTML" does this by using HTML::TokeParser's method get_trimmed_text.

	   You can turn this feature on and off by using "$LX->strip(undef || 0 || 1)"

   "$LX->parse( $filename || *FILEHANDLE || \$FileContent )"
       Each time you call "parse", you should pass it a $filename, a *FILEHANDLE or a "\$FileContent" (a reference to a string holding the HTML).

       Each time you call "parse" a new "HTML::TokeParser" object is created and stored in "$this->{_tp}".

       You shouldn't need to mess with the TokeParser object.

   "$LX->links()"
       Only after you call "parse" will this method return anything.  This method returns a reference to an ArrayOfHashes, which basically looks
       like (Data::Dumper output)

	   $VAR1 = [ { tag => 'img', src => 'image.png' }, ];

       Please note that if you provide a callback this array will be empty.

   "$LX->strip( [ 0 || 1 ])"
       If you pass in "undef" (or nothing), returns the state of the option.  Passing in a true or false value sets the option.

       If you want to know what the option does, see "$LX->new([\&callback, [$baseUrl, [1]]])"

WHAT'S A LINK-type tag
       Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.

       Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI's (the links!!)

       Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like "<a href="#"> TEST
       </a>"

   How can that be?!?!
       I took a look at %HTML::Tagset::linkElements and the following URLs

	   http://www.blooberry.com/indexdot/html/tagindex/all.htm

	   http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
	   http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
	   http://www.blooberry.com/indexdot/html/tagpages/a/area.htm

	   http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
	   http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm

	   http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
	   http://www.blooberry.com/indexdot/html/tagpages/d/div.htm

	   http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
	   http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm

	   http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
	   http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm

	   http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
	   http://www.blooberry.com/indexdot/html/tagpages/l/link.htm

	   http://www.blooberry.com/indexdot/html/tagpages/o/object.htm

	   http://www.blooberry.com/indexdot/html/tagpages/q/q.htm

	   http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
	   http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

	   And the special cases

	   <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
	   http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
	   '!doctype'  is really a process instruction, but is still listed
	   in %TAGS with 'url' as the attribute

	   and

	   <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
	   http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
	   If there is a valid url, 'url' is set as the attribute.
	   The meta tag has no 'attributes' listed in %TAGS.

SEE ALSO

       HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.

AUTHOR

       D.H (PodMaster)

       Please use http://rt.cpan.org/ to report bugs.

       Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.

LICENSE

       Copyright (c) 2003, 2004 by D.H. (PodMaster).  All rights reserved.

       This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.  The LICENSE file contains the
       full text of the license.

perl v5.10.1							    2005-01-07							LinkExtractor(3pm)

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

How do I extract text only from html file without HTML tag

Discussion started by: los111

2. Shell Programming and Scripting

how to use html tag in shell scripting

Discussion started by: jrex1983

3. UNIX for Advanced & Expert Users

shell script to parse html file

Discussion started by: sais

4. Shell Programming and Scripting

Parse HTML tag parameters and text

Discussion started by: senszey