Extracting anchor text and its URL from HTML files in BASH Post: 302504774

Top Forums Shell Programming and Scripting Extracting anchor text and its URL from HTML files in BASH Post 302504774 by shoaibjameel123 on Tuesday 15th of March 2011 11:48:35 AM

Hi All,

I have some HTML files and need to extract every anchor text, together with its URL, and store the results in a separate text file with the two fields separated by a space. For example,

Code:

<a href="/kid/stay_healthy/">Staying Healthy</a>

which has /kid/stay_healthy/ as the URL (or path) and Staying Healthy as the anchor text.
I want to extract both and store them in a text file, separated by a space, like this:

Code:

/kid/stay_healthy/ Staying Healthy

Each subsequent path and anchor text pair goes on its own line, and so on.

This is what I have tried so far. I got this code from the internet (to be very honest!):

Code:

awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
  for(q=1;q<=NF;q++){
    if ( $q ~ /href/){
      gsub(/.*href=\042/,"",$q)
      gsub(/\042.*/,"",$q)
      print $(q)
    }
  }
}' file1.html

The problem with the above code is twofold: first, it extracts only the URL and not the anchor text; second, it processes only a single HTML file. Storing the result in a separate file is not an issue, since I can just redirect the output to a text file using >
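
One possible way to get both fields from several files is sketched below. It is only a sketch under stated assumptions, not a full HTML parser: it assumes every href value is wrapped in double quotes, that anchors are not nested, and that tag names are plain ASCII. The function name extract_links is made up for illustration. It avoids the gawk-only IGNORECASE variable and a multi-character RS by joining the lines of each file and searching a lower-cased copy of the text.

```shell
# extract_links FILE...
# Hypothetical helper: prints one "URL anchor-text" line for each
# <a ... href="...">...</a> element found in the given files.
extract_links() {
  for f in "$@"; do
    awk '
      # Join all lines so anchors that wrap across lines still match.
      { doc = doc " " $0 }
      END {
        low = tolower(doc)   # lower-cased copy, used only for searching
        while (match(low, /<a[ \t][^>]*href="[^"]*"[^>]*>/)) {
          tag  = substr(doc, RSTART, RLENGTH)   # opening tag, original case
          ltag = substr(low, RSTART, RLENGTH)
          doc  = substr(doc, RSTART + RLENGTH)  # text after the opening tag
          low  = substr(low, RSTART + RLENGTH)

          # URL: everything between href=" and the next double quote.
          url = substr(tag, index(ltag, "href=\"") + 6)
          url = substr(url, 1, index(url, "\"") - 1)

          # Anchor text: everything up to the closing </a>.
          e = index(low, "</a>")
          if (e == 0) break                     # unterminated anchor, stop
          text = substr(doc, 1, e - 1)
          doc  = substr(doc, e + 4)
          low  = substr(low, e + 4)

          gsub(/<[^>]*>/, "", text)             # drop nested tags such as <b>
          gsub(/[ \t]+/, " ", text)             # collapse runs of whitespace
          sub(/^ /, "", text); sub(/ $/, "", text)
          print url, text
        }
      }' "$f"
  done
}
```

Called as extract_links *.html > links.txt, this would write one pair per line for every HTML file in the current directory. Pages with single-quoted or unquoted href values would need a real parser such as Perl's HTML::Parser.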

LEARN ABOUT MOJAVE

html::filter

HTML::Filter(3) 					User Contributed Perl Documentation					   HTML::Filter(3)

NAME

       HTML::Filter - Filter HTML text through the parser

NOTE

       This module is deprecated. The "HTML::Parser" now provides the functionality of "HTML::Filter" much more efficiently with the "default"
       handler.

SYNOPSIS

	require HTML::Filter;
	$p = HTML::Filter->new->parse_file("index.html");

DESCRIPTION

       "HTML::Filter" is an HTML parser that by default prints the original text of each HTML element (a slow version of cat(1) basically).  The
       callback methods may be overridden to modify the filtering for some HTML elements and you can override the output() method which is called to
       print the HTML text.

       "HTML::Filter" is a subclass of "HTML::Parser". This means that the document should be given to the parser by calling the $p->parse() or
       $p->parse_file() methods.

EXAMPLES

       The first example is a filter that will remove all comments from an HTML file.  This is achieved by simply overriding the comment method to
       do nothing.

	 package CommentStripper;
	 require HTML::Filter;
	 @ISA=qw(HTML::Filter);
	 sub comment { }  # ignore comments

       The second example shows a filter that will remove any <TABLE>s found in the HTML file.  We specialize the start() and end() methods to
       count table tags and then make output not happen when inside a table.

	 package TableStripper;
	 require HTML::Filter;
	 @ISA=qw(HTML::Filter);
	 sub start
	 {
	    my $self = shift;
	    $self->{table_seen}++ if $_[0] eq "table";
	    $self->SUPER::start(@_);
	 }

	 sub end
	 {
	    my $self = shift;
	    $self->SUPER::end(@_);
	    $self->{table_seen}-- if $_[0] eq "table";
	 }

	 sub output
	 {
	     my $self = shift;
	     unless ($self->{table_seen}) {
		 $self->SUPER::output(@_);
	     }
	 }

       If you want to collect the parsed text internally you might want to do something like this:

	 package FilterIntoString;
	 require HTML::Filter;
	 @ISA=qw(HTML::Filter);
	 sub output { push(@{$_[0]->{fhtml}}, $_[1]) }
	 sub filtered_html { join("", @{$_[0]->{fhtml}}) }

SEE ALSO

       HTML::Parser

COPYRIGHT

       Copyright 1997-1999 Gisle Aas.

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.18.2							    2013-03-25							   HTML::Filter(3)
