How to extract url from html page?

10-17-2010

Administrator

Join Date: Sep 2000

Last Activity: 15 July 2022, 8:51 AM EDT

Location: Asia Pacific, Cyberspace, in the Dark Dystopia

Posts: 19,118

Thanks Given: 2,351

Thanked 3,359 Times in 1,878 Posts

I used to use RegexBuddy (to create and test regex) for this. It came with some stock regexes that were quite good for extracting URLs from text. It's a great tool, but as I recall it only runs on Windows (and on Linux under Wine). With it you can create, test, and debug complex regexes, and even optimize them for performance. Then you copy and paste the regex into your code or application. I highly recommend it. I would be running it now, but my XP machine died and I'm running OS X on the desktop and only Android on the go.

Neo

10-17-2010

Registered User

Join Date: Dec 2009

Last Activity: 6 February 2016, 7:08 AM EST

Posts: 602

Thanks Given: 2

Thanked 83 Times in 78 Posts

Quote:

Originally Posted by Scrutinizer

This is because there are underline tags with angle brackets in the description. I give up without a library.

Nevertheless, it's a good effort.

---------- Post updated at 03:02 AM ---------- Previous update was at 02:58 AM ----------

Quote:

Originally Posted by Neo

I used to use RegexBuddy (to create and test regex) for this. It came with some stock regexes that were quite good for extracting URLs from text. It's a great tool, but as I recall it only runs on Windows (and on Linux under Wine).

There are also many online sites for creating and testing regex. That said, regex is really not the best tool for parsing HTML, unless the requirement is really, really simple.

kurumi

10-17-2010

Quote:

Originally Posted by kurumi

There are also many online sites for creating and testing regex. That said, regex is really not the best tool for parsing HTML, unless the requirement is really, really simple.

Hi kurumi,

I thought this discussion was about extracting URLs from HTML, not parsing HTML.

There is a difference, you know, between a generic HTML parser, and simply extracting a URL.

URLs can easily be extracted with regex.

Neo
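
A minimal sketch of the kind of regex-only extraction described in the post above, assuming the URLs of interest sit in double-quoted href attributes. The function name extract_hrefs is made up for illustration. The pattern is deliberately naive: it also picks up href on non-anchor tags such as link, matches inside HTML comments, and misses single-quoted, unquoted, or upper-case HREF attributes.

```shell
# Hypothetical helper: list the value of every double-quoted href attribute.
# Pure pattern matching; it knows nothing about HTML structure.
# Reads the files given as arguments, or stdin when there are none.
extract_hrefs() {
  # -o prints only the matching part, -h suppresses file name prefixes
  grep -ohE 'href="[^"]*"' "$@" |
    sed 's/^href="//; s/"$//'      # strip the attribute name and the quotes
}
```

It can be called as `extract_hrefs page.html` (page.html being a placeholder file name) or at the end of a pipeline. Note that it returns only the URLs, not the link text.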

10-17-2010

Moderator

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

The added difficulty in this case is that, besides the URLs, the descriptions also had to be extracted, and these can themselves contain tags with angle brackets, which I used as record separators. This became too complicated with the approach I had chosen, where I wanted to allow the tags to be spread out over multiple lines. My approach would work fine in many situations, though.

Scrutinizer
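
The record-separator idea can be sketched roughly as follows. This is not Scrutinizer's actual gawk code, only an illustration under simpler assumptions: "<" is used as the awk record separator, only double-quoted lower-case href attributes are handled, and the function name extract_hrefs_multiline is made up. Because a record ends at the next "<" rather than at a newline, an anchor tag spread over several lines still arrives as a single record.

```shell
# Hypothetical helper: print the href of every <a ...> tag,
# even when the tag is spread over several lines.
extract_hrefs_multiline() {
  awk 'BEGIN { RS = "<" }                   # every record starts just after a "<"
       /^[aA][ \t\n]/ {                     # record begins with an anchor tag
         tag = $0
         sub(/>.*/, "", tag)                # keep only the tag, drop the text after it
         if (match(tag, /href="[^"]*"/))    # double-quoted href only
           print substr(tag, RSTART + 6, RLENGTH - 7)
       }' "$@"
}
```

The same splitting is what makes the descriptions hard: link text such as `<u>Second</u>` is cut into several records at each "<", so it has to be stitched back together before it can be printed next to its URL.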

10-17-2010

Quote:

Originally Posted by Scrutinizer

The added difficulty in this case is that, besides the URLs, the descriptions also had to be extracted, and these can themselves contain tags. This became too complicated with the approach I had chosen.

Yes, I understand....

I have seen efficient regexes that can easily extract entire URLs, even with tags and more complex, generalized URLs. I don't have them in front of me, so I can't back up my claims at the moment.

Neo

10-17-2010

Hi Neo

Quote:

Originally Posted by Neo

I thought this discussion was about extracting URLs from HTML, not parsing HTML.

I don't think there is any real difference between extraction and parsing. Either way, we are getting information out of something.

Quote:

Originally Posted by Neo

There is a difference, you know, between a generic HTML parser, and simply extracting a URL.
URLs can easily be extracted with regex.

Yup, URLs can indeed be easily extracted (or can they? well...). That is, if the requirement is only URLs and nothing else. But that is not the case for this particular thread, since the OP wanted to get the inner text as well. As Scrutinizer demonstrated, it's possible using gawk and regex, but some corner cases are still left out. Anyway, I think the OP (wherever he is) will find Scrutinizer's gawk code good enough for his purpose.

kurumi

10-17-2010

Different approach

Code:

sed 's|</a>|&\n|g' infile | sed -n '/<a /s|.*<a [^>]*href="\([^"]*\)[^>]*>\(.*\)</a>$|\1 \2|p'

Last edited by Scrutinizer; 10-17-2010 at 07:51 AM..

Scrutinizer
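
To see what the two sed stages in the post above do, here is a small check with the same pipeline wrapped in a function that reads stdin instead of infile. The function name links_with_text and the sample HTML are made up for illustration. It assumes GNU sed, since `\n` in the replacement part of `s` is not portable to every sed.

```shell
# Hypothetical wrapper around the two-stage sed pipeline; reads HTML on stdin.
# Assumes GNU sed: \n in the replacement inserts a newline there.
links_with_text() {
  # Stage 1: break the line after every </a>, so each anchor ends its own line.
  # Stage 2: on lines containing "<a ", print "URL description".
  sed 's|</a>|&\n|g' |
    sed -n '/<a /s|.*<a [^>]*href="\([^"]*\)[^>]*>\(.*\)</a>$|\1 \2|p'
}

# Made-up sample: two anchors on one line, the second with an underline tag inside.
printf '%s\n' '<p><a href="http://a.example/">First</a> or <a class="x" href="/b"><u>Second</u></a></p>' |
  links_with_text
```

With GNU sed the sample prints `http://a.example/ First` and `/b <u>Second</u>`. Tags inside the link text are passed through untouched, and an anchor that is split across input lines is not matched.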
