Extract URLs from HTML code using sed


 
# 8  
Old 11-29-2009
Hi.

You can use lynx to do most of the hard work:
Code:
#!/usr/bin/env bash

# @(#) s3	Demonstrate extraction of links with "lynx".

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) lynx 
set -o nounset
echo

URL="http://www.google.com/search?q=chondroitin&num=100&start=200"

echo " Looking at URL: $URL"

echo
echo " Results:"
lynx -dump -listonly -hiddenlinks=ignore "$URL" |
grep -i -v google |
sed -e 's/^[[:space:]]*[0-9.]*[[:space:]]*//' > t1

echo " Extracted about $( wc -l <t1 ) non-google links; first 10:"
head -10 t1

exit 0

producing:
Code:
% ./s3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
Lynx Version 2.8.7dev.9 (27 Apr 2008)

 Looking at URL: http://www.google.com/search?q=chondroitin&num=100&start=200

 Results:
 Extracted about 157 non-google links; first 10:

References

http://www.wisegeek.com/what-is-chondroitin.htm
http://74.125.95.132/search?q=cache:ZwdfFnJUbecJ:www.wisegeek.com/what-is-chondroitin.htm+chondroitin&cd=201&hl=en&ct=clnk&gl=us&ie=UTF-8
http://shaokang2002.en.busytrade.com/products/info/24144/_Chondroitin_sulphate.html
http://74.125.95.132/search?q=cache:B32adgCixNAJ:shaokang2002.en.busytrade.com/products/info/24144/_Chondroitin_sulphate.html+chondroitin&cd=202&hl=en&ct=clnk&gl=us&ie=UTF-8
http://www.ebmonline.org/cgi/content/abstract/226/2/144
http://www.ebmonline.org/cgi/content/abstract/230/4/255
http://linkinghub.elsevier.com/retrieve/pii/S0255270106002182

Some clean-up is done by filtering out the google references and removing the leading whitespace and index number from each line ... cheers, drl
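For reference, `lynx -dump -listonly` prefixes each link with whitespace and an index number, which the `sed` substitution strips. A minimal sketch (the URLs below are made up for illustration):

```shell
# Sample lines in the format "lynx -dump -listonly" produces:
printf '%s\n' \
  '  201. http://www.example.com/page.html' \
  '  202. http://www.example.org/other.html' |
sed -e 's/^[[:space:]]*[0-9.]*[[:space:]]*//'   # strip indent and "NNN. "
```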
# 9  
Old 11-29-2009
Try this awk solution
Code:
wget -q -U "Mozilla/5.001" -O - "http://www.google.com/search?q=searchphrase&num=100&start=200" | awk -F\" '{for(i=0;++i<=NF;){if($i ~ /^http/ && $i !~ /google|74\.125\.95/){print $i}}}'


Last edited by danmero; 11-29-2009 at 12:24 PM. Reason: Filter out google links
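The idea can be tried on a canned snippet without hitting Google at all: splitting on double quotes with `-F\"` turns every quoted attribute value into its own field, and the loop keeps the fields that look like URLs. The sample HTML and URLs here are hypothetical:

```shell
# Split on double quotes; every href value becomes its own field.
echo '<a href="http://www.example.com/">x</a> <a href="http://www.google.com/">g</a>' |
awk -F\" '{ for (i = 0; ++i <= NF;) if ($i ~ /^http/ && $i !~ /google/) print $i }'
```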
# 10  
Old 11-29-2009
A gawk variant (a multi-character record separator requires gawk):
Code:
wget -q -U "Mozilla/5.001" -O - "http://www.google.com/search?q=searchphrase&num=100&start=200" | awk -v RS='</a>' '
{
  gsub(/.*<a +href=\042/,"")   # strip up to and including the opening quote (\042 = ")
  gsub(/\042.*/,"")            # strip the closing quote and everything after it
  print
}'
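A self-contained sketch of the same record-splitting trick: with `RS='</a>'` every anchor becomes one awk record. A guard is added here so trailing non-anchor markup is not printed; the sample HTML is made up:

```shell
echo '<p><a href="http://www.example.com/a">one</a> text <a href="http://www.example.net/b">two</a></p>' |
awk -v RS='</a>' '
{
  gsub(/.*<a +href=\042/, "")  # drop everything up to the opening quote
  gsub(/\042.*/, "")           # drop the closing quote and the rest
  if ($0 ~ /^http/) print      # guard: skip records with no extracted URL
}'
```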

# 11  
Old 11-29-2009
Or:

Code:
wget -q -U "Mozilla/5.001" -O - 'http://www.google.com/search?q=searchphrase&num=100&start=200' | 
  perl -lne'
    print $1 while /<\s*a\s+href\s*=\s*["\047]([^"\047]+)/ig
    '

# 12  
Old 11-30-2009
Hello, and thanks to all for so many replies!

@ghostdog74 and radoulov:
I get all links (everything that appears after href=" in the code) with your solutions, but I need just the search results.

@danmero and drl:
Your scripts produce more useful results, but the google and google-cache URLs must be kept out.

KenJackson's post brings just five lines of code, sorry, but where are the links?

And Scrutinizer's solution is the best presented here. It is simple and it works.
Code:
wget -q -U "Mozilla/5.001" -O - "http://www.google.com/search?q=searchphrase&num=100&start=200" | \
grep -o '<a href="http[^"]*"'|grep -v 'search?q=cache:'|grep -v '\.google\.'|sed 's/<a href="//;s/"$//'

Some filtering is needed, but it's OK.
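The pipeline is easy to trace when run against a canned line instead of a live Google query (the sample anchors are made up): `grep -o` keeps only the matching anchor fragments, the two `grep -v` stages drop cache and google links, and `sed` peels off the surrounding markup:

```shell
echo '<a href="http://www.example.com/">x</a> <a href="http://www.google.com/search?q=cache:abc">c</a>' |
grep -o '<a href="http[^"]*"' |   # keep only the href="http..." fragments
grep -v 'search?q=cache:' |       # drop google-cache links
grep -v '\.google\.' |            # drop google's own links
sed 's/<a href="//;s/"$//'        # strip the leading tag and trailing quote
```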

Can anyone make this solution even simpler?


# 13  
Old 11-30-2009
Quote:
Originally Posted by L0rd
@danmero and drl:
Your scripts produce more useful results, but the google and google-cache URLs must be kept out.
Code:
wget -q -U "Mozilla/5.001" -O - "http://www.google.com/search?q=searchphrase&num=100&start=200" | awk -F\" '{for(i=0;++i<=NF;){if($i ~ /^http/ && $i !~ /google|cache:/){print $i}}}'

And the output is (first and last 10 records):
Code:
http://www.youtube.com/results?q=searchphrase&num=100&um=1&ie=UTF-8&sa=N&hl=en&tab=w1
http://www.christopherwardforum.com/viewtopic.php?f=3&t=372
https://www.unix.com/shell-programming-scripting/113921-getting-15-characters-after-search-phrase.html
http://www.zeromillion.com/srs-search-phrase
http://www.fwicki.com/fwickis/science/Humanities
http://www.articlealley.com/tags-9091.html
http://www.workz.com/content/view_content.html?section_id=466&content_id=6151
http://www.onlinemarketingelite.com/tag/search-phrase/
http://www.discoveres.com/search-phrase_SRS/
http://www.articlesfactory.com/search/Search%20Phrase/
......
http://www.learninghownow.com/blog/tag/search-phrase/
http://www.quakerranter.org/tag/search%20phrase
http://forums.digital-m.co.za/showthread.php?t=73
http://drupal.org/node/527084
http://www.smartertools.com/forums/ThreadNavigation.aspx?PostID=57990&NavType=Previous
http://www.tversoft.com/computer/search-phrase.html
http://www.hotfroguk.co.uk/Companies/Search-Phrase-Builder
http://search.infospace.com/ispace/ws/redir/qcat=News/qcoll=relevance/qkw=Enter%20A%20Search%20Phrase/rfcp=RightNav/rfcid=302364/_iceUrlFlag=11?_IceUrl=true
http://www.stumbleupon.com/stumbler/kaylavincent/tag/search-phrase/
http://en.drigger.com/e/1012902/obscure_search_phrase

# 14  
Old 12-01-2009
Quote:
Originally Posted by L0rd
KenJackson's post brings just five lines of code, sorry, but where are the links?
I didn't give a complete solution; that's why I called it a starter and referenced the looping.

I am awed by the power of sed. I routinely use its regular-expression capability, but I rarely use the hold buffer and looping commands. It has been my goal for some time to become skilful at using these. Your question would have been the perfect opportunity for me to dig in and come up with a solution that demonstrates that power. But I flat did not have the time then, and it looks like you now have a solution that you find satisfying. I'll work on it off-line.
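As a small taste of that power, the classic `tac` emulation uses only the hold buffer: `h` copies the pattern space into the hold space, and `G` appends the hold space back onto each new line, so the lines accumulate in reverse order and are printed once at the end:

```shell
# Reverse the lines of the input using sed's hold space.
printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p'
```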

Quote:
Originally Posted by L0rd
And Scrutinizer's solution is the best presented here. It is simple and it works.
Yeah, I've noticed Scrutinizer writes some good and straight-forward code. Stick with him.