Extracting urls from curl output

01-16-2016


Hello.

I use curl to fetch a website, and then I want to extract the URLs from curl's output.

I tried both sed and grep, but couldn't figure it out.

I've tried:

Code:

sed -n 's/href="\([^"]*\).*/\1/p' results.txt

and grep -o

Code:

grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">' results.txt


Which pattern should I use, and what's wrong with mine?

EDIT:

Added some of the data I use

EDIT 2:
Removed the data sample because it ruins the thread width; just curl any website and use that output as data.

Last edited by jozo95; 01-16-2016 at 05:11 PM..

jozo95


01-16-2016


Hello jozo95,

Could you please try the following and let me know if this helps you.

Code:

awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Output will be as follows.

Code:

<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">

Thanks,
R. Singh

RavinderSingh13


01-16-2016


Quote:

Originally Posted by RavinderSingh13

Hello jozo95,

Could you please try the following and let me know if this helps you.

Code:

awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Thanks,
R. Singh

Doesn't work :/

I get output like this along the way:

Code:

g">Studievägledning</a></li><li><a href="http://www.bth.se/jobb">Lediga tjänster</a></li></ul></div><div class="footer-section set"><h5><br></h5><ul class="footer-list"><li><a href="http://www.bth.se/web/ombth.nsf/sidor/organisation">Organisation</a></li><li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">BTH Career Gate</a></li><li><a href="http://www.bth.se/for/Sakerhet.nsf/sidor/593cb6bf948640dac1257f1f00365b42?OpenDocument">I händelse av kris</a></li><li><a href=""></a></li><li><a href=""></a></li></ul> </div></span></div></div></div><div class="footer-info"><a class="footer-logo" href="http://www.bth.se">

jozo95


01-16-2016


Hello jozo95,

Sorry, I hadn't seen the links without <img, which is why it didn't match properly.
Could you please try the following and let me know if this helps you.

Code:

awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file

Output will be as follows.

Code:

<a href="http://www.bth.se/web/nyheter.nsf/AllaDok?OpenView">
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
<a href="/web/kalendarium.nsf/sidor/52CB572F173DEE64C1257F3400428859?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/2743A2376777BC7BC1257F3400530744?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/0F9CAD034B2DD920C1257F3400533F5A?OpenDocument">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf/AllaDok?OpenView">
<a href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument">
<a href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView">
<a href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument">
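
For anyone who finds the one-liner dense, here is a rough sketch of the same idea spread over several lines with comments. The `sample` string and the function name `list_anchor_tags` are made up for illustration; the sample stands in for Input_file and is not the real page.

```shell
# Hypothetical sample standing in for Input_file (not the real page).
sample='<ul><li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a href="/web/kalendarium.nsf">Calendar</a></li><li><a href="">empty</a></li></ul>'

# list_anchor_tags: same logic as the awk one-liner above, read from stdin.
list_anchor_tags() {
    awk -F'[><]' '{
        # Splitting on < and > turns every tag body and text run into a field.
        for (i = 1; i <= NF; i++) {
            # Keep fields that look like an anchor whose href contains a slash.
            if ($i ~ /a href=.*\//) {
                print "<" $i ">"
            }
        }
    }'
}

printf '%s\n' "$sample" | list_anchor_tags
```

On this sample it prints the two anchors that have a slash in the href and skips the empty `href=""`, because the pattern requires at least one `/` after `a href=`.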

Thanks,
R. Singh


RavinderSingh13


01-16-2016


Any href:

Code:

perl -nle 'while(/(href="[^"]*")/g){print $1}' curl_href

Code:

[...]
href="#"
href="#"
href="#"
href="#"
href="#"
href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
href="http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
href="http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
href="http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Hrefs starting with / or http:

Code:

perl -nle 'while(/href=("(?:http|\/)[^"]*")/g){print $1}' curl_href

Code:

[...]
"http://greencharge.se/?p=5691"
"/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
"/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
"http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
"/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
"/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
"/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
"/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
"http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
"http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
"http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Only domain names:

Code:

perl -nle 'while(m|href="(http://[^/"]*)|g){print $1}' curl_href

Code:

[...]
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.youtube.com
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://singingsingapore.wordpress.com
http://singingsingapore.wordpress.com
[...]

Unique domain names:

Code:

perl -nle 'while(m|href="(http://[^/"]*)|g){$sites{$1}++}END{for(keys %sites){print $_}}' curl_href

Code:

http://twitter.com
http://www.bth.se
http://greencharge.se
http://www.flickr.com
http://singingsingapore.wordpress.com
http://edu.bth.se
http://careergate.bth.se
http://se.linkedin.com
http://www.youtube.com
http://www.facebook.com
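
If Perl is not an option, roughly the same "unique domain names" result can probably be had with grep, sed and sort. This is only a sketch: the `sample` text and the function name `unique_hosts` are invented for the example, and it assumes a grep that supports `-o` (as already used earlier in this thread).

```shell
# Hypothetical sample standing in for the curl_href file used above.
sample='<a href="http://www.bth.se/jobb">a</a><a href="http://edu.bth.se/x">b</a>
<a href="/web/local">c</a><a href="http://www.bth.se/bib">d</a>'

# unique_hosts: print each distinct http://host found in an href, sorted.
unique_hosts() {
    # grep -o prints each match on its own line; [^/"]* stops at the first
    # slash or closing quote, so only the scheme and host are matched.
    # sed then drops the leading href=" and sort -u removes repeats.
    grep -o 'href="http://[^/"]*' | sed 's/^href="//' | sort -u
}

printf '%s\n' "$sample" | unique_hosts
```

Unlike the Perl hash, `sort -u` returns the hosts in sorted order. Like the Perl pattern, it only looks at `http://` links, so `https://` and relative links are left out.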

Last edited by Aia; 01-16-2016 at 03:21 PM..


Aia


01-16-2016


Quote:

Originally Posted by RavinderSingh13

Hello jozo95,

Sorry, I hadn't seen the links without <img, which is why it didn't match properly.
Could you please try the following and let me know if this helps you.

Code:

awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file

Thanks,
R. Singh


That works well.

I solved it using this code:

Code:

grep -o '<a href="[a-z]\+[^>"]*' | sed -ne 's/^<a href="\(.*\)/\1/p'
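
As to what was wrong with the first sed attempt, a likely explanation: sed's `s` command replaces only the part of the line that matched, so everything before `href="` is printed as well, and the trailing `.*` swallows the rest of the line, so at most one link per line comes out. Fetched HTML often has many links on one line, which is where `grep -o` helps, since it prints every match on its own line. The sketch below compares the two on a made-up one-line sample; the function names are only for illustration, and `\+` assumes GNU grep.

```shell
# Hypothetical one-line sample; real curl output often has many links per line.
sample='<li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a href="http://www.bth.se/bib">Library</a></li>'

# first_attempt: the sed command from the opening post, reading stdin.
first_attempt() {
    sed -n 's/href="\([^"]*\).*/\1/p'
}

# final_pipeline: the grep | sed combination from this post, reading stdin.
final_pipeline() {
    grep -o '<a href="[a-z]\+[^>"]*' | sed -ne 's/^<a href="\(.*\)/\1/p'
}

printf '%s\n' "$sample" | first_attempt    # prints: <li><a http://www.bth.se/jobb
printf '%s\n' "$sample" | final_pipeline   # prints both URLs, one per line
```

Note that `[a-z]\+` requires the link to start with a lowercase letter, so relative links beginning with `/` are skipped by this pipeline.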

---------- Post updated at 04:14 PM ---------- Previous update was at 04:12 PM ----------

Quote:

Originally Posted by Aia

Any href:

Unfortunately I don't know Perl yet, but thanks for your input anyway, much appreciated.

jozo95


01-16-2016


If you can use lynx, then

Code:

lynx -dump URL

produces a good text rendering of a page. Every link on the page is listed in the References section at the end of the output.
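
To keep only the links from that References section, something like the following might do. It is a sketch that assumes the dump ends with a "References" heading followed by numbered lines; the `sample` text is made up so the snippet runs without lynx, and `dump_links` is a hypothetical helper name. Lynx also has a `-listonly` option that, combined with `-dump`, prints just the link list.

```shell
# Made-up text in the shape of lynx -dump output (not a real page).
sample='   Welcome to the example page.

References

   1. http://www.bth.se/jobb
   2. http://www.bth.se/bib'

# dump_links: print the URLs listed after the "References" heading.
dump_links() {
    awk '
        # The heading marks where the link list starts.
        /^References/ { in_refs = 1; next }
        # Numbered lines look like "1. URL", so the URL is the second field.
        in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }
    '
}

printf '%s\n' "$sample" | dump_links
# With lynx installed, the real call would be: lynx -dump URL | dump_links
```

Because lynx resolves links while rendering, relative hrefs should come out as absolute URLs, which the regex approaches earlier in the thread do not do.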

yazu
