Extracting urls from curl output


 
# 1  

Hello.

I use curl to fetch a website, and then I want to extract the URLs from curl's output.

I tried both sed and grep, but couldn't figure it out.

I've tried:
Code:
sed -n 's/href="\([^"]*\).*/\1/p' results.txt

and grep -o

Code:
grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">' results.txt

What pattern should I use, and what's wrong with mine?


EDIT:

Added some of the data I use

EDIT 2:
Removed the data sample because it ruins the thread width; just curl any website and use that output as data.

Last edited by jozo95; 01-16-2016 at 05:11 PM..
# 2  
Hello jozo95,

Could you please try the following and let me know if it helps?
Code:
awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Output will be as follows.
Code:
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">

Thanks,
R. Singh
# 3  
Quote:
Originally Posted by RavinderSingh13
Hello jozo95,

Could you please try following and let me know if this helps you.
Code:
awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Doesn't work :/

I get output like this along the way:

Code:
g">Studievägledning</a></li><li><a href="http://www.bth.se/jobb">Lediga tjänster</a></li></ul></div><div class="footer-section set"><h5><br></h5><ul class="footer-list"><li><a href="http://www.bth.se/web/ombth.nsf/sidor/organisation">Organisation</a></li><li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">BTH Career Gate</a></li><li><a href="http://www.bth.se/for/Sakerhet.nsf/sidor/593cb6bf948640dac1257f1f00365b42?OpenDocument">I händelse av kris</a></li><li><a href=""></a></li><li><a href=""></a></li></ul> </div></span></div></div></div><div class="footer-info"><a class="footer-logo" href="http://www.bth.se">

# 4  
Hello jozo95,

Sorry, I hadn't noticed the links without <img, which is why it didn't match properly.
Could you please try the following and let me know if it helps?
Code:
awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file

Output will be as follows.
Code:
<a href="http://www.bth.se/web/nyheter.nsf/AllaDok?OpenView">
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
<a href="/web/kalendarium.nsf/sidor/52CB572F173DEE64C1257F3400428859?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/2743A2376777BC7BC1257F3400530744?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/0F9CAD034B2DD920C1257F3400533F5A?OpenDocument">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf/AllaDok?OpenView">
<a href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument">
<a href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView">
<a href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument">

Thanks,
R. Singh
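
The reason splitting on < and > helps: with -F'[><]' every tag body becomes its own awk field, so several links on one physical line are tested one by one. Below is a minimal sketch on a made-up line; the sample HTML and the function name list_anchor_tags are assumptions for illustration, not data from the page above.

```shell
# Hypothetical sample: two links on one physical line, as in minified HTML.
sample='<li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a href="/web/kalendarium.nsf">Calendar</a></li>'

# Print every field that looks like an anchor tag whose href contains a slash.
list_anchor_tags() {
    awk -F'[><]' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /a href=.*\//)
                print "<" $i ">"
    }'
}

printf '%s\n' "$sample" | list_anchor_tags
```

Because the test requires a slash after href=, empty links such as href="" are skipped. A > character inside an attribute value would split a tag in two, so this is a quick filter rather than a real HTML parser.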
# 5  
Any href:
Code:
perl -nle 'while(/(href="[^"]*")/g){print $1}' curl_href

Code:
[...]
href="#"
href="#"
href="#"
href="#"
href="#"
href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
href="http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
href="http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
href="http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Hrefs starting with / or http:
Code:
perl -nle 'while(/href=("(?:http|\/)[^"]*")/g){print $1}' curl_href

Code:
[...]
"http://greencharge.se/?p=5691"
"/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
"/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
"http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
"/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
"/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
"/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
"/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
"http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
"http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
"http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Only domain names:
Code:
perl -nle 'while(m|href="(http://[^/"]*)|g){print $1}' curl_href

Code:
[...]
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.youtube.com
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://singingsingapore.wordpress.com
http://singingsingapore.wordpress.com
[...]

Unique domain names:
Code:
perl -nle 'while(m|href="(http://[^/"]*)|g){$sites{$1}++}END{for(keys %sites){print $_}}' curl_href

Code:
http://twitter.com
http://www.bth.se
http://greencharge.se
http://www.flickr.com
http://singingsingapore.wordpress.com
http://edu.bth.se
http://careergate.bth.se
http://se.linkedin.com
http://www.youtube.com
http://www.facebook.com


Last edited by Aia; 01-16-2016 at 03:21 PM..
# 6  
Quote:
Originally Posted by RavinderSingh13
Hello jozo95,

Sorry I haven't seen links without <img, so only it didn't match it properly.
Could you please try following and let me know if this helps you.
Code:
awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file


Thanks,
R. Singh

That works well.

I solved it using this code:

Code:
grep -o '<a href="[a-z]\+[^>"]*' results.txt | sed -ne 's/^<a href="\(.*\)/\1/p'
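
For reference, the sed attempt from post #1 most likely failed because sed works line by line and s/// without the g flag replaces only the first match. On a line holding several links it keeps the text before the first href and prints just one URL. grep -o first puts every match on a line of its own, which makes the sed step simple. Here is a small sketch on a made-up line; the sample data and the function name extract_links are assumptions, and [a-z][a-z]* is used as the portable spelling of [a-z]\+.

```shell
# Hypothetical one-line sample with two links, like minified curl output.
line='<li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">Career</a></li>'

# Attempt from post #1: one substitution per line, leading text is kept.
printf '%s\n' "$line" | sed -n 's/href="\([^"]*\).*/\1/p'

# Working approach: one match per line first, then strip the prefix.
extract_links() {
    grep -o '<a href="[a-z][a-z]*[^>"]*' | sed -n 's/^<a href="\(.*\)/\1/p'
}
printf '%s\n' "$line" | extract_links
```

The first command prints a single line that still starts with `<li><a `, while extract_links prints each URL on its own line. Since the pattern requires a letter right after the opening quote, relative links starting with / are not matched.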

---------- Post updated at 04:14 PM ---------- Previous update was at 04:12 PM ----------

Quote:
Originally Posted by Aia
Any href:
Unfortunately I don't know Perl yet, but thanks for your input anyway, much appreciated :)
# 7  
If you can use lynx, then
Code:
lynx -dump URL

produces good text output of a page. Every link on the page is listed in the References section at the end of the output.
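
If only the links are wanted, the References section can be cut out of the dump with awk. This is a sketch, assuming the dump ends with a line reading References followed by numbered entries such as "1. http://..."; that layout, the sample text, and the function name refs_only are assumptions, so check them against your lynx version.

```shell
# Print the URL column of the numbered entries after the References heading.
refs_only() {
    awk 'seen && $1 ~ /^[0-9]+\.$/ { print $2 }   # numbered entry: N. URL
         /^ *References *$/        { seen = 1 }'  # start collecting after the heading
}

# Hypothetical stand-in for the output of: lynx -dump URL
printf '%s\n' \
    'Some page text [1]Bibliotek and [2]Career Gate' \
    '' \
    'References' \
    '' \
    '   1. http://www.bth.se/bib' \
    '   2. http://careergate.bth.se/' |
    refs_only
```

With a real page this would be used as `lynx -dump URL | refs_only`.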
