The UNIX and Linux Forums  
  #1 (permalink)  
05-23-2006
ropers (Registered User)
sed, grep, awk, regex -- extracting a matched substring from a file/string

OK, I'm stumped and can't seem to find relevant info.
(I may even have asked something similar before; I'm not sure.)

I'm trying to use shell scripting/UNIX commands to extract URLs from a fairly large web page. Ultimately I want to wrap this in PHP with exec() and include the URLs in a web page that I generate for myself.

Here's what I have so far:

I'm fetching the page with cURL:
Code:
$ curl -s http://archive.wbai.org/
I'm then grepping all the lines that include the (case-insensitive) string "talkback":
Code:
$ curl -s http://archive.wbai.org/ | grep -i talkback
NB: I wanted to grep for either "talk back" or "talkback", and the docs I found said that I could use "talk*back", wherein the asterisk would signify zero or more characters, but I can't seem to get this to work.
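
On the "talk*back" point: in a grep regular expression the asterisk repeats the item just before it (here the "k") zero or more times; it is not a shell-style wildcard. A minimal sketch of an alternative follows, assuming the only variation is an optional single space between "talk" and "back". The sample variable and its three lines are made up for illustration and merely stand in for the curl output.

```shell
# Hypothetical sample input standing in for the fetched page.
sample='Talk Back with host
talkback archive
talk radio feedback'

# "talk*back" would mean: "tal", zero or more "k", then "back".
# With -E (extended regex), " ?" means: an optional single space.
matches=$(printf '%s\n' "$sample" | grep -iE 'talk ?back')
printf '%s\n' "$matches"
```

Against the real page, the same pattern would simply replace the plain string after the pipe, i.e. grep -iE 'talk ?back' instead of grep -i talkback.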

Having gotten this far, I have a number of strings like this one (which is supposed to be one single line):
Code:
<td width="5%" valign="top" bgcolor="#767676" align="center"> \
<span class=headline3>1 \
<td width="10%" valign="top" bgcolor="#EFEFEF"><span class=archivelink> \
<a href="pls.php?mp3fil=4541"><u>Play</u></a></span>&nbsp; \
<span class=archivelink> \
<a href="http://archive.wbai.org/files/mp3/060222_150002talkback.MP3"> \
<u>Download</u></a>
I now want to extract just the second URL, the one with the .mp3 file.
I have tried to match
Code:
href*\.MP3
and then to somehow only get the URLs printed, but it just doesn't seem to work.
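
For the same reason as with "talk*back", href*\.MP3 reads as "hre", zero or more "f", then a literal ".MP3", so it cannot span the URL. One possible way to print only the matching part of a line is grep's -o option. This is a sketch under assumptions: -o is available in GNU and BSD grep but is not required by POSIX, the wanted URLs are assumed to start with http:// and end in talkback.mp3 (any case, optional space), and the line variable is just part of the sample above joined onto one line for illustration.

```shell
# Part of the sample line from the post, joined onto a single line.
line='<a href="pls.php?mp3fil=4541"><u>Play</u></a></span>&nbsp; <span class=archivelink> <a href="http://archive.wbai.org/files/mp3/060222_150002talkback.MP3"> <u>Download</u></a>'

# -o prints only the matched text; [^"]* keeps the match inside one quoted value.
url=$(printf '%s\n' "$line" | grep -oiE 'http://[^"]*talk ?back\.mp3')
printf '%s\n' "$url"
```

Because the pattern is anchored on the URL's own shape rather than on a field position, it does not depend on where in the line the link appears.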

NB: I am aware that if I got awk to work right, I might no longer need to grep. For now, I'm still grepping though. Also, I heard that first grepping and then awking might hypothetically be a tiny bit quicker as allegedly, reportedly grep is (reputed to be) somewhat faster than awk. (Comments welcome.)
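
As a rough sketch of the awk-only route mentioned here, match() can locate the URL and substr() can print it. The assumptions are the same as before (URLs start with http:// and end in talkback.mp3, optional space), the line variable is illustrative, and only the first match on each line is printed. Nothing here says anything about speed relative to grep.

```shell
# Part of the sample line from the post, joined onto a single line.
line='<a href="pls.php?mp3fil=4541"><u>Play</u></a></span>&nbsp; <span class=archivelink> <a href="http://archive.wbai.org/files/mp3/060222_150002talkback.MP3"> <u>Download</u></a>'

url=$(printf '%s\n' "$line" | awk '{
    lower = tolower($0)                      # lowercase copy for case-insensitive matching
    if (match(lower, /http:\/\/[^"]*talk ?back\.mp3/))
        print substr($0, RSTART, RLENGTH)    # cut the same span out of the original line
}')
printf '%s\n' "$url"
```

Lines without a matching URL produce no output, so no separate grep step is needed in front of it.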

I am also aware that it's probably perfectly possible (and conceivably even quicker) to do all of this in PHP. I haven't tried that because (a) my PHP skills blow even harder than my scripting skills and (b) I'd really like to know how to do this kind of manipulation using "standard" UNIX shell commands. (Yes, yes, I know, a lot of people consider PHP "standard" as well, and one can even write php shell scripts, etc. etc... but PLEASE, have mercy on my soul.)

I'd be extremely grateful for any help on this.

Edit:
PS: I should add that I want to make as few assumptions about the source page as possible, so I don't want to just extract the nth $something (say, $9) with awk, because I don't want to assume that the talkback .mp3 URL always stays in the same place.
  #2 (permalink)  
05-23-2006
vino (Forum Staff)
How about

Code:
sed -n -e "s_.*a href=.\([^\"]*[Tt][Aa][Ll][Kk][^\"]*[Bb][Aa][Cc][Kk]\.[Mm][Pp]3\).*_\1_p"
It captures everything inside the quotes after a href=, up to and including the .mp3 extension in any letter case. The [^\"]* between "talk" and "back" allows any run of non-quote characters, so it should handle "talk back" as well as "talkback", and the letter classes make it case-insensitive.
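
To try an expression like this without fetching the page, it can be fed a sample line. The sketch below uses a form of the expression in which [^"]* sits between "talk" and "back" (so a separator such as a space or underscore is allowed) and \. stands for a literal dot before the extension. The line variable is illustrative, and single quotes are used around the sed script so the double quotes inside it need no escaping.

```shell
# Part of the sample line from the first post, joined onto a single line.
line='<span class=archivelink> <a href="pls.php?mp3fil=4541"><u>Play</u></a> <a href="http://archive.wbai.org/files/mp3/060222_150002talkback.MP3"> <u>Download</u></a>'

# -n suppresses default output; the trailing p prints only lines where the substitution succeeded.
url=$(printf '%s\n' "$line" |
    sed -n -e 's_.*a href=.\([^"]*[Tt][Aa][Ll][Kk][^"]*[Bb][Aa][Cc][Kk]\.[Mm][Pp]3\).*_\1_p')
printf '%s\n' "$url"
```

The first href on the line is skipped because [^"]* cannot cross its closing quote before reaching "talk".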
  #3 (permalink)  
05-23-2006
ropers (Registered User)
Thanks a bunch vino. Your kung fu is strong.
As for myself, I will actually continue to read and search for a while, until I am darn sure I fully understand everything. But now I definitely know which way to go! Thanks again!