The UNIX and Linux Forums  

#1  07-15-2007  mike171562 (Registered User)
script

Hello, I have been searching for a way to extract URLs from Google cache search results.

I have a file with a list of URLs like this:

http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8

What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the third : and the +, and use it in place of the Google cache URL, so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep.

Any help would be appreciated.
#2  07-15-2007  mike171562 (Registered User)
This is the bash script I am using. It searches Google, strips everything but the links from the results, and writes the list of URLs to a file.


Code:
#!/bin/bash
#
# google.sh
# ---------
#  Automatic Google search from the command line.
#
#    Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
  # If no keyword is entered, ask the user to try again
  #
  echo "you didn't tell me what to search....try again"
else
  #url variable with the maximum search results (100) per page
  #
  url='http://google.ca/search?num=100&hl=en&safe=off&q='

  appended=0
  for searchTerm in "$@"
  do
    # Replace white spaces in the search terms with %20
    #
    searchTerm=$(echo "$searchTerm" | sed 's/ /%20/g')

    url="$url%22$searchTerm%22"

    if [ $appended -lt `expr $# - 1` ]
    then
      url="$url"\+
    else
      url="$url"\&btnG\=Google\+Search\&meta\=
    fi

    let "appended+=1"
  done

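  # Overwrite the temporary files so results from earlier runs are not re-added
  lynx -dump "$url" > googleresult1
  # Put each URL at the start of its own line, keep only the lines that
  # contain http, and drop everything after the first space
  sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/ .*//' > googleresults2
  rm googleresult1
  cat googleresults2
  # Drop Google's own links and append the remaining URLs to the list
  sed -e '/google/d' googleresults2 >> urls.txt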
fi

The sed command at the end removes the results that contain google, which are the links to the following pages of results.
I have tried sed -n '/:/,/+/p' urls.txt, but there are three colons in the cache URL and I need the text between the third : and the +.

Last edited by mike171562; 07-15-2007 at 12:32 PM..
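A note on why the range attempt prints whole lines rather than part of one: in sed, /:/,/+/ is an address range that selects lines, starting at a line containing : and ending at a later line containing +. A minimal sketch, assuming a made-up three-line sample (the variable names are illustrative only):

```shell
#!/bin/bash
# Hypothetical three-line sample; only the middle line is a cache URL.
sample='plain line
http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en
another line'

# /:/,/+/ is a line range: it opens at the first line containing ":" and
# closes at a later line containing "+". Every line inside the range is
# printed whole; nothing is cut out of the middle of a line.
range_output=$(printf '%s\n' "$sample" | sed -n '/:/,/+/p')
printf '%s\n' "$range_output"
```

Pulling text out of the middle of a single line calls for a substitution (s/.../.../) rather than a line range.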
#3  07-15-2007  Franklin52 (Moderator)
Try this:


Code:
line='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

echo "$line" | sed 's/\(.*\):www\(.*\)+\(.*\)/www\2/'

Regards
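To see how the unanchored .* groups in the substitution above behave, here is a small sketch. The wrapper name strip_greedy and the second URL (a made-up variant with two quoted search terms) are assumptions for illustration, not part of the thread.

```shell
#!/bin/bash
# strip_greedy is a hypothetical wrapper around the substitution posted above.
strip_greedy() {
  printf '%s\n' "$1" | sed 's/\(.*\):www\(.*\)+\(.*\)/www\2/'
}

# The URL from the thread: exactly one "+" follows the cached address.
one_term='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

# Made-up variant with two search terms, so two "+" signs follow the address.
two_terms='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22foo%22+%22bar%22&hl=en'

strip_greedy "$one_term"    # prints www.worldwidewords.org/qa/qa-shi3.htm
strip_greedy "$two_terms"   # prints www.worldwidewords.org/qa/qa-shi3.htm+%22foo%22
```

Because \(.*\)+ is greedy, it runs to the last + on the line, so part of the query string survives whenever the search had more than one term.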
#4  07-15-2007  reborg (Administrator)
Three absolute wildcards and no anchors? It may work, but personally I would not consider using that code. Also, why capture into buffers what you don't use?


Code:
echo "$url_to_strip" | awk -F'[:+]' '{print $4}'

If this does not give the correct result, use nawk instead of awk, as you may have an 'old awk' on some systems, e.g. Solaris.
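To make it clearer why field 4 is the one that matters, the sketch below prints every field awk produces with the [:+] separator. show_fields is a hypothetical helper name used only for this illustration.

```shell
#!/bin/bash
# show_fields is a hypothetical helper: number and print every field awk
# sees when the separator is "a colon or a plus sign".
show_fields() {
  printf '%s\n' "$1" | awk -F'[:+]' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

url='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

# Field 1 is "http", field 2 runs up to "cache", field 3 is the cache ID,
# field 4 is the cached address, field 5 is the rest of the query string.
show_fields "$url"
```

An ordinary URL such as http://www.example.com/page.html has only two fields under this separator, so $4 is empty for it and the one-liner prints a blank line.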
#5  07-15-2007  Franklin52 (Moderator)
I may be wrong, since I don't know how the addresses are formatted, but this only works when there are exactly three colons before the address.

Regards
#6  07-15-2007  reborg (Administrator)
That is the format of a Google cache entry.

However, in the interest of being more flexible (note that the match is not a generic wildcard):


Code:
echo "$url_to_strip" | sed -n 's_.*:\(www[^+]*\)+.*_\1_p'
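The original question also asked to keep the normal URLs in the list. One possible way to do that with the expression above is to drop the -n option and the p flag, so that lines which do not match pass through unchanged. This is a sketch under assumptions: strip_cache_urls is a hypothetical helper name, the second URL is a made-up example, and like the expression above it only recognises cached addresses that start with www.

```shell
#!/bin/bash
# strip_cache_urls is a hypothetical helper: rewrite cache URLs to the
# cached address and leave every other line untouched. It reads the files
# given as arguments, or standard input when there are none.
strip_cache_urls() {
  sed 's_.*:\(www[^+]*\)+.*_\1_' "$@"
}

# Mixed sample: the cache URL from the thread plus a made-up ordinary URL.
list='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8
http://www.example.com/page.html'

cleaned=$(printf '%s\n' "$list" | strip_cache_urls)
printf '%s\n' "$cleaned"
```

With a file, the same helper could be called as strip_cache_urls urls.txt > cleaned.txt, where cleaned.txt is again just an illustrative name.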
