script
Hello, I have been searching for a way to extract URLs from Google cache search results.
I have a file with a list of URLs like this: "http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8". What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the last : and the +, and remove the Google cache prefix so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep as they are. Any help would be appreciated.
This is the bash script I am using. It searches Google and gives you a list of URLs, strips out everything but the links, and writes them to a file.
Code:
#!/bin/bash
#
# google.sh
# ---------
# Automatic Google search from the command line.
#
# Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
    # If no keyword is entered, ask the user to try again
    echo "you didn't tell me what to search....try again"
else
    # url variable with the maximum search results (100) per page
    url='http://google.ca/search?num=100&hl=en&safe=off&q='
    appended=0
    for searchTerm in "$@"
    do
        # Replace white spaces in the search terms
        searchTerm=$(echo "$searchTerm" | sed 's/ /%20/g')
        # Wrap each term in %22 (an encoded double quote)
        url="$url%22$searchTerm%22"
        if [ "$appended" -lt $(( $# - 1 )) ]
        then
            # More terms follow: join them with +
            url="$url+"
        else
            # Last term: add the remaining query parameters
            url="$url&btnG=Google+Search&meta="
        fi
        appended=$((appended + 1))
    done
    # Quote the URL so the shell does not interpret & or ?
    lynx -dump "$url" > googleresult1
    # This pipeline extracts only the URLs
    sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/ .*//' >> googleresults2
    rm googleresult1
    cat googleresults2
    # Drop the lines that contain "google"
    sed -e '/google/d' googleresults2 >> urls.txt
fi
The sed command at the end removes the results with google.com in them, which are the links to the following pages of results. I have tried sed -n '/:/,/+/p' url.txt, but there are three colons in the cache URL and I need the text between the third : and the +.
Last edited by mike171562; 07-15-2007 at 12:32 PM.
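One possible approach to the question above, as a minimal sketch rather than a definitive answer: a sed substitution that matches the whole cache URL and keeps only the part after the cache ID. Note that sed -n '/:/,/+/p' selects a range of lines, not text within a line, so it cannot pick out part of a URL. The function name strip_google_cache is hypothetical, and the sketch assumes one URL per line, that cache URLs always contain "q=cache:<id>:", and that the real URL ends at the first + or &.

```shell
#!/bin/bash
# strip_google_cache (hypothetical name): reads URLs on stdin, one per line.
# Lines that look like Google cache URLs are reduced to the cached page's URL;
# every other line is printed unchanged, because the substitution does not match.
strip_google_cache() {
    # [?&]q=cache:   the cache query parameter
    # [^:+&]*:       the cache ID and the colon that follows it
    # ([^+&]*)       the real URL, up to the first + or &
    sed -E 's#^.*[?&]q=cache:[^:+&]*:([^+&]*).*$#\1#'
}

# Example run with one cache URL and one normal URL
printf '%s\n' \
    'http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8' \
    'http://www.example.com/page.html' |
    strip_google_cache
```

The example run prints www.worldwidewords.org/qa/qa-shi3.htm followed by http://www.example.com/page.html. To clean a whole list you could run something like strip_google_cache < urls.txt > clean_urls.txt, where clean_urls.txt is just an example output name. Because the filter relies on the "cache:" marker, it would have to run before any step that deletes lines containing "google" if those lines are also cache results you want to keep.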