script


 
# 1  
Old 07-15-2007
Hello, I have been searching for a way to extract URLs from Google cache search results.

I have a file with a list of URLs like this:

""http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8""

What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the : and the +, and remove the Google cache URL from the list, so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep.

Any help would be appreciated.
# 2  
Old 07-15-2007
This is the bash script I am using. It searches Google, strips everything but the links from the results, and writes the resulting list of URLs to a file.

Code:
#!/bin/bash
#
# google.sh
# ---------
#  Automatic Google search from the command line.
#
#    Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
  # If no keyword is entered echo try again
  #
  echo "you didnt tell me what to search....try again"
else
  #url variable with the maximum search results (100) per page
  #
  url='http://google.ca/search?num=100&hl=en&safe=off&q='

  appended=0
  for searchTerm in "$@"
  do
    # Replace white spaces in the search terms
    #
    searchTerm=$(echo "$searchTerm" | sed 's/ /%20/g')

    url="$url%22$searchTerm%22"

    if [ "$appended" -lt $(($# - 1)) ]
    then
      url="$url"\+
    else
      url="$url"\&btnG\=Google\+Search\&meta\=
    fi

    let "appended+=1"
  done

  lynx -dump "$url" >> googleresult1
  sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/\ .*//g' >> googleresults2 # this command extracts only the URLs
  rm googleresult1
  cat googleresults2
  sed -e '/google/d' googleresults2 >> urls.txt
fi

The sed command at the end removes every line containing "google"; these are the links to the following pages of results.
I have tried sed -n '/:/,/+/p' url.txt, but there are three colons in the cache URL and I need the text between the third : and the +.
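A note on why that attempt prints more than expected: `/:/,/+/` is a line-range address, so sed selects whole lines, from a line containing `:` through the next later line containing `+`, rather than text inside one line. A small sketch of that behaviour, using made-up sample lines (`host.example` and `www.example.com` are placeholders, not real results):

```shell
#!/bin/sh
# Made-up sample input: only the second line contains ':' and '+'.
sample='plain line
http://host.example/search?q=cache:ID:www.example.com/page+term
another line'

# The range starts at the line containing ':'; the end pattern is only
# checked on the lines AFTER it, so whole lines keep being printed.
range_output=$(printf '%s\n' "$sample" | sed -n '/:/,/+/p')
printf '%s\n' "$range_output"
```

The cache URL comes out untouched, together with the line after it, which is why a substitution (`s/.../.../`) is needed to cut text out of a line.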

Last edited by mike171562; 07-15-2007 at 12:32 PM..
# 3  
Old 07-15-2007
Try this:

Code:
line='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

echo "$line" | sed 's/\(.*\):www\(.*\)+\(.*\)/www\2/'

Regards
# 4  
Old 07-15-2007
Quote:
Originally Posted by Franklin52
Try this:

Code:
line='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

echo "$line" | sed 's/\(.*\):www\(.*\)+\(.*\)/www\2/'

Regards
Three absolute wildcards and no anchors? It may work, but personally I would not consider using that code. Also, why copy to a buffer what you don't use?

Code:
echo "$url_to_strip" | awk -F'[:+]' '{print $4}'

If this does not give the correct result, use nawk instead of awk, as you may have an 'old awk' on some systems, e.g. Solaris.
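To see why `$4` is the field that holds the address, it can help to print every field the `[:+]` separator produces. A quick sketch using the sample URL from this thread (the variable name `fields` is only for illustration):

```shell
#!/bin/sh
url_to_strip='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

# Split on ':' or '+' and print each field with its number, one per line.
fields=$(printf '%s\n' "$url_to_strip" |
  awk -F'[:+]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"
```

Field 1 is `http`, field 2 runs up to `cache`, field 3 is the cache ID, and field 4 is the cached address. Any extra `:` before the address would shift that position.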
# 5  
Old 07-15-2007
Quote:
Originally Posted by reborg
Three absolute wildcards and no anchors? It may work, but personally I would not consider using that code. Also, why copy to a buffer what you don't use?

Code:
echo "$url_to_strip" | awk -F'[:+]' '{print $4}'

If this does not give the correct result, use nawk instead of awk, as you may have an 'old awk' on some systems, e.g. Solaris.
I may be wrong, as I don't know how the addresses are formatted, but this works only when there are exactly three colons before the address.

Regards
# 6  
Old 07-15-2007
That is the format of a Google cache entry.

However, in the interest of being more flexible (note that the match is not a generic wildcard):

Code:
echo "$url_to_strip" | sed -n 's_.*:\(www[^+]*\)+.*_\1_p'
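The same substitution could also be run over the whole list. Without `-n` and the `p` flag, sed prints non-matching lines unchanged, so the normal URLs stay in the list. A rough sketch, assuming one URL per line; the second sample URL and the file names in the comment are placeholders:

```shell
#!/bin/sh
# Sample list: one cache URL from this thread and one placeholder normal URL.
# For a real file, something like:  sed '...' urls.txt > clean_urls.txt
cleaned=$(sed 's_.*:\(www[^+]*\)+.*_\1_' <<'EOF'
http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8
http://www.example.com/page.html
EOF
)
printf '%s\n' "$cleaned"
```

This only rewrites lines where `:www` is followed later by a `+`, so a cached address that does not start with www would be left as it is.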

# 7  
Old 07-15-2007
Thanks for the replies, everyone. I know the above script is sloppy, but I'm not very good at scripting yet. I'm using it to list URLs containing certain words, to import into a web spider for my company web filter. I decided the easiest way would be to add an extra line that filters out lines with "cache" in them, something like:
sed -e '/cache/d' googleresults2 >> urls.txt

Does anyone know of a good web spider script that is keyword based?