The UNIX and Linux Forums  
  #1 (permalink)  
08-14-2008
iansocool
Registered User
Join Date: Aug 2008
Posts: 4
Finding what pages link to a specific file

First-time poster, so please excuse me in advance.

I have a web server running Linux, Apache, etc. I have a list of HTML pages that I want to delete because I think they are old. While I could delete them and then check for broken links, I'd like to be more proactive.

I want to write a shell script that will search all the pages in my site for links to the pages in my list.

Let's say I have a potential file to delete at www.fake-url.com/foo/bar/index.html

I can't just grep for it because the link can be written within pages in a number of ways:
1) It could be a full URL or a root-relative link (that's easy enough to search for)
2) It could be a relative link (for example, ../bar/index.html from a page in a sibling directory)!

I can't grep for "index.html" because there are multiple index pages. I've written some shell scripts, but searching for relative links like this seems overwhelming.

Hopefully my question makes some sense and I'm posting it in an appropriate place. I was thinking of writing my own script, but if you know of an existing script or program that does this, it would certainly be appreciated!
  #2 (permalink)  
08-21-2008
otheus (Forum Staff)
Moderator ala Mode
Join Date: Feb 2007
Location: Innsbruck, Austria
Posts: 1,886
I would approach this another way. Configure Apache's logging so that each entry records the referring page (%{Referer}i) AND the file being served (%f). Then crawl the site with a standard crawler (wget -r -l inf -nd --delete-after, followed by your site's URL). Next, analyze the logfile against your deletion-candidate list: any candidate that never shows up in the log was not reached by the crawl, so no crawlable page links to it. You can do something like:

Code:
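# Read one candidate path per line; IFS= and -r keep whitespace and backslashes intact
while IFS= read -r old_file; do
   # -q: no output, -F: treat the path as a literal string, not a regex
   grep -qF -- "$old_file" access.log || echo "Safe to delete: $old_file"
done < candidate-deletion-files.txt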
To be more precise, I'd make sure I'm matching against the correct field in the logfile (the served file, not the referrer). You can do this with awk, or by first extracting the file names from the logfile and running the grep above against that output.
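
The field-level check with awk might look roughly like the sketch below. It is only an illustration under several assumptions that are not part of the thread: the crawl is logged to a dedicated file using a hypothetical format such as LogFormat "%{Referer}i\t%f" linkcheck (referrer, a tab, then the served file), the candidate list holds one filesystem path per line exactly as %f would log it, and the function names unlinked_candidates and referers_of are made up for the example.

```shell
#!/bin/sh
# Sketch only. Assumes every log line is "<referer><TAB><served file path>",
# e.g. written by a hypothetical:  LogFormat "%{Referer}i\t%f" linkcheck

# unlinked_candidates LOG CANDIDATES
# Prints each candidate path that never appears as a served file in LOG.
unlinked_candidates() {
    # While reading the first file (the log), remember field 2 (served file).
    # While reading the second file (the list), print paths never seen.
    awk -F '\t' '
        FILENAME == ARGV[1] { served[$2] = 1; next }
        !($0 in served)     { print "Safe to delete: " $0 }
    ' "$1" "$2"
}

# referers_of LOG FILE
# Prints the distinct referring pages logged for one served file.
# A referer of "-" means the request carried no Referer header.
referers_of() {
    awk -F '\t' -v target="$2" \
        '$2 == target && $1 != "-" { print $1 }' "$1" | sort -u
}
```

After the crawl, something like unlinked_candidates linkcheck.log candidate-deletion-files.txt would list the candidates that were never served, while referers_of linkcheck.log followed by a path would show which pages link to a file that is still in use. Comparing the whole field rather than grepping the whole line avoids a false match when a candidate path also appears inside a referrer or is a prefix of another path.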
  #3 (permalink)  
08-26-2008
iansocool
Registered User
Join Date: Aug 2008
Posts: 4
That's a different approach. I think it makes sense. Thanks!