Finding what pages link to a specific file

08-14-2008

Registered User

4, 0

Join Date: Aug 2008

Last Activity: 30 June 2009, 7:45 PM EDT

Posts: 4

Thanks Given: 0

Thanked 0 Times in 0 Posts

Finding what pages link to a specific file

First time poster (so please excuse me in advance)

I have a webserver running linux, apache, etc. I have a list of HTML webpages that I want to delete because I think they are old. While I could delete them then check for broken links, I'd like to be more pro-active.

I want to write a shell script that will search all the pages in my site for links to the pages in my list.

Let's say I have a potential file to delete at www.fake-url.com/foo/bar/index.html

I can't just grep for it because the page can be written within pages in a number of way:
1) It could be a full or root relative (that's easy enough to search for)
2) It could be a relative link!

I can't grep for "index.html" because there are multiple index pages. I've written some shell scripts, but searching for relative links like this seems overwhelming.

Hopefully my question makes some sense and I'm posting it in an appropriate place. I was thinking of writing my own script, but you know of an existing script or program that does this it would certainly be appreciated!

iansocool

View Public Profile for iansocool

Find all posts by iansocool

08-21-2008

Registered User

2,157, 51

Join Date: Feb 2007

Last Activity: 6 September 2017, 5:43 AM EDT

Location: Innsbruck, Austria

Posts: 2,157

Thanks Given: 12

Thanked 51 Times in 48 Posts

I would approach this in another way. Configure Apache for logging so that you can see the referring page (%{Referer}i) AND the file being served (%f). Now, crawl the site using a standard crawler (wget -r -l inf -nd --delete-after ). Next, analyze the logfile against your deletion-candidate list. You can do something like:

Code:

cat candidate-deletion-files.txt |
while read old_file;   do 
   grep -- "$old_file" access.log >/dev/null || echo "Safe to delete: $old_file"
done

To be more particular, I'd make sure I'm looking at the correct field in the logfile. You can do this with awk, or by first extracting the file names from the logfile and searching that output using the grep above.

otheus

View Public Profile for otheus

Find all posts by otheus

08-26-2008

Registered User

4, 0

Join Date: Aug 2008

Last Activity: 30 June 2009, 7:45 PM EDT

Posts: 4

Thanks Given: 0

Thanked 0 Times in 0 Posts

That's a different approach. I think it makes sense. Thanks!

iansocool

View Public Profile for iansocool

Find all posts by iansocool

Shell Programming and Scripting

Finding what pages link to a specific file

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Finding specific string in file and storing in another file

Discussion started by: aankita30

2. Shell Programming and Scripting

Finding duplicates in a file excluding specific pattern

Discussion started by: shiva2985

3. Shell Programming and Scripting

Finding 4 current files having specific File Name pattern

Discussion started by: lancesunny

4. Shell Programming and Scripting

finding file with a specific range

Discussion started by: sujit_kashyap

5. UNIX for Advanced & Expert Users

Finding a specific range of character in file

Discussion started by: techmoris

6. Shell Programming and Scripting

Finding a specific word

Discussion started by: rpatty

7. Shell Programming and Scripting

add newline in file after finding specific text

Discussion started by: jxh461

8. Shell Programming and Scripting

Finding file in specific subdirectories

Discussion started by: The_Archer

9. UNIX for Dummies Questions & Answers

help with finding specific files

Discussion started by: linuxlaptop

10. UNIX for Dummies Questions & Answers

finding specific values in a within a file

Discussion started by: Gerry405