Remove external urls from .html file

02-17-2011

Hi everyone. I have an html file with lines like so:
link href="localFolder/...">
link href="http://...">
img src="localFolder/...">
img src="http://...">

I want to remove the link tags with http in their href and the img tags with http in their src. I'm having trouble removing them because there can be multiple attributes in a tag.
There can also be multiple link and img tags on the same line of the html file, so I can't just remove the entire line.
1) Bash
2) Linux Ubuntu
Thanks!

CowCow339

02-17-2011

Make your question easier to answer by posting a sample of the input data and the desired output.

Shell_Life

02-17-2011

Sorry, it's a little hard because the site doesn't want me to include html tags.

Input (I deliberately misspelled link, img, and a so that the forum would let me post the tags):
<linkk rel="stylesheet" type="text/css" href="localFolder/my.css"> <linkk rel="stylesheet" type="text/css" href="http://www.noob.com">
<imgg src="localFolder/sad.jpg"> <imgg src="http://www.noob.com/sad.jpg">
<aa href="http://www.google.com">

Output:
<linkk rel="stylesheet" type="text/css" href="localFolder/my.css">
<imgg src="localFolder/sad.jpg">
<aa href="http://www.google.com">

----------
Essentially, I want to remove an entire tag (everything from < to >) if there is an http URL inside it, except for <a href> tags, which should stay.

Last edited by CowCow339; 02-17-2011 at 02:33 PM..

CowCow339

02-17-2011

See if this works for you:

Code:

sed '/aa href/!s/ <.*http.*>//' input_file

Shell_Life

02-17-2011

Hmmm. I see two ways to deal with this:

1) Use a full-fledged HTML parser. A good starting point would be Perl's HTML::Parser module. You could load the HTML file, hunt through the tree of tags for the things you want changed, alter them, and write the result back out. This is the proper way.
2) Fold, spindle, and mutilate the HTML into something that can be processed line by line.

The script below takes the second approach. It is very quick and dirty, highly inefficient, and most decidedly not a full-fledged HTML parser. It works for my test cases, but it has limitations: URLs containing ' or " will confuse it, and some fancy meta-tags may confuse it. If any step in the process produces lines longer than sed or your shell can handle, it may explode in a giant fiery ball.

Code:

#!/bin/bash

# Add %@% to the end of every line, turn newline into space,
# and add newlines at the very beginning and end of each HTML tag.
# Then we'll get each tag on one line while we READ line by line.
sed 's/$/%@%/' | tr '\n' ' ' | sed 's/</\n</g;s/>/>\n/g' |
while IFS="" read -r LINE
do
        case "${LINE:0:2}" in
        "</")
#               echo "<!-- Close tag -->"
                echo "${LINE}"
                ;;
        "<"*)
                read -r TAGTYPE G <<< "${LINE:1}"

                # Feed different things into sed depending on what tag we got
                case "${TAGTYPE}" in
                [iI][mM][gG])
                        REPLACE="[sS][rR][cC]"
                        WITH="src"
                        ;;
                [aA])
                        REPLACE="[hH][rR][eE][fF]"
                        WITH="href"
                        ;;
                *)
                        REPLACE=""
                        ;;
                esac

                if [ -z "${REPLACE}" ]
                then
                        #echo "<!-- No substitution -->"
                        echo "${LINE}"
                else
                        echo "${LINE}" | sed "s#${REPLACE}=['\"][^'\"]*['\"]#${WITH}=''#"
                fi

                ;;
        *)
                #echo "<!-- Raw text -->"
                echo "${LINE}"
                ;;
        esac
# Delete all newlines, then change %@% back into newlines
done | tr -d '\n' | sed 's/%@%/\n/g'

It reads on stdin and writes to stdout.
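
To make the folding step less mysterious, here is a small sketch of just the first stage of that pipeline, run on a made-up two-line input. The function name fold is hypothetical and only exists for this illustration. It assumes GNU sed, which understands \n in the replacement text.

```shell
#!/bin/bash
# fold (hypothetical name): the first stage of the pipeline above.
# 1. mark every original line end with %@%
# 2. join the whole file onto one line
# 3. put a newline before each < and after each >, so each tag
#    ends up alone on its own line (GNU sed assumed for \n)
fold() {
    sed 's/$/%@%/' | tr '\n' ' ' | sed 's/</\n</g;s/>/>\n/g'
}

# Made-up sample input: two lines, three tags.
printf '<p>one <img src="x.jpg">\ntwo</p>\n' | fold
```

Each tag now sits on a line by itself, which is what lets the while loop look at one tag per read. The %@% markers remember where the original line breaks were so the last stage can put them back.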

Neither method really ends up being very easy. I suspect there's a whole new language waiting to be made to deal with this.

---------- Post updated at 01:52 PM ---------- Previous update was at 01:41 PM ----------

Quote:

Originally Posted by Shell_Life

See if this works for you:

Code:

sed '/aa href/!s/ <.*http.*>//' input_file

Correction to what I first wrote: that will not strip out all URLs. It will, however, also strip tags in the wrong cases, for example when a tag merely has a title containing a URL or something similar.
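
One way to narrow the match, as a rough sketch rather than a real fix: match one tag at a time with [^>]* instead of .*, and require the http to be at the start of the href or src value itself. This assumes GNU sed, double-quoted attribute values, lowercase tag and attribute names, and tags that do not span lines. The function name strip_external is made up for this example, and it uses the real link and img names rather than the misspelled ones from the sample.

```shell
#!/bin/bash
# strip_external (hypothetical name): delete whole <link ...> and <img ...>
# tags whose href/src value starts with http:// or https://.
# [^>]* cannot cross a '>', so each match stays inside a single tag,
# and <a href="http://..."> is never touched.
strip_external() {
    sed -E 's#<(link|img)([[:space:]][^>]*)?[[:space:]](href|src)="https?://[^"]*"[^>]*>##g'
}

# Sample input modelled on the one posted above, with real tag names.
printf '%s\n' \
    '<link rel="stylesheet" href="localFolder/my.css"> <link rel="stylesheet" href="http://www.noob.com">' \
    '<img src="localFolder/sad.jpg"> <img src="http://www.noob.com/sad.jpg">' \
    '<a href="http://www.google.com">' |
strip_external
```

The space that separated two tags is left behind, and anything unusual (single quotes, unquoted values, a > inside an attribute) will still get past it, so the caveats about not being a real HTML parser apply here too.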

Last edited by Corona688; 02-17-2011 at 04:28 PM..

Corona688
