Using a full-fledged HTML parser. A good starting point would be Perl's HTML::Parser module. You could load in the HTML file, hunt through the tags for the things you want changed, alter them, and write the result back out. This is the proper way.
Fold, spindle, and mutilate the HTML into something that can be processed line by line.
This is very quick and dirty, highly inefficient, and most decidedly not a full-fledged HTML parser. While it works for my test cases, it does have limitations. URLs containing ' or " will confuse it. Some fancy meta-tags may confuse it. If any step in the process produces lines longer than sed or your shell can handle, it may explode in a giant fiery ball.
It reads on stdin and writes to stdout.
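As a rough illustration of the fold-it-into-lines idea only (a Python sketch, not the sed/shell script described above), something like the following puts every tag on a line of its own, filters the lines, and glues the result back together. It assumes the goal is to drop `<a>` tags while keeping their anchor text, and that the byte `\x01` never occurs in the input; the names `fold`, `unfold`, `strip_anchor_lines` and `MARK` are made up for this example. Like any approach of this kind, it can be fooled by comments, scripts, or a stray `<` in the text.

```python
import re

# Placeholder for the document's own newlines (assumed absent from the input).
MARK = "\x01"

# One tag: '<' ... '>', where quoted attribute values may themselves contain '>'.
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

# An opening or closing <a> tag, but not <abbr>, <address>, and so on.
ANCHOR = re.compile(rf"</?a[\s>{MARK}]", re.IGNORECASE)


def fold(html_text):
    """Split HTML into 'lines', each holding one tag or one run of text."""
    flat = html_text.replace("\n", MARK)
    marked = TAG.sub(lambda m: "\n" + m.group(0) + "\n", flat)
    return [piece for piece in marked.split("\n") if piece]


def strip_anchor_lines(lines):
    """Drop every line that is an <a ...> or </a> tag; keep everything else."""
    return [line for line in lines if not ANCHOR.match(line)]


def unfold(lines):
    """Join the lines again and restore the original newlines."""
    return "".join(lines).replace(MARK, "\n")
```

To use it as a stdin-to-stdout filter, one would read `sys.stdin.read()`, pass it through `fold`, `strip_anchor_lines` and `unfold` in that order, and write the result to `sys.stdout`.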
Neither method really ends up being very easy. I suspect there's a whole new language waiting to be made to deal with this.
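For comparison, the parser route can be sketched with Python's standard `html.parser` module instead of Perl's HTML::Parser. This is a minimal sketch, assuming the goal is to remove `<a>` tags but keep their anchor text; the class name `LinkStripper` and the helper `strip_links` are invented for the example. Everything that is not an `<a>` tag is written back as it was parsed, so a URL sitting in some other tag's `title` attribute is left alone.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emit the parsed HTML, dropping <a> tags but keeping their text."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # get_starttag_text() returns the tag exactly as written in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; the raw text already includes "/>".
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def strip_links(html_text):
    """Return html_text with every <a ...> and </a> removed."""
    parser = LinkStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Used as a filter, this would amount to `sys.stdout.write(strip_links(sys.stdin.read()))`. The handlers are more verbose than a regex, but the decision about what counts as a tag is left to the parser rather than to pattern matching.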
---------- Post updated at 01:52 PM ---------- Previous update was at 01:41 PM ----------
Quote:
Originally Posted by Shell_Life
See if this works for you:
That will strip out all URLs. No, it won't, but it will also strip them out of incorrect places inside those tags, should they have a title containing a URL or something.
Last edited by Corona688; 02-17-2011 at 04:28 PM..