04-07-2010
HTML code remove
Hello,
I have a file into which HTML web pages have been inserted at intermittent points.
I would like to remove all text between the "<html xmlns="http://www.w3.org/1999/xhtml">" and "</html>" tags.
Can anyone please suggest a sed regular expression for this?
Thanks
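One possible approach is a sed address range combined with the d command. This is only a sketch: it assumes each stray page starts on a line containing the opening <html xmlns=...> tag and ends on a line containing </html>, and that nothing worth keeping shares a line with either tag. The sample file and its contents are made up for illustration.

```shell
# Build a small sample file (hypothetical content) with an HTML page embedded in it.
sample=$(mktemp)
cat > "$sample" <<'EOF'
record 1
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>stray page</title></head>
<body>noise</body>
</html>
record 2
EOF

# Delete every line from the opening <html ...> line through the closing </html> line.
# Assumption: both tags sit on their own lines; text sharing a line with them is deleted too.
cleaned=$(sed '/<html xmlns="http:\/\/www\.w3\.org\/1999\/xhtml">/,/<\/html>/d' "$sample")
printf '%s\n' "$cleaned"

rm -f "$sample"
```

The range is applied again for every later opening tag, so several embedded pages in one file are all removed. GNU sed also accepts -i to rewrite the file in place; trying it on a copy first is safer.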
REGEXP(6) Games Manual REGEXP(6)
NAME
regexp - regular expression notation
DESCRIPTION
A regular expression specifies a set of strings of characters. A member of this set of strings is said to be matched by the regular
expression. In many applications a delimiter character, commonly /, bounds a regular expression. In the following specification for regular
expressions the word `character' means any character (rune) but newline.
The syntax for a regular expression e0 is
e3: literal | charclass | '.' | '^' | '$' | '(' e0 ')'
e2: e3
| e2 REP
REP: '*' | '+' | '?'
e1: e2
| e1 e2
e0: e1
| e0 '|' e1
A literal is any non-metacharacter, or a metacharacter (one of .*+?[]()|\^$), or the delimiter preceded by \.
A charclass is a nonempty string s bracketed [s] (or [^s]); it matches any character in (or not in) s. A negated character class never
matches newline. A substring a-b, with a and b in ascending order, stands for the inclusive range of characters between a and b. In s,
the metacharacters -, ], an initial ^, and the regular expression delimiter must be preceded by a \; other metacharacters have no special
meaning and may appear unescaped.
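As a rough illustration of character classes and ranges, the sketch below uses grep -E as a stand-in for the notation described here. That is an assumption: POSIX bracket expressions do not follow the same escaping rules (a backslash is literal inside them), so only the shared subset, [s], [^s] and a-b, is shown. The sample words are made up.

```shell
# LC_ALL=C keeps the range a-f to plain ASCII ordering regardless of locale.

# Lines whose first character is in the range a-f.
in_range=$(printf 'apple\nzebra\nfig\n' | LC_ALL=C grep -E '^[a-f]')

# Lines whose first character is NOT in the range a-f.
not_in_range=$(printf 'apple\nzebra\nfig\n' | LC_ALL=C grep -E '^[^a-f]')

printf 'in range:\n%s\n' "$in_range"
printf 'not in range:\n%s\n' "$not_in_range"
```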
A . matches any character.
A ^ matches the beginning of a line; $ matches the end of the line.
The REP operators match zero or more (*), one or more (+), zero or one (?), instances respectively of the preceding regular expression e2.
A concatenated regular expression, e1e2, matches a match to e1 followed by a match to e2.
An alternative regular expression, e0|e1, matches either a match to e0 or a match to e1.
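The REP operators and alternation can be tried out with grep -E, which is assumed here to behave the same way for these particular operators. The sample words are invented for the sketch.

```shell
# ? : zero or one instance of the preceding 'u'.
opt=$(printf 'color\ncolour\ncolouur\n' | grep -E '^colou?r$')

# * : zero or more instances of the preceding 'b'.
star=$(printf 'ac\nabc\nabbc\n' | grep -E '^ab*c$')

# + : one or more instances of the preceding 'b'.
plus=$(printf 'ac\nabc\nabbc\n' | grep -E '^ab+c$')

# | : either alternative; the parentheses group the alternation between ^ and $.
alt=$(printf 'cat\ndog\ncow\n' | grep -E '^(cat|dog)$')

printf '%s\n' "?: $opt" "*: $star" "+: $plus" "|: $alt"
```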
A match to any part of a regular expression extends as far as possible without preventing a match to the remainder of the regular
expression.
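The longest-match rule matters when stripping tags. The sketch below uses sed, whose POSIX matching is assumed to extend matches in the same way; the input string is made up. A pattern such as <.*> runs from the first < to the last > on the line, while a negated character class stops at the first >.

```shell
line='a<b>c<d>e'

# .* extends as far as possible, so the match covers "<b>c<d>".
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# [^>]* cannot cross a '>', so the match covers only "<b>".
bounded=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

printf '%s\n' "greedy:  $greedy" "bounded: $bounded"
```

Adding the g flag to the bounded form removes every tag on the line while keeping the text between them.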
SEE ALSO
awk(1), ed(1), sam(1), sed(1), regexp(2)