searching & replacing/removing only certain HTML tags


 
# 1  
Old 04-20-2010

I save a lot of web pages to read offline, which works out great for school. Now that I have to spend a lot of time on the bus, I am looking for the best way to read some of these pages on my Nokia 7610.

I have uploaded the files to my phone, but they are painfully slow to open and very clunky to navigate with the phone's keys.

Since then I have been copying and pasting content from the web pages into gedit and adding a few basic HTML tags so that the layout is reasonably pleasant. It looks good and navigates much more quickly than the original pages do on the phone.

But now I realize I need a more automated way to add, remove, and search and replace HTML. So I am wondering what tools are available to an Ubuntu/Xubuntu user for searching and replacing certain tags while leaving other tags intact.

For example:

HTML Code:
<tr><td class="padleft12"><i>And so she did. (3.3.18)</i></td></tr>
<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>
<tr><td class="padleft12"><i>Wear your eye thus, not jealous nor secure:</i></td></tr>
Using gedit I have no problem searching for all instances of:
HTML Code:
padleft12"><i>Thought:</i>
and replacing with:

HTML Code:
padleft12"><b>Thought:</b>
But now I find I also have to remove the </i> tag at the end of that same HTML row. I wonder if there is a way, or an application, to select one tag and tell the search to find the next instance of a character such as '<'.

So in this example:

HTML Code:
<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>
I would like to search for the next occurrence of '</i>' after the '</b>' tag while ignoring all regular text in between. Is that possible?

I hope that made sense.
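
One way this could be done is with sed, using a bracket expression that means "any character except '<'" to skip over the plain text between the two tags. A minimal sketch, assuming the row sits on a single line as in the example above; the variable names row and fixed are only for illustration:

```shell
# The sample row from the post, with the stray </i> before </td>.
row='<tr><td class="padleft6"><b>Thought:</b> When Iago wants to make Othello ... observe her well with Cassio;</i></td></tr>'

# On lines containing <b>Thought:</b>, match </b>, then any run of characters
# that are not '<' (the plain text), then </i>. The \( \) group keeps
# everything up to the </i>, so only the </i> itself is dropped.
fixed=$(printf '%s\n' "$row" | sed '/<b>Thought:<\/b>/s/\(<\/b>[^<]*\)<\/i>/\1/')

printf '%s\n' "$fixed"
```

Because [^<]* stops at the first '<', the substitution only fires when </i> is the very next tag after </b>; rows with another tag in between are left alone.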
# 2  
Old 04-22-2010
Since posting I have looked harder at sed, and it seems capable of doing everything I need and then some. However, I am at my wits' end here. I have a command that looks like it should work, but it does not. I am certain it is user error, so if anyone can help me with this one, my sanity would sure appreciate it.

Code:
sed "/Thought\:/s/<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td /<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>\n\t\t\t<tr><td /g" oth1.html > oth2.html

I thought for sure it had to do with escaping the newlines or tabs, but this one works fine:

Code:
sed "/Thought\:/s/<\/tr>/<\/tr>\n\t\t\t<tr><td>.<\/td><\/tr>/g" oth1.html > oth2.html

I have made some pretty dumb mistakes so far trying to learn how to tame sed, but I cannot see why the first example above does not work.
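
The most likely reason the first command never matches: sed reads its input one line at a time, so the pattern space normally holds a single line with no newline in it, and a \n on the search side has nothing to match. In the second command the \n appears only in the replacement, which is why that one works. A minimal sketch, assuming GNU sed (the Ubuntu default) and a made-up three-line fragment with the tabs left out; the N command appends the next input line to the pattern space so the search can span the line break:

```shell
# Hypothetical sample input (tabs omitted for brevity).
sample='<tr><td><b>Thought:</b> text</td></tr>
<tr><td>.</td></tr>
<tr><td class="padleft12">next line</td></tr>'

# Without N the pattern space holds one line, so \n in the pattern never matches
# and the input passes through unchanged.
unchanged=$(printf '%s\n' "$sample" |
  sed '/Thought:/s/<\/tr>\n<tr><td>\.<\/td><\/tr>/X/')

# With N the following line is appended to the pattern space, so \n can match.
# The replacement keeps the match (&) and adds one more spacer row after it.
changed=$(printf '%s\n' "$sample" |
  sed '/Thought:/{N;s/<\/tr>\n<tr><td>\.<\/td><\/tr>/&\n<tr><td>.<\/td><\/tr>/;}')

printf '%s\n' "$changed"
```

To span three lines, as the original pattern does, N would be needed twice before the substitution.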

BTW, is there a way to show the current column position of the cursor in either xterm or GNOME Terminal? That might be a big help.


cheers,
nap

Last edited by naphelge; 04-22-2010 at 12:22 AM..
# 3  
Old 04-23-2010
Hey guys, things are coming along slowly but surely with sed. I now have sed script files that do everything I need, but still with quite a bit of manual effort.

I am trying to use a sed script inside a bash for loop to substitute each $name variable automatically, but I am having problems because the output file is overwritten on each iteration of the loop, so by the end only the last name in the for list has been changed in the output file.

Can someone please help me fine-tune this so that the loop changes one name, saves the result, and then changes the next name using the saved file from the previous iteration? I think it needs a counter, but I have played with counters and cannot seem to get the desired result.

Code:
#!/bin/bash
for name in BAPTISTA TRANIO HORTENSIO GREMIO GRUMIO PETRUCHIO WIDOW
do		
		#append colon after any lines that only contain name
		#put name on new line whenever a period immediately precedes it
		#put name on new line whenever a closing bracket immediately precedes it
		#all instances of '”' need to be substituted for '"'
		#rm any chars coming after the colon following the name
	sed -e "/^$name$/s/$name/$name\:/g" -e	"/$name/s/.$name/.\n$name\:/g" -e "/$name/s/)$name/)\n$name\:/g" -e "/$name/s/\"$name/\"\n$name\:/g" -e "s/\($name\:\).*/\1/g" -e "/$name/s/$name\:/<tr><td class=\"padleft6\"><b>$name\:<\/b><\/td><\/tr>/g" $filename1 > $filename2
done

All of the sed substitutions work fine if I run them from a sed script file. I am pretty sure they should work here as is, since in the end result the name WIDOW gets changed in the file as desired. I think the rest of the names do too, but then get written over.
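
That overwriting follows from the loop body: every pass reads the untouched $filename1 and redirects into $filename2, so each pass throws away the result of the one before. A possible fix, sketched here with hypothetical temporary files (infile, outfile), a made-up four-line input, and only the first substitution from the script: copy the input once, then have each pass read that copy and replace it.

```shell
# Hypothetical stand-ins for $filename1 and $filename2.
infile=$(mktemp)
outfile=$(mktemp)
printf '%s\n' 'BAPTISTA' 'Some speech.' 'TRANIO' 'Another speech.' > "$infile"

# Copy once, outside the loop, so the original stays untouched.
cp "$infile" "$outfile"

for name in BAPTISTA TRANIO
do
    # Read the working copy, write to a temporary file, then move it back,
    # so the next name is applied on top of the previous result.
    sed -e "/^$name\$/s/$name/$name:/" "$outfile" > "$outfile.tmp" &&
        mv "$outfile.tmp" "$outfile"
done

cat "$outfile"
```

No counter is needed with this layout. With GNU sed, the temporary file and mv could probably be replaced by sed -i, which edits the working copy in place.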

I know that I still need to manually enter the names into the bash script, but doing that and having the for loop correct will still save me a schwack of time.

Thanks for any help. I have never done any programming before, so there are lots of new ideas here that probably need modification.

cheers,