Script to delete HTML tag


 
# 8  
Old 11-20-2011
Paste the output of
Code:
grep '(^|\\.)' /tmp/temp_ad_file

--ahamed
# 9  
Old 11-20-2011
The file is very long, so I don't think it would be a good idea to paste the entire file.
Here is an extract
Code:
(^|\.)zanox-affiliate\.de$
(^|\.)zanox\.com$
(^|\.)zantracker\.com$
(^|\.)zde-affinity\.edgecaching\.net$
(^|\.)zedo\.com$
(^|\.)zencudo\.co\.uk$
(^|\.)zenzuu\.com$
(^|\.)zeus\.developershed\.com$
(^|\.)zeusclicks\.com$
(^|\.)zintext\.com$
(^|\.)zmedia\.com$


Last edited by Scott; 11-20-2011 at 02:22 PM.. Reason: Please start using code tags.
# 10  
Old 11-20-2011
Boy, is this thread confusing.

A couple of observations that might help clear up the problem. First, the original post refers to 'removing html' from the file. However, the file pulled from yoyo.org with wget using text/plain does not contain any HTML. Moreover, the reference to
Quote:
The HTML tags would be "(^|\.)" & $. If left in that list, the acl squid can't use the file.
seems to indicate that while incorrectly calling (^|\.) HTML, these strings are not desired. Depending on how squid is configured, this is true, they need to be removed.

The file from yoyo.org is a list of regular expressions, and if squid isn't configured with acl ads dstdom_regex -i "[/usr/local]/etc/squid.adservers" then the regex parts will cause problems. I believe things have stopped working because the configuration on the old machine isn't the same as on FreeBSD. Given this, the original code that extracts the regex lines from the yoyo.org data using the regex string makes sense.
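To see what one of these list entries actually matches, a pattern can be tried against a few host names with grep -E. This is only a sketch: grep's extended regex stands in for squid's own regex matching, try_host is a made-up helper, and the host names are invented for illustration.

```shell
#!/bin/sh
# One entry taken from the list extract posted earlier in the thread.
pattern='(^|\.)zedo\.com$'

# try_host NAME: hypothetical helper, reports whether NAME matches the pattern.
try_host() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo "$1: match"
    else
        echo "$1: no match"
    fi
}

# Illustrative host names, not taken from the list.
result=$(for host in zedo.com ads.zedo.com notzedo.com zedo.com.example; do
    try_host "$host"
done)
printf '%s\n' "$result"
```

The anchored alternation (^|\.) is what keeps notzedo.com from matching, and the trailing $ is what rejects zedo.com.example.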

This doesn't explain why the output file is ending up empty, but might change the focus on the problem a bit. If the squid config is changed to match the old machine, then the regex file can be used as is, otherwise the regex portions should be stripped:

Code:
sed 's/[()|.$^]//g' /tmp/temp_ad_file >/usr/local/etc/squid/squid.adservers

Care should be taken if these strings are used without the regex as they might match more URLs than desired.
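A quick way to see what the command does is to run it on a single entry. The sketch below uses one line from the extract posted earlier. The second command is a variant that is not from the thread: it removes the backslashes and keeps the dots, on the assumption that a plain domain entry such as .zedo.com is the goal.

```shell
#!/bin/sh
# One entry from the list extract posted earlier in the thread.
sample='(^|\.)zedo\.com$'

# The character class as posted: deletes ( ) | . $ ^ but not the backslash.
as_posted=$(printf '%s\n' "$sample" | sed 's/[()|.$^]//g')

# Assumed variant: deletes ( ) | $ ^ and the backslash, keeps the dots.
keep_dots=$(printf '%s\n' "$sample" | sed 's/[()|$^\\]//g')

printf 'as posted: %s\n' "$as_posted"   # as posted: \zedo\com
printf 'keep dots: %s\n' "$keep_dots"   # keep dots: .zedo.com
```

Inside a bracket expression the dot is an ordinary character, so the posted class strips the dots from the domain and leaves the backslashes behind. Comparing a few lines of the real output file against both forms should show which one the squid configuration needs.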

I'm interested in knowing if the sed above has the same problem -- generates an empty file. If it does, then I question the permissions on the output file. What happens if the output file is changed to something like >/tmp/foo?
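One way to run that test is sketched below. check_target is a made-up helper, not part of the original script: it checks that the target can be written, runs the sed on a single sample line, and reports whether the result is non-empty. It overwrites its target, so it is meant for a scratch path such as /tmp/foo, not for the live squid file.

```shell
#!/bin/sh
# check_target FILE: hypothetical helper for testing a scratch output path.
check_target() {
    target=$1
    dir=$(dirname "$target")

    # Refuse early if the file, or the directory it would be created in,
    # is not writable by the current user.
    if [ -e "$target" ] && [ ! -w "$target" ]; then
        echo "not writable: $target"
        return 1
    elif [ ! -e "$target" ] && [ ! -w "$dir" ]; then
        echo "cannot create files in: $dir"
        return 1
    fi

    # Run the sed from the post on one sample entry.
    printf '%s\n' '(^|\.)zedo\.com$' | sed 's/[()|.$^]//g' > "$target"

    if [ -s "$target" ]; then
        echo "ok: $target is non-empty"
    else
        echo "empty: $target"
        return 1
    fi
}

check_target /tmp/foo
```

If this reports a non-empty file while the real output file still ends up empty, the ownership and mode of the real file and its directory (ls -l, ls -ld) are the next things to look at.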
# 11  
Old 11-20-2011
agama, thanks for your help here. You have understood my issue better than I could ever explain it. I am very sorry for calling those tags HTML when they were not and causing so much confusion. My apologies.
Coming back to the script; the tags are completely gone now. My ACL in squid is back to working and the final destination folder does not get emptied any more. Thank you so much for your help again agama.

One small thing where I am still having some difficulty; the learning curve is steep.

When the script is run, the contents of /tmp/temp_ad_file are displayed in the console. Is there a way to not display temp_ad_file in the console?
Code:
#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mimetype=plai
# Removing repeated lines
cat /tmp/temp_ad_file | uniq
# Removing blank lines
sed /^$/d /tmp/temp_ad_file
# Cleaning list
sed 's/[()|.$^]//g' /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file

Kind Regards,

---------- Post updated at 05:18 PM ---------- Previous update was at 04:53 PM ----------

So I think I have figured it out

I am sending into the bit bucket both lines below
Code:
cat /tmp/temp_ad_file | uniq 2>&1 > /dev/null
sed /^$/d /tmp/temp_ad_file 2>&1 > /dev/null
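The order of the two redirections matters. A minimal sketch, using a made-up function noisy that writes one line to each stream:

```shell
#!/bin/sh
# noisy: hypothetical stand-in for a command that writes to both streams.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Order used above: stderr is pointed at the current stdout first, then
# stdout alone is sent to /dev/null, so stderr still gets through.
stderr_kept=$(noisy 2>&1 > /dev/null)

# Reversed order: stdout goes to /dev/null first, then stderr follows it.
both_dropped=$(noisy > /dev/null 2>&1)

printf 'first form let through:  [%s]\n' "$stderr_kept"
printf 'second form let through: [%s]\n' "$both_dropped"
```

With the order used above, normal output is hidden but error messages still appear, which may be exactly what is wanted in a script run by hand.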

I have bought the book "mastering.unix.shell.scripting" to try to understand a bit more about *nix scripting. I would recommend this book for advanced users; for someone like me, it's a bit tough.

Regards,

zongo

Moderator's Comments:
Mod Comment Please use code tags when posting data and code samples, thank you.

Last edited by Franklin52; 11-21-2011 at 08:05 AM..
# 12  
Old 11-20-2011
Quote:
Originally Posted by zongo
Code:
# cat /tmp/temp_ad_file | grep "(^|\.)" > "/usr/local/etc/squid/squid.adservers"

What about :
Code:
 /usr/bin/tr -d '[(^|\\$)]' < /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
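For comparison, here is what that tr call does to a single entry from the extract posted earlier. This is just a sketch on one sample line; the set deletes the square brackets, parentheses, caret, pipe, backslash, and dollar sign, and leaves the dots alone.

```shell
#!/bin/sh
# One entry from the list extract posted earlier in the thread.
sample='(^|\.)zedo\.com$'

# Same character set as the tr command above, applied to the sample.
stripped=$(printf '%s\n' "$sample" | tr -d '[(^|\\$)]')

printf '%s\n' "$stripped"   # .zedo.com
```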

# 13  
Old 11-20-2011
Quote:
Originally Posted by zongo
So I think I have figured it out

I am sending into the bit bucket both lines below

cat /tmp/temp_ad_file | uniq 2>&1 > /dev/null
sed /^$/d /tmp/temp_ad_file 2>&1 > /dev/null
zongo
Glad you've got it working, and it seems you're willing to understand the why along with the how, which is important.

In general, if you aren't doing anything with the output of the above commands, then you don't need to run them; just comment them out, or remove them completely. However, looking at your script, I believe the intent was to execute these as a pipeline rather than by themselves. The data from yoyo.org needs to be sorted for uniq to work, and since the input comes from an external source (order unknown), it is best to sort with the unique option rather than assuming it's sorted. The sed to remove blank lines can be combined with the sed to delete the regex stuff, so the pipeline is just two commands to accomplish all 4 things:

Code:
 # Removing repeated lines
 sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g'  > /usr/local/etc/squid/squid.adservers
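The difference between uniq alone and sort -u shows up as soon as duplicates are not adjacent. A small sketch with made-up input (two distinct names, one repeated, plus a blank line):

```shell
#!/bin/sh
# Illustrative unsorted input; zedo.com appears twice, but not on
# neighbouring lines, and there is one blank line.
list='zedo.com
zanox.com

zedo.com'

# uniq only collapses adjacent repeats, so both zedo.com lines survive.
with_uniq=$(printf '%s\n' "$list" | uniq | sed '/^$/d' | wc -l | tr -d ' ')

# sort -u sorts first, so the repeat is removed regardless of position.
with_sort_u=$(printf '%s\n' "$list" | sort -u | sed '/^$/d' | wc -l | tr -d ' ')

echo "lines left after uniq:    $with_uniq"     # 3
echo "lines left after sort -u: $with_sort_u"   # 2
```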

# 14  
Old 11-20-2011
agama, thanks.
I have amended the script as per your suggestions and get an even better result.
I have moved the "cleaning list" line up so that it sits above the "uniq" line. That way the list is sorted, cleaned, and stripped of blank lines, and then any remaining repetition is removed.

I am now using the "uniq" filter on the final destination file, squid.adservers, and no longer on /tmp/temp_ad_file.

danmero, thanks for helping as well.

Kind Regards,

Code:
#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mime
# Cleaning list
sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g' > /usr/local/etc/squid/squid.adservers
# Removing repeated lines
cat /usr/local/etc/squid/squid.adservers | uniq
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file
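Since the earlier problem was the destination file ending up empty, one possible safeguard is to build the cleaned list in a scratch file and only replace the live file when the result is non-empty. The sketch below is not from the thread: clean_list is a made-up helper, the .new suffix is arbitrary, and the sed character class is an assumed variant that removes backslashes and keeps dots (giving entries like .zedo.com).

```shell
#!/bin/sh
# clean_list SRC DEST: hypothetical helper. Sorts and cleans SRC into a
# scratch file, then replaces DEST only if the scratch file has content.
clean_list() {
    src=$1
    dest=$2
    tmp="$dest.new"

    # Assumed variant of the cleaning step: strips ( ) | $ ^ and backslash.
    sort -u "$src" | sed '/^$/d; s/[()|$^\\]//g' > "$tmp" || return 1

    if [ -s "$tmp" ]; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "cleaned list is empty; keeping existing $dest" >&2
        return 1
    fi
}

# Possible use in the script above (left commented out here):
# clean_list /tmp/temp_ad_file /usr/local/etc/squid/squid.adservers \
#     && /usr/local/sbin/squid -k reconfigure
```

With this arrangement a failed or empty download leaves the previous list in place, and squid is only reconfigured when a new list was actually written.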


Last edited by Franklin52; 11-21-2011 at 08:04 AM.. Reason: Code tags