Script to delete HTML tag

11-20-2011


Paste the output of

Code:

grep '(^|\\.)' /tmp/temp_ad_file

--ahamed

ahamed101


11-20-2011


The file is very long. I don't think it would be a good idea to paste the entire file.

Here is an extract:

Code:

(^|\.)zanox-affiliate\.de$
(^|\.)zanox\.com$
(^|\.)zantracker\.com$
(^|\.)zde-affinity\.edgecaching\.net$
(^|\.)zedo\.com$
(^|\.)zencudo\.co\.uk$
(^|\.)zenzuu\.com$
(^|\.)zeus\.developershed\.com$
(^|\.)zeusclicks\.com$
(^|\.)zintext\.com$
(^|\.)zmedia\.com$

Last edited by Scott; 11-20-2011 at 02:22 PM.. Reason: Please start using code tags.

zongo


11-20-2011


Boy is this thread confusing

A couple of observations that might help clear up the problem. First, the original post refers to 'removing html' from the file. However, the file pulled from yoyo.org with wget using text/plain does not contain any HTML. Moreover, the reference to

Quote:

The HTML tags would be "(^|\.)" & $. If left in that list, the acl squid can't use the file.

seems to indicate that, although (^|\.) is incorrectly called HTML, these strings are not wanted. Depending on how squid is configured, that is true: they need to be removed.

The file from yoyo.org is a list of regular expressions, and if squid isn't configured with acl ads dstdom_regex -i "[/usr/local]/etc/squid.adservers" then the regex parts will cause problems. I believe things have stopped working because the configuration on the old machine isn't the same as on FreeBSD. Given this, the original code that extracts the regex lines from the yoyo.org data using the regex string makes sense.
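To make the "list of regular expressions" point concrete, here is a small sketch that uses grep -E as a stand-in for squid's matcher. That is an assumption: squid's dstdom_regex uses its own regex engine, and the hostnames below are made up for illustration.

```shell
#!/bin/sh
# One entry from the list, used as an extended regular expression.
pattern='(^|\.)zanox\.com$'

# Hypothetical hostnames, one per line.
hosts='zanox.com
www.zanox.com
notzanox.com
zanox.com.example.org'

# grep -E -i stands in for squid's case-insensitive regex matching (an assumption).
matched=$(printf '%s\n' "$hosts" | grep -E -i "$pattern")
printf '%s\n' "$matched"
```

Only zanox.com and www.zanox.com are printed: the (^|\.) part requires the name to start there or to be preceded by a dot, and the $ anchors the end.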

This doesn't explain why the output file is ending up empty, but it might shift the focus of the problem a bit. If the squid config is changed to match the old machine, then the regex file can be used as is; otherwise the regex portions should be stripped:

Code:

sed 's/[()|.$^]//g' /tmp/temp_ad_file >/usr/local/etc/squid/squid.adservers

Care should be taken if these strings are used without the regex as they might match more URLs than desired.

I'm interested in knowing if the sed above has the same problem -- generates an empty file. If it does, then I question the permissions on the output file. What happens if the output file is changed to something like >/tmp/foo?
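For reference, a quick way to check what the sed above produces is to run it on a single line from the extract. This is only a sketch; the variable names are made up. As the command is written here, the bracket expression includes the dot but not the backslash, so the dots are removed and the backslashes stay. The second command is a possible alternative, assuming the goal is a plain domain name.

```shell
#!/bin/sh
line='(^|\.)zanox\.com$'

# The sed from this post: deletes ( ) | . $ ^ wherever they occur.
stripped=$(printf '%s\n' "$line" | sed 's/[()|.$^]//g')
printf '%s\n' "$stripped"    # prints: \zanox\com

# Possible alternative (assumption about the desired result): drop the
# leading "(^|\.)", turn each "\." into ".", and drop the trailing "$".
domain=$(printf '%s\n' "$line" |
    sed -e 's/^[(][^|]*[|]\\[.][)]//' -e 's/\\[.]/./g' -e 's/[$]$//')
printf '%s\n' "$domain"      # prints: zanox.com
```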


agama


11-20-2011


agama, thanks for your help here. You have understood my issue better than I could ever explain it. I am very sorry for calling those tags HTML when they were not, and for causing so much confusion. My apologies.
Coming back to the script: the tags are completely gone now. My ACL in squid is working again and the final destination file no longer ends up empty. Thank you so much for your help again, agama.

There is still one small thing I am having difficulty with; the learning curve is steep.

When the script is run, the contents of /tmp/temp_ad_file are displayed in the console. Is there a way to not display temp_ad_file in the console?

Code:

#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mimetype=plai
# Removing repeated lines
cat /tmp/temp_ad_file | uniq
# Removing blank lines
sed /^$/d /tmp/temp_ad_file
# Cleaning list
sed 's/[()|.$^]//g' /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file

Kind Regards,

---------- Post updated at 05:18 PM ---------- Previous update was at 04:53 PM ----------

So I think I have figured it out

I am sending into the bit bucket both lines below

Code:

cat /tmp/temp_ad_file | uniq 2>&1 > /dev/null
sed /^$/d /tmp/temp_ad_file 2>&1 > /dev/null
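A side note on the order of the redirections, as a minimal sketch: the shell processes them left to right, so 2>&1 > /dev/null points stderr at the original stdout and only then discards stdout, while > /dev/null 2>&1 discards both. The noisy function below is hypothetical and exists only to produce one line on each stream.

```shell
#!/bin/sh
# Hypothetical command that writes one line to stdout and one to stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stderr is duplicated onto the current stdout first, then stdout is
# discarded: the stderr line still comes through.
first=$(noisy 2>&1 > /dev/null)

# stdout is discarded first, then stderr follows it: nothing comes through.
second=$(noisy > /dev/null 2>&1)

printf 'first=[%s] second=[%s]\n' "$first" "$second"
```

For the two lines above the difference does not matter much, since the aim is only to hide the normal output.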

I have bought the book "mastering.unix.shell.scripting" to try to understand a bit more about *nix scripting. I would recommend it for advanced users; for someone like me, it's a bit tough.

Regards,

zongo

Moderator's Comments:

Please use code tags when posting data and code samples, thank you.

Last edited by Franklin52; 11-21-2011 at 08:05 AM..

zongo


11-20-2011


Quote:

Originally Posted by zongo

Code:

# cat /tmp/temp_ad_file | grep "(^|\.)" > "/usr/local/etc/squid/squid.adservers"

What about:

Code:

 /usr/bin/tr -d '[(^|\\$)]' < /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
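To check what this tr call does, here is a quick sketch on one line from the extract (the variable names are made up). tr -d treats its argument as a set of single characters, so the brackets, parentheses, caret, pipe, backslash and dollar sign are deleted while the dots are kept.

```shell
#!/bin/sh
line='(^|\.)zanox-affiliate\.de$'

# Delete every occurrence of the characters [ ( ^ | \ $ ) ] and keep the rest.
cleaned=$(printf '%s\n' "$line" | tr -d '[(^|\\$)]')
printf '%s\n' "$cleaned"    # prints: .zanox-affiliate.de
```

The result keeps a leading dot in front of the domain name, so it is worth checking that the ACL type in use accepts that form.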

danmero

View Public Profile for danmero

Find all posts by danmero

11-20-2011

Registered User

1,466, 512

Join Date: Jul 2010

Last Activity: 7 April 2014, 3:02 PM EDT

Location: earth>US>UTC-5

Posts: 1,466

Thanks Given: 110

Thanked 512 Times in 491 Posts

Quote:

Originally Posted by zongo

So I think I have figured it out

I am sending into the bit bucket both lines below

cat /tmp/temp_ad_file | uniq 2>&1 > /dev/null
sed /^$/d /tmp/temp_ad_file 2>&1 > /dev/null
zongo

Glad you've got it working, and it seems you're willing to understand the why along with the how, which is important.

In general, if you aren't doing anything with the output of the above commands, then you don't need to run them; just comment them out, or remove them completely. However, looking at your script, I believe the intent was to execute these as a pipeline rather than by themselves. The data from yoyo.org needs to be sorted for uniq to work, and since you get the input from an external source (order unknown), it is best to sort with the unique option rather than assuming it's sorted. The sed to remove blank lines can be combined with the sed to delete the regex stuff, so the pipeline is just two commands to accomplish all 4 things:

Code:

 # Removing repeated lines
 sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g'  > /usr/local/etc/squid/squid.adservers
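The point about uniq needing sorted input can be seen with a small sketch. The three sample lines are made up; uniq only collapses duplicates that sit next to each other, while sort -u sorts first and so catches all of them.

```shell
#!/bin/sh
# Unsorted sample input with a repeated line that is not adjacent to its twin.
sample='zedo.com
zanox.com
zedo.com'

# Count the lines each approach leaves (tr strips the padding some wc versions add).
with_uniq=$(printf '%s\n' "$sample" | uniq | wc -l | tr -d ' ')
with_sort_u=$(printf '%s\n' "$sample" | sort -u | wc -l | tr -d ' ')

echo "uniq alone: $with_uniq lines"
echo "sort -u:    $with_sort_u lines"
```

uniq alone leaves all 3 lines, while sort -u leaves 2.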

agama


11-20-2011


agama, thanks.
I have amended the script as per your suggestions and I get an even better result.
I have moved the "cleaning list" line up so that it sits above the "uniq" line. That way the list is sorted, cleaned, and stripped of blank lines, and then all repetitions are removed.

I am now using the "uniq" filter on the final destination file, squid.adservers, and no longer on /tmp/temp_ad_file.

danmero, thanks for helping as well.

Kind Regards,

Code:

#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mime
# Cleaning list
sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g' > /usr/local/etc/squid/squid.adservers
# Removing repeated lines
cat /usr/local/etc/squid/squid.adservers | uniq
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file
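Two remarks on the script above, followed by a hedged sketch. First, cat ... | uniq only prints the file to the console and leaves squid.adservers unchanged, and sort -u has already removed the duplicates, so that line can be dropped. Second, since an empty output file was part of the original problem, one option is to write the cleaned list to a temporary file and replace the real one only if the result is non-empty. The helper name build_list and the throwaway demo files are assumptions for illustration, not part of the original script.

```shell
#!/bin/sh
# build_list INPUT OUTPUT (hypothetical helper)
# Cleans INPUT with the same sort/sed pipeline as the script above and
# replaces OUTPUT only if the cleaned result is non-empty.
build_list() {
    input=$1
    output=$2
    tmp=$(mktemp "${output}.XXXXXX") || return 1

    sort -u "$input" | sed '/^$/d; s/[()|.$^]//g' > "$tmp"

    if [ -s "$tmp" ]; then
        mv "$tmp" "$output"
    else
        echo "cleaned list is empty, keeping old $output" >&2
        rm -f "$tmp"
        return 1
    fi
}

# Demo on throwaway files instead of the real squid paths.
workdir=$(mktemp -d) || exit 1
printf '%s\n' '(^|\.)zedo\.com$' '' '(^|\.)zanox\.com$' '(^|\.)zedo\.com$' \
    > "$workdir/in"
build_list "$workdir/in" "$workdir/out" && cat "$workdir/out"
```

In the real script, the squid -k reconfigure step could then be run only when build_list succeeds.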

Last edited by Franklin52; 11-21-2011 at 08:04 AM.. Reason: Code tags

zongo
