A couple of observations that might help clear up the problem. First, the original post refers to 'removing html' from the file. However, the file pulled from yoyo.org with wget as text/plain does not contain any HTML. Moreover, the reference to
Quote:
The HTML tags would be "(^|\.)" & $. If left in that list, the acl squid can't use the file.
seems to indicate that, although (^|\.) is incorrectly called HTML, these strings are not wanted. Depending on how squid is configured, that is true: they need to be removed.
The file from yoyo.org is a list of regular expressions, and if squid isn't configured with acl ads dstdom_regex -i "[/usr/local]/etc/squid.adservers" then the regex parts will cause problems. I believe things have stopped working because the configuration on the old machine isn't the same as on FreeBSD. Given this, the original code, which extracts the regex lines from the yoyo.org data using the regex string, makes sense.
This doesn't explain why the output file is ending up empty, but might change the focus on the problem a bit. If the squid config is changed to match the old machine, then the regex file can be used as is, otherwise the regex portions should be stripped:
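A minimal sketch of such a sed, assuming each line of the list has the form (^|\.)example\.com$. The function name strip_regex is made up for illustration, and this is not necessarily the command that was originally posted:

```shell
# Hypothetical helper: reads the regex list on stdin, writes plain domains.
# Assumes every entry looks like:  (^|\.)example\.com$
strip_regex() {
    # 1st expression: drop the leading "(^|\.)"
    # 2nd expression: drop the trailing "$"
    # 3rd expression: turn each escaped dot "\." into a plain "."
    sed -e 's/^(\^|\\\.)//' \
        -e 's/\$$//' \
        -e 's/\\\././g'
}

# Example with a single made-up entry:
printf '%s\n' '(^|\.)example\.com$' | strip_regex
# prints: example.com
```

The expressions are plain basic regular expressions, so the parentheses and the bar are matched literally and only the caret, the dollar sign and the backslash need escaping.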
Care should be taken if these strings are used without the regex as they might match more URLs than desired.
I'm interested in knowing if the sed above has the same problem -- generates an empty file. If it does, then I question the permissions on the output file. What happens if the output file is changed to something like >/tmp/foo?
agama, thanks for your help here. You have understood my issue better than I could ever explain it. I am very sorry for calling those tags HTML when they were not and causing so much confusion. My apologies.
Coming back to the script: the tags are completely gone now. My ACL in squid is working again and the final destination file no longer ends up empty. Thank you so much for your help again, agama.
One small thing where I am still having some difficulty (the learning curve is steep):
When the script is run, the contents of /tmp/temp_ad_file are displayed in the console. Is there a way to not display temp_ad_file in the console?
Kind Regards,
---------- Post updated at 05:18 PM ---------- Previous update was at 04:53 PM ----------
So I think I have figured it out.
I am sending the output of both lines below into the bit bucket.
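As a sketch of what sending output to the bit bucket looks like: the two actual lines are not shown here, so show_list below is a hypothetical stand-in for any step that prints the list, and the temporary file is created only for the demonstration.

```shell
# Hypothetical stand-in for a step that echoes the list to the terminal.
show_list() {
    cat "$1"
}

# Throwaway demo file (an assumption, not the real /tmp/temp_ad_file).
tmp_list=$(mktemp)
printf '%s\n' 'example.com' > "$tmp_list"

show_list "$tmp_list" > /dev/null        # stdout discarded; errors still shown
show_list "$tmp_list" > /dev/null 2>&1   # stdout and stderr both discarded

rm -f "$tmp_list"
```

Redirecting to /dev/null only hides the output; the command still runs.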
I have bought the book "Mastering Unix Shell Scripting" to try to understand a bit more about *nix scripting. I would recommend this book for advanced users; for someone like me, it's a bit tough.
Regards,
zongo
Moderator's Comments:
Please use code tags when posting data and code samples, thank you.
Last edited by Franklin52; 11-21-2011 at 08:05 AM..
Glad you've got it working, and it seems you're willing to understand the why as well as the how, which is important.
In general, if you aren't doing anything with the output of the above commands, then you don't need to run them; just comment them out, or remove them completely. However, looking at your script, I believe the intent was to execute these as a pipeline rather than by themselves. The data from yoyo.org needs to be sorted for uniq to work, and since the input comes from an external source (order unknown), it is best to sort with the unique option rather than assume it's already sorted. The sed to remove blank lines can be combined with the sed to delete the regex stuff, so the pipeline is just two commands to accomplish all 4 things:
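A sketch of that two-command pipeline, assuming list entries of the form (^|\.)example\.com$. The function name clean_ad_list is hypothetical, and the exact sed expressions are an illustration rather than the ones originally posted:

```shell
# Hypothetical helper: stdin is the raw list, stdout is the cleaned list.
clean_ad_list() {
    # One sed does two jobs: delete blank lines and strip the regex parts.
    # sort -u then does the other two: sort and drop duplicate lines.
    sed -e '/^[[:space:]]*$/d' \
        -e 's/^(\^|\\\.)//' \
        -e 's/\$$//' \
        -e 's/\\\././g' |
    LC_ALL=C sort -u
}

# Possible usage (paths taken from the thread, adjust as needed):
# clean_ad_list < /tmp/temp_ad_file > /usr/local/etc/squid.adservers
```

Using sort -u instead of a separate uniq avoids relying on the downloaded list already being in order.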
agama, thanks
I have amended as per your suggestions
I get a better result even
I have moved the "cleaning list" line up so that it sits above the "uniq" variable. That way the list is sorted, cleaned, has its blank lines removed, and then has all repetition removed.
I am now using the "uniq" filter on the final destination file, squid.adservers, and no longer on /tmp/temp_ad_file.
danmero, thanks for helping as well.
Kind Regards,
Last edited by Franklin52; 11-21-2011 at 08:04 AM..
Reason: Code tags