sed fails to apply substitute commands


 
# 1  
Old 08-05-2012
[SOLVED] sed fails to apply substitute commands

I've written a shell script for archiving HTML pages, i.e. making them work offline and adding some features.
Here it is:
Code:
#!/bin/sh

if [ "$1" = "" ] || [ "$(echo "$1" | egrep "https?://boards.4chan.org/[a-z0-9]+/res/[0-9]+")" = "" ]; then
echo "Usage: `basename $0` <4chan thread url> <[OPTIONAL: waiting between sessions in seconds]>"
exit 0
fi

echo "4chan downloader"

LOC=$(echo "$1" | sed 's_.\+/res/\([^#]\+\).*_\1_g')

if [ "$LOC" = "" ]; then
echo "Can't determine the thread's number"
exit 0
fi

ST="static.4chan.org"

if [ "$(echo "$2" | egrep "[0-9]*" -o)" != "" ]; then
SLP=$(echo "$2" | egrep "[0-9]*" -o)
else
SLP="10"
fi

alias echo="echo -ne"
N="\r"
R="\n"

thejob () {
if [ ! -d $LOC ]; then
mkdir $LOC
fi

if [ ! -d $LOC/misc ]; then
mkdir $LOC/misc
fi

egrep "//.\.thumbs\.4chan\.org/[a-z0-9]+/thumb/[0-9]*s\.jpg" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/misc

egrep "//${ST}/image/spoiler-?[a-z0-9]*\.png" $LOC.html -o | sed 's_^//_http://_g' | head -n1 >> $LOC/misc/misc

egrep "//${ST}/image/favicon-?[a-z]*\.ico" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc

egrep "//${ST}/image/country/([a-z]*/)?([\w]+\....)" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc

egrep "//${ST}/css/[a-z]+\.[0-9]+\.css" $LOC.html -o | sed -e 's_\.css_\.css\n_g' -e 's_//stat_\nhttp://stat_g' | grep /css/ | head -n1 >> $LOC/misc/misc

egrep "//${ST}/image/title/[a-z]+/[0-9a-z]+\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/logo

egrep "//images\.4chan\.org/[a-z0-9]+/src/[0-9]*\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/images

S1="<script>navigator.userAgent.match(/Presto|Gecko/)\&\&s(d.body,'class','i');(function(){var a=d.querySelectorAll('a.fileThumb'),i=0;for(;a.length>i;i++){s(a[i],'onclick','javascript:imgexp(this);return false;');s(a[i],'title','Click to toggle image size');}})();</script>"

S2="<script>var d=document;function s(a,b,c){a.setAttribute(b,c);}function g(a,b){if('i'===b.class){0>a.top\&\&(b.scrollTop+=a.top-42);}else{0>a.top\&\&(d.documentElement.scrollTop+=a.top-42);}}function imgexp(b){var a=b.firstElementChild,c=b.getBoundingClientRect(),db=d.body;if(null===b.getAttribute('exp')){s(a,'i-old',a.src);a.getAttribute('style')\&\&(s(a,'i-olds',a.getAttribute('style')),a.removeAttribute('style'));a.src=b.href;s(b,'exp','');}else{a.getAttribute('i-olds')\&\&(s(a,'style',a.getAttribute('i-olds')),a.removeAttribute('i-olds'));a.src=a.getAttribute('i-old');a.removeAttribute('i-old');b.removeAttribute('exp');g(c,db);}}</script><style>a[exp]>img{max-width:100%;}body.i a[exp]>img{width:100%!important;}.op:after{clear:both;content:'';display:block;}</style>"

sed -e "s_//${ST}/image/favicon\(-\?[a-z]*\)\.ico_${LOC}/misc/favicon\1.ico_" -e 's_<link rel="alternate style.\+\(<link rel="apple-touch-icon" h\)_\1_' -e "s_//${ST}/css/\([a-z0-9\.]\+\)\.css_${LOC}/misc/\1.css_" -e "s_</body>_${S1}&_" -e "s_</head>_${S2}&_" -e "s_//.\.thumbs\.4chan\.org/[\w]\+/thumb/\([0-9]\+\)s\.jpg_${LOC}/misc/\1s.jpg_g" -e "s_//images\.4chan\.org/[\w]\+/src/\([0-9]\+\)\.\(jpg\|gif\|png\)_${LOC}/\1.\2_g" -e "s_//${ST}/image/title/[a-z]\+/[\w]\+\.\(jpg\|gif\|png\)_${LOC}/misc/logo.\1_g" -e "s_//${ST}/image/spoiler\(-\?[\w]*\)\.png_${LOC}/misc/spoiler\1.png_g" -e "s_//${ST}/image/country/\(\([a-z]*/\)\?\w\+\.gif\)_${LOC}/misc/\1_g" -e "s_\(<a href=\"\)${LOC}\(#p[0-9]\+\"\)_\1\2_g" -e "s_<a href=\"#p${LOC}\" class=\"quotelink\">&gt;&gt;${LOC}_& (OP)_g" -e 's_\(<a href="[0-9]\+\)\(#p[0-9]\+" class="quotelink">&gt;&gt;[0-9]\+\)_\1.html\2 (Cross-thread)_g' -e 's_\(</div></div></div><hr>\)<div class="mobile".\+</div><hr>\(<div class="navLinks navLinksBot">\[<a href="\)\.\./\(\./"[^>]*>Return</a>\] \[<a href="\).top\(">Top</a>\]\).\+</body>_\1\2\3javascript:scroll(0,0);\4<div id="bottom"></div></body>_' -e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_" $LOC.html > a

# :a;N;$!ba;

mv a $LOC.html

cd $LOC

wget --continue -q -i images

rm images

cd misc

if [ "$(ls|grep css)" != "" ]; then
rm "$(ls|grep css)"
fi

wget -q -nc -i misc

CSS=$(cat misc | tail -n1 | sed 's_.*/\([a-z]\+\.[0-9]\+\.css\)_\1_')

sed "s_.*fade\(-\?[a-z]*\)\.png.*_http://${ST}/image/fade\1.png_g" $CSS > misc

wget -q -i misc

sed 's_/image/fade\(-\?[a-z]*\)\.png_fade\1.png_g' $CSS > a

mv a $CSS

if [ "$(ls|grep logo.)" != "" ]; then
rm $(ls|grep logo.)
fi

wget -q -i logo -O "logo.$(sed "s_\._\n_g" logo|tail -n1)"

rm misc logo

touch .nomedia

cd ../..
}

echo "${N}Downloading to $LOC${N}"

echo "${N}"

echo "------------------------------${N}"

while [ "1" = "1" ]; do

trap 'EXIT=1' 1 2 3 15

if [ -s $LOC.html ]; then

wget -np -nd -nH -q -erobots=off $1 -O a

if [ $(wc -c a|cut -d" " -f1) -eq "0" ]; then

echo "Thread has 404'd or 4chan is down. Stopping script${N}"

rm a

exit 0

fi

if [ $(wc -c a|cut -d" " -f1) -gt $(wc -c $LOC.html|cut -d" " -f1) ]; then

mv a $LOC.html

thejob

else

rm a

fi

else

wget -np -nd -nH -q -erobots=off $1 -O $LOC.html

if [ $(wc -c $LOC.html|cut -d" " -f1) -eq "0" ]; then

echo "Thread doesn't exist or 4chan is down. Stopping script${N}"

rm $LOC.html

exit 0

fi

thejob

fi

trap - 1 2 3 15

if [ "$EXIT" = "1" ] || [ "$SLP" = "1" ]; then
echo "Session completed. Exiting ${N}"
exit 0
fi

echo "OK"

sleep $SLP

echo "\b\b \b\b"
done;
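
A side note on the `[\w]` that appears in several of the expressions above: inside a POSIX bracket expression the backslash is a literal character, so `[\w]` matches a backslash or the letter w, not a "word" character. Whether that has anything to do with the three failing expressions is not established here (none of them uses it). A minimal sketch of the difference, with a made-up sample URL and `LOCAL/` as a placeholder replacement:

```shell
# Sample input is made up for illustration.
sample='//images.4chan.org/g/src/123.jpg'

# [\w] is a bracket expression holding a backslash and the letter w,
# so the board name "g" does not match and the line passes through unchanged.
with_backslash_w=$(printf '%s\n' "$sample" | sed 's_//images\.4chan\.org/[\w]\+/src/_LOCAL/_')

# A POSIX character class inside the brackets does match the board name.
with_class=$(printf '%s\n' "$sample" | sed 's_//images\.4chan\.org/[[:alnum:]]\+/src/_LOCAL/_')

printf '%s\n' "$with_backslash_w" "$with_class"
```

The `\+` quantifier is kept from the original script; it is an extension to basic regular expressions, so a strictly POSIX sed would need `\{1,\}` instead.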

The parts not getting applied, even though I have checked them with RegexBuddy:
Code:
-e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_"
-e "s_</body>_${S1}&_"
-e "s_</head>_${S2}&_"

I've tried everything I could, but these fail to apply to the fetched HTML.
There aren't any linebreaks in there. These three should apply to the last line of the HTML, because the </head>, <body> and </body> tags are all on one line.

I'm running this over a remote shell on an older Android smartphone that I've since replaced with a new one. It has "BusyBox v1.19.4-cm9 bionic (2012-02-05 18:40 +0100) multi-call binary" on it. I suppose its applets are the GNU ones.
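
One way to narrow this kind of problem down is to run each expression on its own and check whether it changes anything at all. The following is only a sketch: `try_sub` is a hypothetical helper, the sample HTML line is invented, and `<!--S1-->` stands in for the real `${S1}` snippet.

```shell
# try_sub EXPR FILE: apply a single sed expression and report
# whether the output differs from the input file.
try_sub() {
    if sed -e "$1" "$2" | cmp -s - "$2"; then
        echo "no change"
    else
        echo "changed"
    fi
}

tmp=$(mktemp)
printf '%s\n' '<html><head></head><body>text</body></html>' > "$tmp"

r1=$(try_sub 's_</body>_<!--S1-->&_' "$tmp")   # pattern is present in the sample
r2=$(try_sub 's_</nosuchtag>_X_' "$tmp")       # pattern is absent from the sample
rm -f "$tmp"

printf '%s\n' "$r1" "$r2"
```

Running the same helper against the real `$LOC.html` for each of the three expressions would show which ones never match, as opposed to matching and being undone by a later expression.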

Last edited by Adolf1994; 08-05-2012 at 06:01 PM..
# 2  
Old 08-05-2012
I don't think busybox sed supports -e very well. It has a reasonably decent awk I think.

Could you explain what you're actually trying to do here? Maybe there's a more direct way.
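
Whether a given sed handles several -e expressions can be checked directly with a trivial input. This is a sketch to run on the device itself; the input string is arbitrary, and the second form (one script with `;` separators) is shown only as a comparison.

```shell
# Three expressions passed with separate -e options.
multi_e=$(printf 'a b c\n' | sed -e 's/a/1/' -e 's/b/2/' -e 's/c/3/')

# The same three commands in a single script, separated by semicolons.
one_script=$(printf 'a b c\n' | sed 's/a/1/;s/b/2/;s/c/3/')

printf '%s\n' "$multi_e" "$one_script"
```

If both lines read `1 2 3`, the -e handling itself is probably not the culprit.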
# 3  
Old 08-05-2012
I haven't noticed that it would have problems with -e
As you can see, there are plenty of them and most of them work, except for these three.
I haven't looked into anything but sed so far.

The first one is supposed to get rid of markup that isn't really useful once the targeted HTML page has been modified for offline use. However, this one sometimes works when the HTML page is really small, i.e. around or below 100 kB. I thought it might be a buffer-size problem, but after some research I found out that GNU sed has no limit.
The other two are supposed to inject some JavaScript and CSS to add some really handy features to the HTML.
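
The size dependence described above can be tested separately from the real page by feeding sed one very long synthetic line. This is a sketch under assumptions: the filler markup and the roughly 260 kB length are arbitrary, and `<!--mark-->` is a placeholder for the injected snippet.

```shell
# Build a single line of about 260 kB that ends in </body>, run one
# substitution over it, and count the lines where the marker was inserted.
hits=$(awk 'BEGIN { for (i = 0; i < 20000; i++) printf "<p>filler</p>"; print "</body>" }' |
    sed 's_</body>_<!--mark-->&_' |
    grep -c '<!--mark--></body>')

echo "lines with marker: $hits"
```

A count of 1 means the substitution survived the long line; a count of 0 on the device would point at a line-length or memory limit in that particular sed rather than at the expression.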

Edit: on second thought, I have no idea why I included the third command with the S2 variable, because that one works well.

And thanks for the quick reply.
# 4  
Old 08-05-2012
Quote:
Originally Posted by Adolf1994
I haven't noticed that it would have problems with -e
busybox is profoundly not the standard Linux GNU utilities. It can do a surprising amount, but corners had to be cut to fit that much functionality into one executable. I wouldn't be surprised if its sed -e was imperfect. What surprises me is that it exists at all.

If you're doing 10,000 greps and seds on the same file, it might be time to consider a language like awk or perl. I bet you don't have perl on that thing, though.
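
As a rough idea of what the awk route could look like for the two injection steps: the sketch below inserts a literal snippet before a literal marker using index() and substr(), so neither string is interpreted as a regex or as a replacement with `&`. The function name `insert_before` and the variables `MARKER` and `SNIPPET` are made up for this example, and it assumes the awk in question provides ENVIRON, which has not been verified on this BusyBox build.

```shell
# insert_before MARKER TEXT: read stdin and insert TEXT before the first
# occurrence of MARKER on each line. Both strings are passed through the
# environment so awk does not process backslash escapes in them.
insert_before() {
    MARKER="$1" SNIPPET="$2" awk '
        {
            pos = index($0, ENVIRON["MARKER"])
            if (pos > 0)
                $0 = substr($0, 1, pos - 1) ENVIRON["SNIPPET"] substr($0, pos)
            print
        }'
}

out=$(printf '%s\n' '<html><body>hi</body></html>' |
    insert_before '</body>' '<script>a&&b;</script>')

printf '%s\n' "$out"
```

With this approach the `\&` escaping used in S1 and S2 for sed would no longer be needed, since `&` has no special meaning in plain string concatenation.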

Last edited by Corona688; 08-05-2012 at 07:06 PM..
# 5  
Old 08-05-2012
Hmmm, afaik there's a thing called super sed that doesn't have any dependencies. Maybe if I try and compile that with an Android NDK toolchain?

First I'll try splitting up that big pile of sed commands.

---------- Post updated at 12:23 AM ---------- Previous update was at 12:11 AM ----------

Ok, dividing the pile didn't help.
So, how's that awk again? Available options are -v, -F and -f. I like how you can set a variable there. It's a bit like JavaScript, imo.

---------- Post updated at 02:24 AM ---------- Previous update was at 12:23 AM ----------

Good news. The awk in the busybox seems to be gawk, because gensub worked with it. I'll try to mess around with this until it works.
# 6  
Old 08-06-2012
It is not gawk. gawk is much larger than busybox's awk implementation. However, one of Busybox's goals is to emulate GNU behavior in the features it implements.

Regards,
Alister
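
Rather than inferring the implementation from one function, a script can probe for the feature it needs. A minimal sketch, assuming only that a failed or missing gensub() makes awk exit non-zero; the variable names are illustrative.

```shell
# Probe: does this awk provide gensub()? Any error output is discarded
# and simply counts as "no".
if awk 'BEGIN { if (gensub(/a/, "b", "g", "aaa") == "bbb") exit 0; exit 1 }' 2>/dev/null; then
    has_gensub=yes
else
    has_gensub=no
fi

# gsub() is part of POSIX awk, so it works as a fallback when the
# replacement does not need backreferences.
portable=$(awk 'BEGIN { s = "aaa"; gsub(/a/, "b", s); print s }')

echo "gensub available: $has_gensub"
echo "gsub result: $portable"
```

A "yes" here says only that gensub() exists in that awk, not that the awk is gawk.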
# 7  
Old 08-06-2012
Quote:
Originally Posted by Adolf1994
Good news. The awk in the busybox seems to be gawk, because gensub worked with it.
No, no it is not. I refer you to my earlier post:
Quote:
Originally Posted by Corona688
busybox is profoundly not the standard Linux GNU utilities. It can do a surprising amount, but corners had to be cut to fit that much functionality into one executable.
It may have gsub, but doesn't have functions.