Grep text matching problem with script which checks if web page contains text.


 
# 1  
Old 06-06-2013

I wrote a Bash script which checks whether a text string (e.g. "Out of stock") is present on a web page and then sends me an email if it is, or if it is not. I run it from my crontab; it's quite handy from time to time and I've been using it for a few years now.

The script uses wget to download a URL and then uses grep to match the text string, which I lift from the original HTML in case of markup or newlines (though the latter has never actually occurred).

Today I added a new text string to look for and it did not get matched, even though it was present in the HTML. When I copied and pasted the identical-looking text from the downloaded HTML file into the script and tried that, it worked perfectly. [Problem solved in this case, but I'd like to fix things properly.]

So the character encoding seems to be the problem. Or so I thought! Grep uses utf-8 but it turned out the source HTML was utf-8 as well. What a pain, so no easy fix by using iconv to convert all downloaded files to utf-8.

Anyone know what might be happening here and what I need to do to fix this?

Many thanks.
# 2  
Old 06-07-2013
It might be that there is some special character involved? Have you tried using grep -F?

Otherwise, could you provide a sample of what goes wrong?
This User Gave Thanks to Scrutinizer For This Post:
# 3  
Old 06-07-2013
Thanks for the suggestion but using -F with grep did not work.

However I have diagnosed the problem...

The web page in question (URL below) is being searched for the string "Currently out of stock". I had a look at the HTML in a hex editor and discovered that the first space (between 'Currently' and 'out') was not 0x20 but a pair of bytes: 0xC2 0xA0. A web search revealed that this is the UTF-8 encoding of a non-breaking space (U+00A0), which is used as a typesetting aid (in compatible standards such as HTML) to prevent an automatic line break. For instance, it might be used instead of the space in the string "100 KM" to make certain that "KM" does not get pushed onto the line below by the HTML renderer. The HTML entity is "&nbsp;", so "100&nbsp;KM" could be used in the HTML.
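For anyone without a hex editor to hand, the same check can be done from the shell. This is only a rough sketch: show_bytes is a made-up helper name, and the sample string is typed in by hand rather than taken from the real page.

```shell
#!/bin/bash
# Hypothetical helper: print the bytes of stdin as hex pairs on one line.
show_bytes() {
    od -An -tx1 | tr -s ' \n' ' '
}

# "Currently", then a UTF-8 non-breaking space, then "out".
# \302\240 is the octal spelling of the bytes 0xC2 0xA0.
printf 'Currently\302\240out' | show_bytes
echo
```

In the output the pair "c2 a0" sits where a plain space would have shown up as "20".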

An imperfect fix involves using '.' (any char match) in my search string. So the following regex works: "Currently.out.of.stock". The single '.' matches the non-breaking space of 0xC2 0xA0 between 'Currently' and 'out'.

However, neither [[:space:]] nor [[:blank:]] matches the non-breaking space.
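One way around the character classes altogether is to turn the non-breaking spaces into ordinary spaces before grep sees the text. A minimal sketch, assuming the page is UTF-8 as it was here; contains_phrase is a hypothetical name, not something from my actual script:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the text on stdin contains the phrase in $1,
# treating a UTF-8 non-breaking space (bytes 0xC2 0xA0) like a normal space.
contains_phrase() {
    local nbsp
    nbsp=$(printf '\302\240')   # octal spelling of 0xC2 0xA0
    # LC_ALL=C makes sed and grep work on raw bytes, so the result does not
    # depend on whatever locale the caller happens to have.
    LC_ALL=C sed "s/$nbsp/ /g" | LC_ALL=C grep -qF -- "$1"
}

# Usage: the sample input has a non-breaking space after "Currently".
if printf 'Currently\302\240out of stock\n' | contains_phrase 'Currently out of stock'; then
    echo "matched"
fi
```

Because the search string is matched with -F after the substitution, it can be typed with plain spaces and no regex tricks are needed.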

The web page which has the non-breaking space:
SGP-CV5 | Xperia Tablet Z accessories | Technical Specifications | SGPCV5/B.AE | SGPCV5 | Sony

Non-breaking space on Wikipedia:
Non-breaking space - Wikipedia, the free encyclopedia
# 4  
Old 06-07-2013
Use [^[:ascii:]] instead!
To match a space or a non-breaking space, try [^[:graph:]].

Last edited by MadeInGermany; 06-07-2013 at 04:38 PM..
This User Gave Thanks to MadeInGermany For This Post:
# 5  
Old 06-09-2013
Good idea, [[:print:]] works too.

The problem now is that all my tests were run from a terminal; when the script is run from crontab the regexes stop working (though everything else works). I tried setting my $PATH in both the script and my crontab, as well as setting $SHELL to bash in the crontab, and even using the absolute path to grep in the script. No joy: the regexes still did not work. Finally, in desperation, I pasted my entire 'env' variable list into the top of my crontab, and the grep regex finally worked when the script was called by crontab. Can anyone think which of my 39 environment variables might be the key one making the difference? I can see why $SHELL and $PATH might matter, but on their own they did not fix the issue, so there must be something else as well.
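A quicker way to chase this kind of thing than editing the crontab over and over is to run the command from a terminal with a stripped-down environment. This is a sketch only: run_like_cron is a made-up name, and the variables passed through are an assumption, since the exact set depends on the cron implementation.

```shell
#!/bin/bash
# Hypothetical helper: run a command with an almost empty environment,
# roughly the way cron would. Only HOME, PATH and SHELL are passed through
# (an assumption; real cron implementations differ in the details).
run_like_cron() {
    env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin SHELL=/bin/sh "$@"
}

# Show whether a given variable survives; LANG is not passed through here.
run_like_cron sh -c 'echo "LANG=${LANG:-unset}"'
```

Variables can then be added back one at a time on the env command line until the behaviour changes.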

Here's the sorted output of env (I've trimmed the 2 long lines of $LS_COLORS and $PS1):

Code:
$ env | sort
COLORTERM=gnome-terminal
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-igzDw7xhFQ,guid=69aae5cf956c630c3ef4d55151b43fb4
DEFAULTS_PATH=/usr/share/gconf/gnome.default.path
DESKTOP_SESSION=gnome
DISPLAY=:0.0
GDM_KEYBOARD_LAYOUT=gb
GDM_LANG=en_GB.UTF-8
GDMSESSION=gnome
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
GNOME_KEYRING_CONTROL=/tmp/keyring-0Mlwje
GNOME_KEYRING_PID=2114
GTK_MODULES=canberra-gtk-module
HOME=/home/ms
LANG=en_GB.UTF-8
LESSCLOSE=/usr/bin/lesspipe %s %s
LESSOPEN=| /usr/bin/lesspipe %s
LOGNAME=ms
LS_COLORS=rs=0:di=01;34:ln=01;...
MANDATORY_PATH=/usr/share/gconf/gnome.mandatory.path
ORBIT_SOCKETDIR=/tmp/orbit-ms
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/ms/Scripts
PS1=\[\033[01;31m\]\h\[\033[00m\]...
PWD=/home/ms
SESSION_MANAGER=local/ubuntupc:@/tmp/.ICE-unix/2132,unix/ubuntupc:/tmp/.ICE-unix/2132
SHELL=/bin/bash
SHLVL=1
SPEECHD_PORT=7560
SSH_AGENT_PID=2166
SSH_AUTH_SOCK=/tmp/keyring-0Mlwje/ssh
TERM=xterm
TMPDIR=/tmp/
USER=ms
USERNAME=ms
_=/usr/bin/env
WINDOWID=67108869
XAUTHORITY=/var/run/gdm/auth-for-ms-TrT4XY/database
XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
XDG_DATA_DIRS=/usr/share/gnome:/usr/local/share/:/usr/share/
XDG_SESSION_COOKIE=caa8612e0ce52df979b3de354c360d7c-1370767284.97021-283724276

Thanks all.
# 6  
Old 06-09-2013
LANG sets the locale, which has an impact on character sets like [a-z] or [[:print:]].
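So the script can simply set it itself instead of relying on the terminal. A minimal sketch, assuming the en_GB.UTF-8 value from the env listing above is installed on the machine (locale -a shows what is available):

```shell
#!/bin/bash
# Assumption: en_GB.UTF-8 (the value seen in the interactive session) is an
# installed locale; substitute whatever `locale -a` lists on your system.
LANG=en_GB.UTF-8
export LANG

# Everything started after this point, including grep, inherits the setting.
echo "LANG=$LANG"
```

Putting the same LANG=en_GB.UTF-8 assignment on a line of its own at the top of the crontab should have the same effect, since pasting the full env list there is what made the regex work.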
This User Gave Thanks to MadeInGermany For This Post:
# 7  
Old 06-10-2013
That's it!! All is now working. Well done, it never occurred to me that LANG would have an impact on grep.

Thanks so much.