Quote:
Originally Posted by bakunin
You might want to use the :print: character class the POSIX BRE regexp provide and negate that: [^:print:] and see how far that gets you.
Basically there is no pattern for what constitutes a printable or non-printable character: character "\9", which is a TAB just is that by convention, not because it is - in principle - any different from "\10" or "\8".
You might also want to identify your locale, which may establish so-called collating sequences. See more about this at this page.
I hope this helps.
bakunin
I don't know of an RE context where \9 would represent a tab (although \0x9 and \011 would when using an ASCII based character set).
The print character class is identified by [:print:]. A BRE matching a character in the print class is [[:print:]] and a BRE matching a character that is not in the print class is [^[:print:]]. The BRE [^:print:] would match any character other than :, i, n, p, r, and t.
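To make the difference concrete, here is a minimal sketch in Python. Note the assumptions: Python's re module is not a POSIX BRE engine and has no [[:print:]], so the print class is emulated with the range 0x20 through 0x7E, which is what it covers in the C/POSIX locale with an ASCII-based codeset. The sample string is made up for illustration.

```python
import re

# As in a POSIX BRE, this is just a negated set of the six literal
# characters ':', 'p', 'r', 'i', 'n', 't'. It is NOT a character class.
literal_set = re.compile(r"[^:print:]")

# Assumption: stand-in for [^[:print:]] in the C/POSIX locale, where the
# print class is space (0x20) through tilde (0x7E).
not_printable = re.compile(r"[^\x20-\x7e]")

# Hypothetical sample: printable text, a tab, and a control byte.
sample = "print:\tdata\x01"

wrong = literal_set.findall(sample)    # also catches 'd' and 'a'
right = not_printable.findall(sample)  # only the tab and the 0x01 byte

print(wrong)
print(right)
```

The first pattern flags ordinary letters such as d and a simply because they are not in the six-character set, while the second flags only the tab and the control character.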
Note also that the print class does not include control characters. To select a UTF-8 character that is not in the 7-bit ASCII character set (more precisely, to select each byte of such a character), you could use the BRE [^[:cntrl:][:print:]] while in the C or POSIX locale.
But, when working with ASCII, UTF-8, and 8859-* character sets, just filtering out bytes with the high order bit set should be sufficient.
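As a rough sketch of why the two approaches agree, the following Python snippet works on raw bytes. The class boundaries are an assumption for the C/POSIX locale with an ASCII-based codeset (cntrl is 0x00-0x1F plus 0x7F, print is 0x20-0x7E), and the function names and sample data are hypothetical.

```python
# Assumption: class membership in the C/POSIX locale, ASCII-based codeset.
CNTRL = set(range(0x00, 0x20)) | {0x7F}
PRINT = set(range(0x20, 0x7F))

# Bytes matched by [^[:cntrl:][:print:]] under that assumption.
outside_both = set(range(256)) - CNTRL - PRINT


def high_bit_bytes(data: bytes) -> bytes:
    """Return only the bytes whose high-order bit is set."""
    return bytes(b for b in data if b & 0x80)


def strip_high_bit_bytes(data: bytes) -> bytes:
    """Return the data with every high-bit byte filtered out."""
    return bytes(b for b in data if not b & 0x80)


# Hypothetical sample: one non-ASCII character (two bytes in UTF-8).
data = "caf\u00e9\tok\n".encode("utf-8")

print(high_bit_bytes(data))
print(strip_high_bit_bytes(data))
```

Under those assumptions the complement of the two classes is exactly the byte range 0x80 through 0xFF, so testing the high-order bit selects the same bytes without needing any locale-dependent class.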