Unable to identify the special characters beyond the range of "[\x80-\xFF]"


 
# 1  
Old 10-14-2015

I want to filter out the special characters whose ASCII value doesn't fall within the range "[\x80-\xFF]".

Example: � or Ć. Is there a defined range that will match characters like these?

I can already filter the characters that fall within "[\x80-\xFF]". I need to filter the special characters that don't fall within the "[\x80-\xFF]" range.

Last edited by zaxxon; 10-15-2015 at 09:23 AM.. Reason: no need to put post into html-tags
# 2  
Old 10-14-2015
Quote:
Originally Posted by Abhijit Sen
I want to filter out the special character whose ascii value doesn't fall within the range "[\x80-\xFF]"
You might want to use the :print: character class that POSIX BRE regexps provide and negate it: [^:print:], and see how far that gets you.

Basically there is no pattern for what constitutes a printable or non-printable character: character "\9", which is a TAB just is that by convention, not because it is - in principle - any different from "\10" or "\8".

You might also want to identify your locale, which may establish so-called collating sequences. See more about this at this page.

I hope this helps.

bakunin

Last edited by bakunin; 10-14-2015 at 11:03 AM..
# 3  
Old 10-14-2015
Quote:
Originally Posted by bakunin
You might want to use the :print: character class that POSIX BRE regexps provide and negate it: [^:print:], and see how far that gets you.

Basically there is no pattern for what constitutes a printable or non-printable character: character "\9", which is a TAB just is that by convention, not because it is - in principle - any different from "\10" or "\8".

You might also want to identify your locale, which may establish so-called collating sequences. See more about this at this page.

I hope this helps.

bakunin
I don't know of an RE context where \9 would represent a tab (although \0x9 and \011 would when using an ASCII based character set).

The print character class is identified by [:print:]. A BRE matching a character in the print class is [[:print:]] and a BRE matching a character that is not in the print class is [^[:print:]]. The BRE [^:print:] would match any character other than :, i, n, p, r, and t.

Note also that the print class does not include control characters. To select a UTF-8 character that is not a character in the 7-bit ASCII character set (actually, to select each byte of one of those characters), you could use the BRE [^[:cntrl:][:print:]] while in the C or POSIX locale.

But, when working with ASCII, UTF-8, and 8859-* character sets, just filtering out bytes with the high order bit set should be sufficient.
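As a minimal sketch of the "high-order bit" approach above, the following Python snippet scans raw bytes and reports every byte in the range 0x80-0xFF together with its offset. Python is used only for illustration; the function name `high_bit_bytes` and the sample data are made up. Because it works on bytes rather than decoded characters, the result does not depend on the locale.

```python
import re

# Matches any single byte with the high-order bit set (0x80-0xFF).
HIGH_BIT = re.compile(rb"[\x80-\xff]")


def high_bit_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte >= 0x80 in data."""
    return [(m.start(), m.group()[0]) for m in HIGH_BIT.finditer(data)]


# Hypothetical sample: "a", then U+0106 (Ć) encoded as UTF-8, then "b".
sample = b"a\xc4\x86b"
for offset, value in high_bit_bytes(sample):
    print(f"offset {offset}: 0x{value:02X}")
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.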

Last edited by Don Cragun; 10-14-2015 at 03:45 PM.. Reason: Add notes.
# 4  
Old 10-14-2015
If you're trying to identify characters greater than 0xff, those aren't characters, they will be encoded in some way and ordinary regular expressions won't match them. Some GNU tools may have extended features to support them.
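To illustrate the point that such characters are encoded rather than stored as one large value, here is a small Python sketch (Python chosen only for illustration; the helper name `hex_bytes` is hypothetical). It prints the bytes that Ć (U+0106) and the replacement character (U+FFFD) from the first post turn into under UTF-8, and the single byte that Ć becomes under ISO 8859-2. In each case every resulting byte lies inside 0x80-0xFF, which is why a byte-oriented range can see the pieces but not the character itself.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


print(hex_bytes("\u0106", "utf-8"))       # Ć as UTF-8
print(hex_bytes("\ufffd", "utf-8"))       # U+FFFD as UTF-8
print(hex_bytes("\u0106", "iso-8859-2"))  # Ć as ISO 8859-2 (Latin-2)

# ISO 8859-1 has no Ć at all, so encoding fails there.
try:
    "\u0106".encode("iso-8859-1")
except UnicodeEncodeError:
    print("not representable in ISO 8859-1")
```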
# 5  
Old 10-14-2015
I know this is not about Python per se, but there are regex tools for extended character sets, Unicode being one of those sets:

regex - matching unicode characters in python regular expressions - Stack Overflow

UNIX in general is not Unicode-centric, so Corona's answer pretty much stands for most regex engines.

The PCRE supports a lot of encoded charsets. You can download it here:
PCRE - Browse /pcre/8.30 at SourceForge.net
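A minimal sketch of the Unicode-aware approach, assuming the input has already been decoded to text (the sample string here is hypothetical and typed in directly, and the name `chars_above_ff` is made up). Python's `re` module works on code points when given `str` input, so a negated range can select every character above U+00FF.

```python
import re

# On str input, re works on code points, so this matches any
# character whose code point is above U+00FF.
ABOVE_FF = re.compile(r"[^\x00-\xff]")


def chars_above_ff(text: str) -> list[str]:
    """Return each character above U+00FF as 'U+XXXX'."""
    return [f"U+{ord(ch):04X}" for ch in ABOVE_FF.findall(text)]


# Hypothetical sample: é (U+00E9) is inside the range, Ć and U+FFFD are not.
print(chars_above_ff("caf\u00e9 \u0106 \ufffd"))
```

This only works if the text was decoded with the right encoding in the first place, which is why the encoding of the file matters.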
# 6  
Old 10-15-2015
Question

I have a file that contains special characters that lie within "[\x80-\xFF]" as well as special characters beyond this range.

What can be done to identify all of them by their byte values?

Moderator's Comments:
Mod Comment edit by bakunin: please leave out the HTML-tags. You don't need them

Last edited by bakunin; 10-15-2015 at 09:57 AM..
# 7  
Old 10-15-2015
To repeat:

Quote:
Originally Posted by Corona688
those aren't characters, they will be encoded in some way
How is your file encoded?
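
One rough way to approach that question, sketched in Python under the assumption that the only candidates are plain ASCII, UTF-8, and some single-byte encoding such as ISO 8859-*. The function name `guess_encoding` is hypothetical, and this is a heuristic, not a reliable detector.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristically classify data as ASCII, UTF-8, or something else."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Bytes >= 0x80 that are not valid UTF-8: probably a single-byte
        # encoding, but which one cannot be told from the bytes alone.
        return "unknown 8-bit"


print(guess_encoding(b"plain"))       # all bytes below 0x80
print(guess_encoding(b"\xc4\x86"))    # valid UTF-8 sequence for U+0106
print(guess_encoding(b"\xc6 alone"))  # 0xC6 not followed by a continuation byte
```

Short single-byte inputs can happen to be valid UTF-8 by accident, so a "utf-8" answer is a hint rather than proof.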