Unable to identify the special characters beyond the range of "[\x80-\xFF]"

10-16-2015

My file contains �, which is a UTF-16 character and is not matched by grep -P -n "[\x80-\xFF]".

The same file also contains �, which is a UTF-8 character and is matched by grep -P -n "[\x80-\xFF]".

One possibility is to convert all the UTF-16 characters to UTF-8 if no regular expression can match them, but converting all of them will take time.

Also, I am not sure how the alignment will be adjusted when UTF-16 (2-byte characters) is converted to UTF-8 (1 byte).

I am looking for a way to find them.
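One possible explanation for the behaviour described above, sketched in Python: a byte-level search for the range 0x80-0xFF only sees individual bytes, and a character stored as UTF-16 may consist entirely of bytes below 0x80. The sample character U+0142 and the helper name has_high_byte are assumptions chosen for illustration; the actual characters in the file are unknown.

```python
import re

# Byte-level pattern: any single byte in the range 0x80-0xFF.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def has_high_byte(data: bytes) -> bool:
    """Return True if any byte in data falls in 0x80-0xFF."""
    return HIGH_BYTE.search(data) is not None


sample = "\u0142"  # hypothetical sample character (U+0142)

utf8_bytes = sample.encode("utf-8")       # bytes c5 82
utf16_bytes = sample.encode("utf-16-le")  # bytes 42 01

print(utf8_bytes.hex(), has_high_byte(utf8_bytes))
print(utf16_bytes.hex(), has_high_byte(utf16_bytes))
```

In this sketch the UTF-8 form is matched because both of its bytes are above 0x7F, while the UTF-16 form is missed because neither of its bytes is.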

Moderator's Comments:

CODE tags are to be used when displaying sample code segments, sample input, and sample output. HTML tags are to be used when displaying HTML code. Plain text describing requirements should not be tagged.

Last edited by Don Cragun; 10-16-2015 at 05:15 AM.. Reason: Get rid of HTML tags, add ICODE tags.

Abhijit Sen

10-16-2015

UTF-8 and UTF-16 are character encodings that apply to an entire file, not to a single character. So chances are minimal that a UTF-16 � and a UTF-8 � are present in the same text, at least if it was prepared by a reasonable application.

What is the output of file your_input_file?
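If the file command is not available, a rough guess at the encoding can be sketched in Python. This is only a heuristic under stated assumptions: it looks for a byte order mark and then tries strict decoding; it is not how file itself works, it ignores UTF-32, and the function name guess_encoding is made up for this example.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Heuristic guess at the encoding of data (assumption: no UTF-32 input)."""
    # A byte order mark at the start is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: try the strict decoders from narrowest to widest.
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown"


print(guess_encoding("abc".encode("utf-16")))  # Python's utf-16 codec writes a BOM
print(guess_encoding("na\u00efve".encode("utf-8")))
```

A result of "unknown" usually means a single-byte legacy encoding or mixed content, which has to be settled by inspecting the bytes.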

RudiC

10-16-2015

Quote:

Originally Posted by Abhijit Sen

Also, I am not sure how the alignment will be adjusted when UTF-16 (2-byte characters) is converted to UTF-8 (1 byte).

UTF-8 is not a 1-byte encoding; it uses 1 to 4 bytes depending on the character being encoded, which is why character replacement can't catch everything.

If your file is UTF-16, it ought to look very strange to normal UNIX programs because of the extra NUL bytes between most characters. Try hexdump -C to see what your file actually looks like.
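To make the byte counts concrete, here is a small Python sketch that prints the encoded bytes of a few sample characters in both encodings. The sample characters and the helper name encoded_hex are assumptions chosen for illustration.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as a hex string."""
    return ch.encode(encoding).hex()


# Sample characters: ASCII "a", U+00E9 and U+20AC.
for ch in ["a", "\u00e9", "\u20ac"]:
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", encoded_hex(ch, "utf-8"),
        "utf-16-be:", encoded_hex(ch, "utf-16-be"),
    )
```

The UTF-8 column grows from one to three bytes, while the UTF-16 column stays at two bytes here, with a leading 00 byte for every character below U+0100. Those 00 bytes are the NULs that show up in a hexdump.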

Corona688

10-26-2015

The file contains both UTF-8 (�) and (�) characters, which cannot be identified using the range "[\x80-\xFF]".

I need to filter all of these special characters ("[\x80-\xFF]") and other special characters as well.
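One way to list every non-ASCII character, assuming the file can first be decoded to text with its real encoding, is to search the decoded characters rather than the raw bytes. The sketch below reports line number, column and character, roughly like grep -n; the name find_non_ascii is an assumption for this example.

```python
import re

# Matches any character outside the 7-bit ASCII range, after decoding.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text: str) -> list:
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


# The input must already be decoded, e.g. data.decode("utf-16") or "utf-8".
sample = "abc\nna\u00efve \u0142\n"
for line_no, column, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {column}: U+{ord(ch):04X}")
```

Because the search runs on characters, the same pattern works whether the file was originally UTF-8 or UTF-16.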

Abhijit Sen

10-26-2015

It appears that you need to alter your locale while processing the file.

If you have UTF-8 characters (and I gather you are sure you do), then the tools need a UTF-8 locale to "see" them.
UNIX tools like awk understand locale settings.

The iconv tool can convert files from one encoding to another, so this is another approach:
convert the file to match the encoding of the locale you are running in.
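For comparison, the same kind of conversion can be sketched in Python: decode the bytes with the source encoding, then encode them with the target one. The function names transcode and transcode_file, and the default encodings, are assumptions for this example, not part of iconv.

```python
def transcode(data: bytes, source: str = "utf-16", target: str = "utf-8") -> bytes:
    """Decode data from the source encoding and re-encode it in the target one."""
    return data.decode(source).encode(target)


def transcode_file(src_path: str, dst_path: str,
                   source: str = "utf-16", target: str = "utf-8") -> None:
    """Convert a whole file; binary mode keeps line endings untouched."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(transcode(data, source, target))


utf16_data = "a\u0142\n".encode("utf-16")  # includes a byte order mark
print(transcode(utf16_data).hex())
```

Decoding raises UnicodeDecodeError on bytes that are invalid in the source encoding, which is a useful signal that the file is not what it was assumed to be.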

jim mcnamara

11-03-2015

Hi All,

I am able to convert the UTF-16 characters to UTF-8 characters.

But because UTF-16 uses 2-byte values, the file alignment changes when the file is converted (using iconv). Is there any way to fix the alignment problem?

My file contains only UTF-8 values, which take 1 byte each, and each line can hold only a limited number of UTF-8 characters. During conversion, a few of the UTF-8 characters get pushed to the next line, which alters the file alignment.

I have tried to use the recode command, but that is not working. Any help would be greatly appreciated.
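One way to look at the alignment problem, as a hedged sketch: the number of characters in a line does not change with the encoding, only the number of bytes does. If the record layout is defined in characters, padding the decoded text keeps columns aligned. The helper names widths and pad_record are assumptions, and the sketch ignores combining marks and double-width characters.

```python
def widths(line: str) -> dict:
    """Compare the length of a line in characters and in encoded bytes."""
    return {
        "chars": len(line),
        "utf8_bytes": len(line.encode("utf-8")),
        "utf16_bytes": len(line.encode("utf-16-le")),
    }


def pad_record(line: str, width: int) -> str:
    """Pad a decoded line with spaces to a fixed width counted in characters."""
    if len(line) > width:
        raise ValueError(f"line has {len(line)} characters, limit is {width}")
    return line.ljust(width)


record = "ab\u0142"
print(widths(record))
print(repr(pad_record(record, 5)))
```

A layout that counts bytes instead of characters cannot stay fixed across encodings, so the consumer of the file has to agree on which unit the line limit refers to.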

Abhijit Sen

11-04-2015

Quote:

Originally Posted by Abhijit Sen

But because UTF-16 uses 2-byte values, the file alignment changes when the file is converted (using iconv). Is there any way to fix the alignment problem?

My file contains only UTF-8 values, which take 1 byte each, and each line can hold only a limited number of UTF-8 characters. During conversion, a few of the UTF-8 characters get pushed to the next line, which alters the file alignment.

In general, UNIX and its utilities are encoding-insensitive. That is: using sed (or awk, tr or similar text filters) you work on streams of bytes. In ASCII (and similar encodings) a "character" is a byte and a byte is a character. In other encodings this is not the case (as in UTF-16, where 2 or 4 bytes represent a character). But UNIX tools are not aware of this and treat each byte as if it represented a character.

Having said this: searching for a text like "abc" regardless of the encoding cannot be done with these text filters, because they will not recognize that two bytes containing (if memory serves correctly) the hex values "00:61" ("U+0061" in big-endian UTF-16) are the same letter "a" as a single byte with hex value "61" in ASCII.

Issuing grep "a" /some/file is basically telling grep to search for a byte containing the hex value 61, because this is how "a" is encoded in ASCII.

Somebody has mentioned that Python tools work differently and are encoding-aware. It seems you need to resort to those (or similarly encoding-aware tools) to get what you want. With UNIX tools you will not get past the aforementioned limitation, however cleverly you work around it. It will always be a less-than-genuine solution which is liable to break under some unexpected set of conditions.
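A minimal Python sketch of that difference, under the assumption that the encoding of the data is known: a raw byte search for "abc" misses the UTF-16 form, while decoding first finds it in either encoding. The helper name contains is made up for this example.

```python
def contains(data: bytes, needle: str, encoding: str) -> bool:
    """Decode data with the given encoding, then search for needle as text."""
    return needle in data.decode(encoding)


ascii_data = "abc".encode("ascii")      # bytes 61 62 63
utf16_data = "abc".encode("utf-16-be")  # bytes 00 61 00 62 00 63

# Byte-level search, which is what a byte-oriented filter effectively does.
print(b"abc" in ascii_data, b"abc" in utf16_data)

# Encoding-aware search: decode first, then compare characters.
print(contains(ascii_data, "abc", "ascii"), contains(utf16_data, "abc", "utf-16-be"))
```

The byte-level search fails on the UTF-16 data because a 00 byte sits between the letters; once the data is decoded, both inputs hold the same three characters.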

I hope this helps.

bakunin

bakunin
