Unable to identify the special characters beyond the range of "[\x80-\xFF]"


 
# 8  
Old 10-16-2015
My file contains Ä, which is a UTF-16 character that can't be identified by grep -P -n "[\x80-\xFF]".

The same file also contains Ã, which is a UTF-8 character and is identified by grep -P -n "[\x80-\xFF]".

One possibility is to convert all the UTF-16 characters to UTF-8 if no regular expression can match them, but converting all of them will take time.

Also, I am not sure how the alignment will be adjusted when UTF-16 (2-byte characters) is converted to UTF-8 (1 byte).

I am looking for a way to find them.

Last edited by Don Cragun; 10-16-2015 at 05:15 AM.. Reason: Get rid of HTML tags, add ICODE tags.
# 9  
Old 10-16-2015
UTF-8 and UTF-16 are character encodings that apply to an entire file, not to a single character. So chances are minimal that a UTF-16 Ä and a UTF-8 Ã are present in the same text, at least if it was prepared by a reasonable application.

What is the output of file your_input_file?
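
As a rough complement to the file command, the kind of check it performs can be sketched in Python. This is only a heuristic written for illustration, not what file actually does; the function name guess_encoding and the returned labels are assumptions made up for this sketch.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Rough guess at the encoding of raw file bytes (heuristic only)."""
    # A byte order mark, if present, is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # Pure 7-bit data is valid ASCII (and therefore also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # Strict decoding fails on byte sequences that are not well-formed UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown 8-bit"

print(guess_encoding("Ä".encode("utf-8")))
print(guess_encoding("Ä".encode("utf-16")))
print(guess_encoding(b"\xc4"))
```

To try it on a real file, read the file in binary mode and pass the bytes in. UTF-16 without a byte order mark is not detected by this sketch.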
# 10  
Old 10-16-2015
Quote:
Originally Posted by Abhijit Sen
Also, I am not sure how the alignment will be adjusted when UTF-16 (2-byte characters) is converted to UTF-8 (1 byte).
UTF-8 is not a 1-byte encoding; it uses 1 to 4 bytes depending on the character being encoded, which is why character replacement can't catch everything.
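
To make the variable length concrete, here is a minimal Python sketch that prints the bytes a few characters occupy in UTF-8 and in big-endian UTF-16. The helper name encoded_forms and the choice of sample characters are assumptions for illustration only.

```python
def encoded_forms(ch: str) -> dict:
    """Return the UTF-8 and UTF-16 (big-endian, no BOM) bytes of one character."""
    return {"utf-8": ch.encode("utf-8"), "utf-16-be": ch.encode("utf-16-be")}

for ch in ["a", "Ä", "Ã", "€"]:
    forms = encoded_forms(ch)
    print(
        f"U+{ord(ch):04X} {ch}"
        f"  UTF-8: {forms['utf-8'].hex(' ')} ({len(forms['utf-8'])} bytes)"
        f"  UTF-16BE: {forms['utf-16-be'].hex(' ')} ({len(forms['utf-16-be'])} bytes)"
    )
```

Note that both Ä and Ã take two bytes in UTF-8, and both of those bytes fall in the 0x80-0xFF range.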

If your file is UTF-16, it ought to look very strange to normal UNIX programs because of the extra NUL bytes between most characters. Try hexdump -C to see what your file actually looks like.
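
For readers without the file at hand, the effect can be reproduced with a small Python sketch that formats bytes roughly like one row of hexdump -C. The function hexdump_line and the sample text are hypothetical and exist only for this illustration.

```python
def hexdump_line(data: bytes) -> str:
    """Format bytes roughly the way one row of `hexdump -C` shows them."""
    hex_part = " ".join(f"{b:02x}" for b in data)
    # Printable ASCII is shown as-is, every other byte as a dot.
    text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in data)
    return f"{hex_part}  |{text_part}|"

sample = "Hi Ä"
print("UTF-8:   ", hexdump_line(sample.encode("utf-8")))
print("UTF-16LE:", hexdump_line(sample.encode("utf-16-le")))
```

In the UTF-16 row every ASCII letter is followed by a 00 byte, which is the pattern to look for in the real file.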
# 11  
Old 10-26-2015
The file contains both a UTF-8 character (Ã) and Ä, which cannot be identified using the range "[\x80-\xFF]".

I need to filter out all of these special characters ("[\x80-\xFF]") and other special characters as well.
# 12  
Old 10-26-2015
It appears that you need to alter your locale while trying to process the file.

If you have UTF-8 characters (and I gather you are sure you do), then you need a UTF-8 locale for the tools to "see" them.
UNIX tools like awk understand locale settings.

The iconv tool can convert files from one encoding to another, so this is another approach:
convert the file to match the encoding of the locale you now run.
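
As an illustration of what such a conversion does to the data, here is a minimal Python sketch, roughly comparable to iconv -f UTF-16 -t UTF-8. It assumes the input really is UTF-16 with a byte order mark; the function name utf16_to_utf8 and the two-line sample are made up for this example.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8."""
    # Decoding with "utf-16" honours a leading byte order mark and removes it.
    text = data.decode("utf-16")
    return text.encode("utf-8")

sample = "Ä1\nÃ2\n".encode("utf-16")  # hypothetical two-line input with a BOM
converted = utf16_to_utf8(sample)

print(len(sample), "bytes in,", len(converted), "bytes out")
print(converted.decode("utf-8").split("\n"))
```

The number of lines and the number of characters per line stay the same; only the number of bytes per character changes.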
# 13  
Old 11-03-2015
Hi All,

I am able to convert the UTF-16 characters to UTF-8 characters.

But when the file is converted (using iconv), the file alignment changes, because UTF-16 uses 2-byte values. Is there any way to fix this alignment problem?

My file contains only UTF-8 values, which take 1 byte each, and each line can hold only a limited number of UTF-8 characters. But during conversion a few of the UTF-8 characters are pushed to the next line, which alters the file alignment.
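
If the alignment in question is a per-line width limit measured in bytes (an assumption; the post does not say), the following Python sketch shows why a line that fits by character count can overflow by byte count after conversion. The helper record_sizes and the sample line are hypothetical.

```python
def record_sizes(line: str) -> tuple:
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for one line of text."""
    return len(line), len(line.encode("utf-8")), len(line.encode("utf-16-le"))

line = "ÄÃabc"
chars, utf8_bytes, utf16_bytes = record_sizes(line)
print(f"{chars} characters, {utf8_bytes} bytes in UTF-8, {utf16_bytes} bytes in UTF-16")

# Slicing the decoded text counts characters, so no character is cut in half.
print(line[:3])
```

A tool that wraps or pads by bytes will treat this 5-character line as 7 or 10 units wide, depending on the encoding.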

I have tried the recode command, but that is not working. Any help would be greatly appreciated.
# 14  
Old 11-04-2015
Quote:
Originally Posted by Abhijit Sen
But when the file is converted (using iconv), the file alignment changes, because UTF-16 uses 2-byte values. Is there any way to fix this alignment problem?

My file contains only UTF-8 values, which take 1 byte each, and each line can hold only a limited number of UTF-8 characters. But during conversion a few of the UTF-8 characters are pushed to the next line, which alters the file alignment.
In general, UNIX and its utilities are encoding-insensitive. That is: using sed (or awk, tr or similar text filters) you work on streams of bytes. In ASCII (and similar encodings) a "character" is a byte and a byte is a character. In other encodings this is not the case (as in UTF-16, where 2 or 4 bytes represent a character). But UNIX tools are traditionally not aware of this and treat each byte as if it represented a character.

Having said this: searching for a text like "abc" regardless of the encoding cannot be done with these text filters, because they will not recognize that two bytes containing (if memory serves correctly) the hex values "00:61" ("U+0061" in big-endian UTF-16) are the same letter "a" as a single byte with hex value "61" in ASCII.

Issuing grep "a" /some/file is basically telling grep to search for a byte containing the hex value 61, because that is how "a" is encoded in ASCII.
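
The point about byte-oriented searching can be demonstrated with a short Python sketch that looks for the ASCII bytes of "abc" in the same text stored in two encodings. The variable names and the sample text are assumptions chosen for illustration.

```python
needle = b"abc"  # the bytes 61 62 63, which is what a byte-oriented search looks for

ascii_data = "xx abc xx".encode("ascii")
utf16_data = "xx abc xx".encode("utf-16-be")  # 00 61 00 62 00 63 for "abc"

found_in_ascii = needle in ascii_data
found_in_utf16 = needle in utf16_data
# Encoding the search term the same way as the data makes the match work again.
found_with_utf16_needle = "abc".encode("utf-16-be") in utf16_data

print(found_in_ascii, found_in_utf16, found_with_utf16_needle)
```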

Somebody has mentioned that Python tools work differently and are encoding-aware. It seems you need to resort to those (or similarly encoding-aware tools) to get what you want. With UNIX tools you will not get over the aforementioned limitation, however cleverly you work around it. It will always be a less-than-genuine solution which is bound to break under some unexpected set of conditions.
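
A minimal sketch of what such an encoding-aware approach could look like in Python follows. It decodes the data first and then reports every non-ASCII character with its line and column, similar in spirit to grep -n. The function name find_non_ascii and the sample data are hypothetical, and the encoding has to be known (or guessed) beforehand.

```python
import unicodedata

def find_non_ascii(data: bytes, encoding: str = "utf-8"):
    """Yield (line, column, character, Unicode name) for each non-ASCII character."""
    text = data.decode(encoding)
    # Split on "\n" only, so unusual separators such as U+0085 are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")

sample = "plain\ncafé Ä\n".encode("utf-8")
for hit in find_non_ascii(sample):
    print(hit)
```

Because the comparison is done on decoded characters rather than bytes, the same function works for a UTF-16 file by passing encoding="utf-16".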

I hope this helps.

bakunin