Convert UTF-8 encoded hex value to a character


 
# 8  
Old 10-29-2008
Hi cbkihong,

Thanks for the elaborate explanation.

1. My Perl version is 5.8.0, so UTF-8 support in Perl itself should not be an issue.

2. I tried the steps you mentioned (using binmode and the decode function). In my case, when I try
print "\x{0174}\n" (0174 being the hex code of the required character),
I see 'Å´' instead of 'Ŵ', with the warning in case 2 and without it in case 3. In case 4, when I use the decode() function, the output I get is blank.
Hence I suppose that my shell is not capable of displaying the character after decoding it.
In fact, even if I simply paste the character (Ŵ) into my terminal, it appears only as '.' (a dot).

Does this have to do with setting the locale for my environment properly? I tried a few of the available locales, but to no avail.

locale -a on my machine gives -->

POSIX
common
en_US.UTF-8
C
iso_8859_1
iso_8859_15
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.ISO8859-15@euro
en_IE
en_IE.ISO8859-1
en_IE.ISO8859-15
en_IE.ISO8859-15@euro
fr
fr.ISO8859-15
fr.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.ISO8859-15@euro
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.ISO8859-15@euro
fr_FR.UTF-8
fr_FR.UTF-8@euro
nl
nl.ISO8859-15
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.ISO8859-15@euro
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.ISO8859-15@euro
en_UK
nl_BE.UTF-8
nl_NL.UTF-8

Could you suggest a specific locale to use? In general, how should the locale be selected? Any links would be a great help.

Or is there something else that I should configure? The character I picked was chosen at random from the Unicode character set and is not related to any specific language.
# 9  
Old 10-29-2008
Simply put, your terminal is not yet properly configured for UTF-8.

I'm not sure about Solaris, but on Linux just check the values of LANG and LC_ALL, make sure they are something like en_US.UTF-8, and confirm that you have a font that can render those characters.
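A minimal sketch of that check, assuming a POSIX shell. The function names effective_ctype and is_utf8_locale are made up for illustration; the precedence (LC_ALL over LC_CTYPE over LANG) is the standard one.

```shell
# Resolve which variable decides the character type: LC_ALL wins over
# LC_CTYPE, which wins over LANG. If none is set, assume the C locale.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        printf '%s\n' "${LANG:-C}"
    fi
}

# Succeed if a locale name advertises UTF-8 (e.g. en_US.UTF-8, fr_FR.utf8).
is_utf8_locale() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

current=$(effective_ctype)
if is_utf8_locale "$current"; then
    echo "locale $current is UTF-8"
else
    echo "locale $current is not UTF-8; try: export LC_ALL=en_US.UTF-8"
fi
```

This only inspects the locale name. It says nothing about whether the terminal emulator or the font can actually show the character.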

What kind of terminal is it? A normal console or an X Window-based one? I'm not sure whether you can use Unicode with a normal console at all. For terminals like xterm, gnome-terminal or konsole, it should be possible with the relevant fonts installed.

One thing you can test: prepare a UTF-8 text file on another system and cat(1) it from a shell in your terminal. If cat does not render it correctly, go check your terminal.
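If no second system is at hand, a test file with known bytes can also be produced locally. This is only a sketch, assuming a POSIX printf and mktemp are available; make_utf8_sample is a hypothetical helper name.

```shell
# Write W-circumflex and Y-circumflex plus a newline as raw UTF-8 bytes.
# Octal escapes keep the script itself pure ASCII:
# U+0174 = C5 B4 (octal 305 264), U+0176 = C5 B6 (octal 305 266).
make_utf8_sample() {
    printf '\305\264\305\266\n' > "$1"
}

sample=$(mktemp)          # temporary file; the name is arbitrary
make_utf8_sample "$sample"
cat "$sample"             # a correctly configured terminal shows two accented letters
rm -f "$sample"
```

Because the bytes are written explicitly, a wrong rendering here points at the terminal, locale or font rather than at the file.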
# 10  
Old 10-30-2008
Hi cbkihong,

Thanks again for the reply.

My terminal is a normal console. I've tried the same with xterm and an Emacs shell. So does this mean that it is not possible to see these characters (any 2-byte UTF-8 character) on a shell terminal?

I tried the reverse of the cat check you suggested: I copied the file I have onto Windows and opened it with Notepad/EditPlus. I can see the UTF-8 characters correctly there.

I also tried a few combinations in the Unix shell with the file I have.
Case 1: I simply print the variable and the output of the decode function, and I see the following output:

ŴŶ (Original string)
Wide character in print at ./test1.pl line 11, <FILE> line 19.
ŴŶ (utf-8 decoded string)

If I then enable binmode, the warning goes away but the displayed characters change:

Ã\205´Ã\205¶ (Original string)
ŴŶ (utf-8 decoded string)


In the second case, I am unable to understand why the original string now shows different characters (as opposed to case 1, where I simply print the string), and why the decoded string displays the same characters the original string did before (I expected it to show different characters, considering that each character takes 2 bytes and is being converted to the relevant UTF-8).
# 11  
Old 10-30-2008
Quote:
Originally Posted by sumirmehta
My terminal is a normal console. I've tried the same with xterm and an Emacs shell. So does this mean that it is not possible to see these characters (any 2-byte UTF-8 character) on a shell terminal?
You may not be able to see these on a normal console, but you should be able to see multibyte UTF-8 characters rendered in an X-Windows-based terminal provided:

1. You have set the correct locale in the shell.
2. You are using a terminal emulator (e.g. uxterm, konsole) that handles Unicode processing properly.
3. You have configured the terminal emulator for the correct encoding.
4. You have the needed X fonts installed, and selected for rendering that specific Unicode character.

On Unix, an X terminal emulator is always forked from some other process (a shell running in another terminal, or a desktop environment such as GNOME, KDE or IceWM), so the locale of the parent process may affect rendering. The locale you need to fix may therefore sit further up the chain, sometimes all the way up at the system locale, and that is especially nasty.
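The inheritance can be seen with a small sketch, assuming a POSIX shell and the env utility; child_locale is a made-up helper. A child process sees whatever locale its parent hands it, which is the same mechanism by which a terminal emulator picks up the locale of the process that started it.

```shell
# Start a child shell with an explicit LC_ALL and print what it inherited.
# stderr is silenced because some shells warn when the named locale is
# not installed; the variable is still passed down.
child_locale() {
    env LC_ALL="$1" sh -c 'printf "%s\n" "$LC_ALL"' 2>/dev/null
}

child_locale en_US.UTF-8
```

Starting the terminal emulator the same way (with the locale set in front of the command) is one way to rule out a wrong inherited locale.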

I tried printing the U+0174 character you mentioned with 3 terminals: xterm, konsole and gnome-terminal. gnome-terminal and uxterm displayed it correctly on the machine I am currently using. You can look at the screenshot. For Chinese UTF-8 3-byte characters, they are rendered properly on all of the terminals I tried.

Normally, a font only contains glyphs for a subset of the supported character set. Given the wide range of characters embraced by Unicode, it is not unusual that fonts not designed to render a specific range of characters may fail to render those characters properly. With many X-based terminal emulators, setting the font is easy, but getting to know which font to use is likely a trickier issue.

In the worst case, if you have no idea whether a font contains the glyphs for the characters you need, you may need to use something like FontForge, as some experts in the field suggested to me while I was playing with LaTeX. But FontForge is not trivial. You can probably find other programs that give you an easier interface without resorting to FontForge.
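On systems that use fontconfig, one such lighter option may be fc-list, which can filter fonts by code point. The sketch below assumes fontconfig is installed and that the terminal takes its fonts from fontconfig, which is not true for every setup (core X bitmap fonts are not covered). The helper names are hypothetical.

```shell
# Build a fontconfig pattern that matches fonts covering one code point.
# The argument is the code point in hex, without the U+ prefix.
charset_pattern() {
    printf ':charset=%s' "$1"
}

# List font families that claim to contain the glyph, if fc-list exists.
fonts_with_codepoint() {
    if command -v fc-list >/dev/null 2>&1; then
        fc-list "$(charset_pattern "$1")" family | sort -u
    else
        echo "fc-list not found; fontconfig is not installed" >&2
        return 1
    fi
}

fonts_with_codepoint 0174 || true
```

An empty list means no installed fontconfig font advertises the glyph.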

I'm not exactly sure about the tests you mentioned. But my experience is that if you have doubts about the generated byte stream, pipe it to hexdump (or od, as you prefer) and check the individual bytes. The golden rule still applies: check the bytes first, and if the bytes are correct but the rendering isn't, check the environment (terminal, shell, fonts, locale).
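A sketch of such a byte check with od, assuming a POSIX shell. Both helper names are made up. The second helper simulates a likely cause of the changed output in the binmode case: each UTF-8 byte gets treated as a Latin-1 character and is encoded a second time. The \205 in that output is octal for 0x85, which fits this pattern.

```shell
# Print stdin as one continuous lowercase hex string.
bytes_as_hex() {
    od -An -v -tx1 | tr -d ' \n'
    echo
}

# Simulate double encoding: treat every input byte as a Latin-1 character
# and print the hex of its UTF-8 encoding.
double_encoded_hex() {
    for b in $(od -An -v -tu1); do
        if [ "$b" -lt 128 ]; then
            printf '%02x' "$b"
        else
            # Two-byte UTF-8 form: 110xxxxx 10xxxxxx
            printf '%02x%02x' $(( 0xC0 | (b >> 6) )) $(( 0x80 | (b & 0x3F) ))
        fi
    done
    echo
}

printf '\305\264' | bytes_as_hex          # prints c5b4, correct UTF-8 for U+0174
printf '\305\264' | double_encoded_hex    # prints c385c2b4, the double-encoded form
```

If the script's real output dumps as c5b4, the bytes are fine and the problem is in the environment. If it dumps as c385c2b4, the string was encoded twice somewhere in the program.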

I must admit that getting Unicode processed and rendered correctly the first time is tricky, but once it is done, you may find it much easier the second time.
Attached screenshots: u0174_gnome_terminal.png, u0174_uxterm.png