Convert UTF-8 encoded hex value to a character

10-29-2008

Hi cbkihong,

Thanks for the elaborate explanation.

1. My Perl version is 5.8.0, so UTF-8 support in Perl itself should not be an issue.

2. I tried the steps you mentioned (using binmode and the decode function). In my case, when I try
print "\x{0174}\n" (as 0174 is the UTF-8 hex code for the required character),
I see 'Ŵ' instead of 'Ŵ', with the warning in case 2 and without the warning in case 3. As for case 4, when I use the decode() function, the output I get is blank.
Hence I suppose that my shell is not capable of displaying the character after decoding it.
In fact, even if I simply paste the character (Ŵ) into my terminal, it appears only as '.' (a dot).
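
A note on terminology that may help here: 0174 is the Unicode code point (U+0174), while its UTF-8 encoding is a two-byte sequence. The sketch below uses Python purely as a neutral illustration, and the helper names are hypothetical, not something from this thread. It shows the bytes UTF-8 produces for that code point and how they would look if a terminal interpreted them as ISO-8859-1, which is one plausible reason for seeing two characters where one was expected.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def seen_as_latin1(ch: str) -> str:
    """Return what the UTF-8 bytes of ch look like when read as ISO-8859-1."""
    return ch.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    w = "\u0174"  # LATIN CAPITAL LETTER W WITH CIRCUMFLEX
    print(f"code point:    U+{ord(w):04X}")
    print(f"UTF-8 bytes:   {utf8_hex(w)}")
    # ascii() keeps the output printable even on a non-UTF-8 terminal
    print(f"as ISO-8859-1: {ascii(seen_as_latin1(w))}")
```

The same two bytes are what a Perl script would emit for "\x{0174}" once the output is encoded as UTF-8, so whether they appear as one glyph or two depends on the terminal, not on the script.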

Does this have to do with setting the locale for my environment properly? I tried a few of the available locales, but it did not help.

locale -a on my machine gives -->

POSIX
common
en_US.UTF-8
C
iso_8859_1
iso_8859_15
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.ISO8859-15@euro
en_IE
en_IE.ISO8859-1
en_IE.ISO8859-15
en_IE.ISO8859-15@euro
fr
fr.ISO8859-15
fr.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.ISO8859-15@euro
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.ISO8859-15@euro
fr_FR.UTF-8
fr_FR.UTF-8@euro
nl
nl.ISO8859-15
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.ISO8859-15@euro
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.ISO8859-15@euro
en_UK
nl_BE.UTF-8
nl_NL.UTF-8

Could you suggest whether a specific locale should be used? In general, how should the locale be selected? Any links would be a great help.

Or is there something else that I should configure? The character I've picked was chosen at random from the UTF-8 character set and is not related to any specific language.

sumirmehta

10-29-2008

Simply, your terminal is not properly configured for UTF-8 yet.

I'm not sure about Solaris, but on Linux just check the values of LANG and LC_ALL, make sure they are something like en_US.UTF-8, and confirm you have a font that can render those characters.
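
The LANG/LC_ALL check can be sketched as follows. This is only an illustration in Python with hypothetical helper names. It assumes the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) and only looks at the codeset suffix of the locale name, so a locale without a suffix is reported as non-UTF-8 even if the system happens to map it to UTF-8.

```python
import os


def effective_ctype(env) -> str:
    """Return the locale name that decides character handling.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset; with nothing set the locale is "C".
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            return value
    return "C"


def is_utf8_locale(name: str) -> bool:
    """True if the codeset part of a locale name (after '.') is UTF-8."""
    if "." not in name:
        return False
    codeset = name.split(".", 1)[1].split("@", 1)[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_ctype(os.environ)
    print(name, "-> UTF-8" if is_utf8_locale(name) else "-> not UTF-8")
```

Note that a stray LC_ALL=C wins over LANG=en_US.UTF-8, which is an easy thing to miss when only LANG is inspected.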

What kind of terminal is it? A normal console or X-Windows based? I'm not sure if you can use Unicode with a normal console at all. For terminals like xterm, gnome-terminal or konsole, that should be possible with the relevant fonts installed.

One thing you can test: prepare a UTF-8 text file on another system and cat(1) it from a shell in your terminal. If cat does not render it correctly, go check your terminal.
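
A test file for the cat check could be prepared like this. It is a minimal sketch in Python, and the function name and file name are made up for illustration. It writes the two characters discussed in this thread with known UTF-8 bytes, so the only variable left is the terminal.

```python
import os
import tempfile
from pathlib import Path


def write_utf8_sample(path: str) -> bytes:
    """Write a two-line UTF-8 sample file and return the bytes written."""
    text = "U+0174 \u0174\nU+0176 \u0176\n"
    data = text.encode("utf-8")
    Path(path).write_bytes(data)
    return data


if __name__ == "__main__":
    # Hypothetical location; copy the file to the machine under test afterwards.
    target = os.path.join(tempfile.gettempdir(), "utf8_sample.txt")
    write_utf8_sample(target)
    print("wrote", target)
```

If cat on the target machine shows each line ending in a single accented letter, the terminal is decoding UTF-8; two odd characters per line suggest a single-byte encoding is in effect.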

cbkihong

10-30-2008

Hi cbkihong,

Thanks again for the reply.

My terminal is a normal console. I've also tried xterm and the Emacs shell. So does this mean that it is not possible to see these characters (any 2-byte UTF-8 character) on a shell terminal?

I tried the reverse of the cat check you suggested: I copied the file that I have onto Windows and opened it with Notepad/EditPlus. I can see the UTF-8 characters correctly there.

I also tried a few combinations on the Unix shell with the file that I have.
Case 1: I simply print the variable and the output of the decode function. I see the following output:

ŴŶ (Original string)
Wide character in print at ./test1.pl line 11, <FILE> line 19.
ŴŶ (utf-8 decoded string)

If I then enable binmode, the warning goes away but the displayed characters change:

�\205´�\205¶ (Original string)
ŴŶ (utf-8 decoded string)

In the second case, I do not understand why the original string now shows different characters (as opposed to case 1, where I simply print the string), or why the decoded string displays the same characters as the original string did. I expected it to show a different character, considering 2 bytes per character and the conversion to the relevant UTF-8 character.
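
One possible reading of the second case is sketched below in Python, as an illustration only. It assumes that the "original" string holds the raw UTF-8 bytes read from the file and that binmode puts a UTF-8 encoding layer on the output. The function name is hypothetical. Under those assumptions the raw bytes get encoded a second time, while the decoded string comes out as the same bytes the raw string produced in case 1.

```python
def through_utf8_layer(raw: bytes) -> bytes:
    """Model a UTF-8 output layer applied to a string of undecoded bytes.

    Each byte is taken as one ISO-8859-1 character and then encoded to
    UTF-8, so every byte >= 0x80 becomes two bytes on output.
    """
    return raw.decode("latin-1").encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Space-separated hex listing of data."""
    return " ".join(f"{b:02x}" for b in data)


if __name__ == "__main__":
    raw = "\u0174\u0176".encode("utf-8")       # file content: c5 b4 c5 b6
    decoded = raw.decode("utf-8")              # two characters, U+0174 U+0176
    print("raw, no layer:        ", hex_bytes(raw))
    print("raw, UTF-8 layer:     ", hex_bytes(through_utf8_layer(raw)))
    print("decoded, UTF-8 layer: ", hex_bytes(decoded.encode("utf-8")))
```

The doubled output contains the byte 0x85, which is 205 in octal, consistent with the \205 visible in the binmode output above.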

sumirmehta

10-30-2008

Quote:

Originally Posted by sumirmehta

My terminal is a normal console. I've also tried xterm and the Emacs shell. So does this mean that it is not possible to see these characters (any 2-byte UTF-8 character) on a shell terminal?

You may not be able to see these on a normal console, but you should be able to see multibyte UTF-8 characters rendered in an X-Windows-based terminal provided:

1. You have set the correct locale at the shell.
2. You are using a terminal emulator (e.g. uxterm, konsole) that handles Unicode processing properly.
3. You have configured the terminal emulator for the correct encoding.
4. You have the needed X fonts installed, and selected for rendering that specific Unicode character.

On Unix, because an X terminal emulator must have been forked from some process (be it a shell running in another terminal, or a desktop environment such as GNOME, KDE or IceWM), the locale of the parent process may affect the rendering. The problem may therefore propagate upwards until you hit the system locale, and that is especially nasty.

I tried printing the U+0174 character you mentioned with 3 terminals: xterm, konsole and gnome-terminal. gnome-terminal and uxterm displayed it correctly on the machine I am currently using. You can look at the screenshot. For Chinese UTF-8 3-byte characters, they are rendered properly on all of the terminals I tried.

Normally, a font only contains glyphs for a subset of the supported character set. Given the wide range of characters embraced by Unicode, it is not unusual that fonts not designed to render a specific range of characters may fail to render those characters properly. With many X-based terminal emulators, setting the font is easy, but getting to know which font to use is likely a trickier issue.

In the worst case, if you have no idea whether a font contains the glyph for the specific characters you need, you may need to use something like FontForge, as some experts in the field suggested to me while I was playing with LaTeX. But FontForge is not trivial. You can probably find other programs that give you an easier interface without resorting to FontForge.

I'm not exactly sure about the tests you mentioned. But, my experience is that if you have doubts over the generated bytestream, pass it over to hexdump (or od, as you prefer) and check individual bytes. The golden rule still applies - check the bytes first, and if the bytes are correct but rendering isn't, check the environment (terminal, shell, fonts, locale).
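
The "check the bytes first" rule can also be sketched without hexdump or od. The Python helper below is a hypothetical illustration, not a replacement for those tools: it lists the bytes in hex and reports whether they form well-formed UTF-8.

```python
def describe_bytes(data: bytes) -> str:
    """Return a hex listing of data and whether it is well-formed UTF-8."""
    hex_part = " ".join(f"{b:02x}" for b in data)
    try:
        data.decode("utf-8")
        verdict = "valid UTF-8"
    except UnicodeDecodeError:
        verdict = "not valid UTF-8"
    return f"{hex_part}  ({verdict})"


if __name__ == "__main__":
    # Sample input: the expected encoding of U+0174 followed by a newline.
    print(describe_bytes(b"\xc5\xb4\n"))
    # A lone lead byte, as a truncated or single-byte-encoded stream might contain.
    print(describe_bytes(b"\xc5"))
```

Validity alone does not prove the bytes are the intended ones, since doubly encoded text is still valid UTF-8. Comparing the listing against the expected sequence (c5 b4 for U+0174) is the more reliable check.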

I must admit that getting Unicode to be processed and rendered correctly the first time is tricky, but once it is done, you may find that it becomes much easier the second time you do it.

[Attached screenshot: u0174_gnome_terminal.png]

[Attached screenshot: u0174_uxterm.png]

cbkihong
