Charsets and encoding details | Unix Linux Forums | UNIX for Dummies Questions & Answers

Charsets and encoding details

#1  11-14-2012
wakatana

Hello gurus, I would like to dig deeper into charset and encoding issues. I tried Googling it but had no luck. Please see below.

My configuration

Code:
[pista@HP-PC MULTIBOOT]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I have file1, containing text. I can see this text correctly only on MS Windows. If I just open the file with less, cat or vi, I get this:

Code:
[pista@HP-PC konvertovanie]$ cat file1 
- Prich�dzaj�.
- Kto prich�dza?
N�� svet okupuj
vyvinut� �udsk� druhy,

[pista@HP-PC konvertovanie]$ less file1 
- Prich<E1>dzaj<FA>.
- Kto prich<E1>dza?
N<E1><9A> svet okupuj<FA>
vyvinut<E9> <BE>udsk<E9> druhy,

[pista@HP-PC konvertovanie]$ vi file1 
- Prichádzajú.
- Kto prichádza?
Ná<9a> svet okupujú
vyvinuté ľudské druhy,

Under Linux I have to use iconv to see it correctly:

Code:
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1
- Prichádzajú.
- Kto prichádza?
Náš svet okupujú
vyvinuté ľudské druhy,

I understand that this is because the file was encoded in one format (WINDOWS-1250) and is being decoded as another (UTF-8). But can you clarify the following?

1.) When I check the decimal ASCII value of each character, I get the following lines. What do the negative values mean, and what is that code 341 (instead of á)? AFAIK ASCII ranges from 0 to 127.

Code:
[pista@HP-PC konvertovanie]$ cat file1 | od -An -t dC -c
   45   32   80  114  105   99  104  -31  100  122   97  106   -6   46   13   10
    -         P    r    i    c    h  341    d    z    a    j  372    .   \r   \n
   45   32   75  116  111   32  112  114  105   99  104  -31  100  122   97   63
    -         K    t    o         p    r    i    c    h  341    d    z    a    ?
   13   10   78  -31 -102   32  115  118  101  116   32  111  107  117  112  117
   \r   \n    N  341  232         s    v    e    t         o    k    u    p    u
  106   -6   13   10   48   48   58   48   48   58   48   53   44   56   50   48
    j  372   \r   \n    0    0    :    0    0    :    0    5    ,    8    2    0
   32   45   45   62   32   48   48   58   48   48   58   48   55   44   54   53
         -    -    >         0    0    :    0    0    :    0    7    ,    6    5
   52   13   10  118  121  118  105  110  117  116  -23   32  -66  117  100  115
    4   \r   \n    v    y    v    i    n    u    t  351       276    u    d    s
  107  -23   32  100  114  117  104  121   44   13   10
    k  351         d    r    u    h    y    ,   \r   \n

2.) My assumption is that UTF-8 and WINDOWS-1250 use different "numbers" (code representations) for the same characters. When a character is encoded using encoding1 (WINDOWS-1250), it gets the appropriate "code1" from the encoding1 table. If this encoded character (or rather its numeric representation, "code1") is then decoded using another encoding (UTF-8), the only thing that happens is that "code1" is looked up in the encoding2 (UTF-8) table and the corresponding character from that table is assigned. Am I right? I think an example will make it clear:

Please look at the following sites; they show what happens if you encode with one encoding and decode with another. It seems that up to the 127 (decimal) boundary it does not matter if you decode with the wrong encoding (this is why some characters in the example above were displayed correctly even though the wrong encoding was used).

from UTF-8 to WINDOWS-1250
Encoding utf-8 to windows-1250

from WINDOWS-1250 to UTF-8
Encoding windows-1250 to utf-8

According to this site (The extreme UTF-8 table), the "á" character is encoded in UTF-8 as 225. According to Wikipedia (Windows-1250 - Wikipedia, the free encyclopedia), "á" also has the value 225 in Windows-1250. So why is "á" not displayed correctly even when I use the wrong encoding? Check here and type "á": Encoding / decoding tool. Analyze character encoding problems and errors. Another interesting observation: in the UTF-8 table the "š" character appears twice (once with code 154 and once with code 453). Why?

3.) If I understand it right, there is no way to tell how a file was encoded (unless there is some header that specifies this, or you do some statistical language analysis, etc.). So why/how does the "file" command recognize the UTF-8 encoding but not WINDOWS-1250?

Code:
[pista@HP-PC konvertovanie]$ file -bi file1 
text/plain; charset=unknown-8bit
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1 > file1.utf8
[pista@HP-PC konvertovanie]$ file -bi file1.utf8 
text/plain; charset=utf-8

Thank you very much
#2  11-15-2012
Don Cragun (Moderator)
 
Most of the websites you referenced are not really applicable to this question. By convention, using LANG=en_US.UTF-8 indicates that you are using the English language with formatting conventions used in the United States encoded using the UTF-8 character set.

The UTF-8 character set can encode over a million characters using one to four bytes per character (the original design allowed sequences of up to six bytes). Any byte that has the high order bit set is part of a multi-byte character. Any byte that has the high order bit clear represents the same character as the US-ASCII character with the same value. See Wikipedia UTF-8 for details.
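As a rough illustration of the high-order-bit rule described above, here is a minimal Python sketch. The helper name is_ascii_byte is made up for this example; the only library call used is the standard str.encode().

```python
def is_ascii_byte(b: int) -> bool:
    """Return True when the high-order bit is clear (value 0-127)."""
    return b & 0x80 == 0

# "a" is plain US-ASCII; "\u00e1" is the character á.
ascii_bytes = "a".encode("utf-8")
multi_bytes = "\u00e1".encode("utf-8")

print(list(ascii_bytes))   # one byte, high bit clear
print(list(multi_bytes))   # two bytes, both with the high bit set
print([is_ascii_byte(b) for b in ascii_bytes + multi_bytes])
```

The ASCII letter stays a single byte below 128, while á becomes two bytes that both have the high bit set.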

The Windows-1250 codeset is used to represent text for Central and Eastern European characters using Latin script and encodes 251 characters using one byte to encode a character. Any byte that has the high order bit clear represents the same character as the US-ASCII character with the same value. See Wikipedia Windows-1250 for details. That web page shows that the character with decimal value 225 in Windows-1250 ("á") corresponds to the Unicode character with value 00E1, which is encoded in UTF-8 as the two byte sequence 0xC3 0xA1 (or in decimal 195 161). Since the single byte with value 225 (base 10) in Windows-1250 is not the same as the two byte values 195 followed by 161 in UTF-8, you will not see the same printed characters when you look at Windows-1250 codeset characters after telling the system that you are using a locale with a UTF-8 codeset representation of characters. Furthermore, in UTF-8 there is never a single byte character with the high bit set and there is never a multi-byte character that has any byte without the high bit set.
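To tie this back to the od output in the first post: as far as I know, od -t dC prints each byte as a signed char, and od -c prints non-printable bytes as three-digit octal, which would explain both the -31 and the 341 shown for "á". The sketch below reproduces those numbers in Python under that assumption; the variable names are only illustrative, and "cp1250" is Python's codec name for Windows-1250.

```python
raw = b"\xe1"  # the Windows-1250 byte for á, as found in file1

unsigned = raw[0]                                         # value as 0..255
signed = unsigned - 256 if unsigned > 127 else unsigned   # value as a signed char
octal = format(unsigned, "o")                             # octal digits, as od -c shows them

char = raw.decode("cp1250")    # interpret the byte as Windows-1250
utf8 = char.encode("utf-8")    # the same character encoded as UTF-8

print(unsigned, signed, octal)
print(list(utf8))
```

So 225, -31 and 341 are three spellings of the same byte, and it is not the same as the UTF-8 byte pair 195 161.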

The iconv utility knows how to convert characters from one codeset to another codeset and as you have seen does so successfully. But, expecting characters from a single byte codeset with the high bit set to be magically converted from invalid characters in UTF-8 based on an unspecified assumption that invalid UTF-8 characters should be treated as Windows-1250 characters just won't work.
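A small Python sketch of the same idea, not a description of how iconv or file are actually implemented: the helper names cp1250_to_utf8 and looks_like_utf8 are invented for this example, and the sample bytes are taken from the third line of the od dump in the first post. Because UTF-8 has a strict byte structure, a validity check is possible, whereas nearly any byte sequence is acceptable in a single-byte codeset, which is one plausible reason a tool can report utf-8 but only unknown-8bit for the original file.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict decoding raises on invalid input
        return True
    except UnicodeDecodeError:
        return False

def cp1250_to_utf8(data: bytes) -> bytes:
    """Roughly what 'iconv -f WINDOWS-1250 -t UTF-8' does to the bytes."""
    return data.decode("cp1250").encode("utf-8")

# Bytes 78, 225, 154, ... from the od output, written in hex.
sample = b"N\xe1\x9a svet okupuj\xfa\r\n"
converted = cp1250_to_utf8(sample)

print(looks_like_utf8(sample))      # original Windows-1250 bytes
print(looks_like_utf8(converted))   # bytes after conversion
print(converted.decode("utf-8"))
```

The conversion only works because the source codeset is stated explicitly; the bytes alone do not say which single-byte codeset they came from.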