utf8(5) [freebsd man page]

UTF8(5) 						      BSD File Formats Manual							   UTF8(5)

NAME
     utf8 -- UTF-8, a transformation format of ISO 10646

SYNOPSIS
     ENCODING "UTF-8"

DESCRIPTION
     The UTF-8 encoding represents UCS-4 characters as a sequence of octets,
     using between 1 and 6 for each character.  It is backwards compatible
     with ASCII, so 0x00-0x7f refer to the ASCII character set.  The multibyte
     encoding of a non-ASCII character consists entirely of bytes whose
     high-order bit is set.  The actual encoding is represented by the
     following table:

     [0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
     [0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
     [0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb
     [0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

     If more than a single representation of a value exists (for example,
     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
     used.  Longer ones are detected as an error because they pose a potential
     security risk and destroy the 1:1 mapping between characters and octet
     sequences.

SEE ALSO
     euc(5)

     Rob Pike and Ken Thompson, "Hello World", Proceedings of the Winter 1993
     USENIX Technical Conference, USENIX Association, January 1993.

     F. Yergeau, UTF-8, a transformation format of ISO 10646, January 1998,
     RFC 2279.

     The Unicode Standard, Version 3.0, The Unicode Consortium, 2000, as
     amended by the Unicode Standard Annex #27: Unicode 3.1 and by the Unicode
     Standard Annex #28: Unicode 3.2.

STANDARDS
     The utf8 encoding is compatible with RFC 2279 and Unicode 3.2.

BSD				 April 7, 2004				   BSD
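
     The encoding table and the shortest-representation rule in the utf8
     DESCRIPTION can be illustrated with a minimal Python sketch.  It is not
     the FreeBSD implementation; the names utf8_encode, utf8_decode and _FORMS
     are hypothetical and chosen only for this example.  The sketch is written
     by hand because Python's built-in "utf-8" codec follows the later
     restriction to four octets and so does not cover the 5- and 6-octet rows.
     It checks only what the page describes: lead and continuation bit
     patterns and the rejection of longer-than-necessary forms.

```python
from typing import List

# One row per multi-octet line of the table:
# (lowest value, highest value, lead prefix, lead mask, continuation octets)
_FORMS = [
    (0x00000080, 0x000007FF, 0xC0, 0xE0, 1),
    (0x00000800, 0x0000FFFF, 0xE0, 0xF0, 2),
    (0x00010000, 0x001FFFFF, 0xF0, 0xF8, 3),
    (0x00200000, 0x03FFFFFF, 0xF8, 0xFC, 4),
    (0x04000000, 0x7FFFFFFF, 0xFC, 0xFE, 5),
]


def utf8_encode(value: int) -> bytes:
    """Encode one UCS-4 value using the shortest form in the table."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value <= 0x7F:
        return bytes([value])  # ASCII maps to itself
    for low, high, prefix, mask, ntrail in _FORMS:
        if value <= high:
            break  # first row that can hold the value is the shortest
    octets = []
    for _ in range(ntrail):
        octets.append(0x80 | (value & 0x3F))  # 10bbbbbb, lowest 6 bits first
        value >>= 6
    octets.append(prefix | value)  # lead octet carries the remaining bits
    return bytes(reversed(octets))


def utf8_decode(data: bytes) -> List[int]:
    """Decode octets to UCS-4 values, rejecting non-shortest forms."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            values.append(lead)
            i += 1
            continue
        for low, high, prefix, mask, ntrail in _FORMS:
            if (lead & mask) == prefix:
                break
        else:
            raise ValueError("invalid lead octet")
        trail = data[i + 1:i + 1 + ntrail]
        if len(trail) != ntrail:
            raise ValueError("truncated sequence")
        value = lead & ~mask & 0xFF  # payload bits of the lead octet
        for octet in trail:
            if (octet & 0xC0) != 0x80:
                raise ValueError("invalid continuation octet")
            value = (value << 6) | (octet & 0x3F)
        if value < low:
            # e.g. 0xC0 0x80 or 0xE0 0x80 0x80 standing for 0x00
            raise ValueError("non-shortest representation")
        values.append(value)
        i += 1 + ntrail
    return values
```

     With these definitions, utf8_encode(0x20AC) yields the three octets
     0xE2 0x82 0xAC, while utf8_decode(b"\xc0\x80") raises ValueError instead
     of returning 0x00.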

UTF2(5) 						      BSD File Formats Manual							   UTF2(5)

NAME
     utf2 -- Universal character set Transformation Format encoding of runes

SYNOPSIS
     ENCODING "UTF2"

DESCRIPTION
     The UTF2 encoding has been deprecated in favour of UTF-8.  New
     applications should not use UTF2.

     The UTF2 encoding is based on a proposed X-Open multibyte FSS-UCS-TF
     (File System Safe Universal Character Set Transformation Format) encoding
     as used in Plan 9 from Bell Labs.  Although it is capable of representing
     more than 16 bits, the current implementation is limited to 16 bits as
     defined by the Unicode Standard.

     UTF2 representation is backwards compatible with ASCII, so 0x00-0x7f
     refer to the ASCII character set.  The multibyte encoding of a rune
     between 0x0080 and 0xffff consists entirely of bytes whose high-order bit
     is set.  The actual encoding is represented by the following table:

     [0x0000 - 0x007f] [00000000.0bbbbbbb] -> 0bbbbbbb
     [0x0080 - 0x07ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
     [0x0800 - 0xffff] [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb

     If more than a single representation of a value exists (for example,
     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always
     used (but the longer ones will be correctly decoded).

     The final three encodings provided by X-Open, which provide for the
     entire proposed ISO-10646 31-bit standard, are currently not implemented:

     [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
     [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

SEE ALSO
     mklocale(1), setlocale(3), utf8(5)

BSD				October 11, 2002			   BSD
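
     The decoding behaviour described for UTF2 (three implemented forms, with
     longer-than-necessary representations still decoded) can be sketched in
     Python as follows.  This is an illustration only, not the FreeBSD code;
     the name utf2_decode is hypothetical.  The page does not say how
     malformed or truncated input is handled, so raising ValueError in those
     cases is an assumption of this sketch.

```python
from typing import List


def utf2_decode(data: bytes) -> List[int]:
    """Decode the 1-, 2- and 3-octet UTF2 forms into 16-bit runes.

    Unlike a strict UTF-8 decoder, no shortest-form check is made, so
    0xC0 0x80 and 0xE0 0x80 0x80 both decode to 0x00.
    """
    runes = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0bbbbbbb
            rune, ntrail = lead, 0
        elif (lead & 0xE0) == 0xC0:     # 110bbbbb
            rune, ntrail = lead & 0x1F, 1
        elif (lead & 0xF0) == 0xE0:     # 1110bbbb
            rune, ntrail = lead & 0x0F, 2
        else:
            # Includes the 4-, 5- and 6-octet forms, which are not implemented.
            raise ValueError("lead octet outside the implemented forms")
        trail = data[i + 1:i + 1 + ntrail]
        if len(trail) != ntrail:
            raise ValueError("truncated sequence")  # assumption, see lead-in
        for octet in trail:
            if (octet & 0xC0) != 0x80:
                raise ValueError("invalid continuation octet")  # assumption
            rune = (rune << 6) | (octet & 0x3F)
        runes.append(rune)
        i += 1 + ntrail
    return runes
```

     Here utf2_decode(b"\x00"), utf2_decode(b"\xc0\x80") and
     utf2_decode(b"\xe0\x80\x80") all return [0], whereas a four-octet
     sequence such as b"\xf0\x90\x80\x80" is rejected.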