Linux and UNIX Man Pages

Linux & Unix Commands - Search Man Pages

utf2(5) [osx man page]

UTF2(5) 						      BSD File Formats Manual							   UTF2(5)

NAME
utf2 -- Universal character set Transformation Format encoding of runes SYNOPSIS
ENCODING "UTF2" DESCRIPTION
The UTF2 encoding has been deprecated in favour of UTF-8. New applications should not use UTF2. The UTF2 encoding is based on a proposed X-Open multibyte FSS-UCS-TF (File System Safe Universal Character Set Transformation Format) encod- ing as used in Plan 9 from Bell Labs. Although it is capable of representing more than 16 bits, the current implementation is limited to 16 bits as defined by the Unicode Standard. UTF2 representation is backwards compatible with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte encoding of runes between 0x0080 and 0xffff consist entirely of bytes whose high order bit is set. The actual encoding is represented by the following table: [0x0000 - 0x007f] [00000000.0bbbbbbb] -> 0bbbbbbb [0x0080 - 0x07ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb [0x0800 - 0xffff] [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb If more than a single representation of a value exists (for example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always used (but the longer ones will be correctly decoded). The final three encodings provided by X-Open: [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb which provides for the entire proposed ISO-10646 31 bit standard are currently not implemented. SEE ALSO
mklocale(1), setlocale(3), utf8(5) BSD
October 11, 2002 BSD

Check Out this Related Man Page

UTF(6)								   Games Manual 							    UTF(6)

NAME
UTF, Unicode, ASCII, rune - character set and format DESCRIPTION
The Plan 9 character set and representation are based on the Unicode Standard and on the ISO multibyte UTF-8 encoding (Universal Character Set Transformation Format, 8 bits wide). The Unicode Standard represents its characters in 16 bits; UTF-8 represents such values in an 8-bit byte stream. Throughout this manual, UTF-8 is shortened to UTF. In Plan 9, a rune is a 16-bit quantity representing a Unicode character. Internally, programs may store characters as runes. However, any external manifestation of textual information, in files or at the interface between programs, uses a machine-independent, byte-stream encoding called UTF. UTF is designed so the 7-bit ASCII set (values hexadecimal 00 to 7F), appear only as themselves in the encoding. Runes with values above 7F appear as sequences of two or more bytes with values only from 80 to FF. The UTF encoding of the Unicode Standard is backward compatible with ASCII: programs presented only with ASCII work on Plan 9 even if not written to deal with UTF, as do programs that deal with uninterpreted byte streams. However, programs that perform semantic processing on ASCII graphic characters must convert from UTF to runes in order to work properly with non-ASCII input. See rune(2). Letting numbers be binary, a rune x is converted to a multibyte UTF sequence as follows: 01. x in [00000000.0bbbbbbb] -> 0bbbbbbb 10. x in [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb 11. x in [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb Conversion 01 provides a one-byte sequence that spans the ASCII character set in a compatible way. Conversions 10 and 11 represent higher- valued characters as sequences of two or three bytes with the high bit set. Plan 9 does not support the 4, 5, and 6 byte sequences pro- posed by X-Open. When there are multiple ways to encode a value, for example rune 0, the shortest encoding is used. In the inverse mapping, any sequence except those described above is incorrect and is converted to rune hexadecimal 0080. FILES
/lib/unicode table of characters and descriptions, suitable for look(1). SEE ALSO
ascii(1), tcs(1), rune(2), keyboard(6), The Unicode Standard. UTF(6)
Man Page