utf(6) [plan9 man page]

UTF(6)								   Games Manual 							    UTF(6)

NAME
UTF, Unicode, ASCII, rune - character set and format
DESCRIPTION
The Plan 9 character set and representation are based on the Unicode Standard and on the ISO multibyte UTF-8 encoding (Universal Character Set Transformation Format, 8 bits wide). The Unicode Standard represents its characters in 16 bits; UTF-8 represents such values in an 8-bit byte stream. Throughout this manual, UTF-8 is shortened to UTF.

In Plan 9, a rune is a 16-bit quantity representing a Unicode character. Internally, programs may store characters as runes. However, any external manifestation of textual information, in files or at the interface between programs, uses a machine-independent, byte-stream encoding called UTF.

UTF is designed so that the characters of the 7-bit ASCII set (hexadecimal values 00 to 7F) appear only as themselves in the encoding. Runes with values above 7F appear as sequences of two or more bytes with values only from 80 to FF.

The UTF encoding of the Unicode Standard is backward compatible with ASCII: programs presented only with ASCII work on Plan 9 even if not written to deal with UTF, as do programs that deal with uninterpreted byte streams. However, programs that perform semantic processing on ASCII graphic characters must convert from UTF to runes in order to work properly with non-ASCII input. See rune(2).

Letting numbers be binary, a rune x is converted to a multibyte UTF sequence as follows:

01. x in [00000000.0bbbbbbb] -> 0bbbbbbb
10. x in [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
11. x in [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb

Conversion 01 provides a one-byte sequence that spans the ASCII character set in a compatible way. Conversions 10 and 11 represent higher-valued characters as sequences of two or three bytes with the high bit set. Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open. When there are multiple ways to encode a value, for example rune 0, the shortest encoding is used.

In the inverse mapping, any sequence except those described above is incorrect and is converted to rune hexadecimal 0080.
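The three conversions and the inverse mapping can be sketched in C as follows. This is a minimal illustration written from the description above, not the Plan 9 library code documented in rune(2). The names Rune16, BadRune, rune_to_utf and utf_to_rune are hypothetical. The choice to consume exactly one byte when a sequence is incorrect is also an assumption, since the text does not say how far the decoder advances on an error.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short Rune16;	/* hypothetical name for a 16-bit rune */
enum { BadRune = 0x80 };	/* rune the text assigns to incorrect sequences */

/* Encode r into buf (room for 3 bytes); return the number of bytes written. */
int
rune_to_utf(unsigned char *buf, Rune16 r)
{
	assert(buf != NULL);
	if (r <= 0x7F) {		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)r;
		return 1;
	}
	if (r <= 0x7FF) {		/* conversion 10: 110bbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (r >> 12));
	buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (r & 0x3F));
	return 3;
}

/*
 * Decode one rune from the n >= 1 bytes at s into *r and return the
 * number of bytes consumed.  Anything that is not a shortest-form
 * sequence from the table yields BadRune; consuming one byte in that
 * case is an assumption of this sketch.
 */
int
utf_to_rune(Rune16 *r, const unsigned char *s, int n)
{
	unsigned int c, v;

	assert(r != NULL && s != NULL && n >= 1);
	c = s[0];
	if (c < 0x80) {			/* inverse of conversion 01 */
		*r = (Rune16)c;
		return 1;
	}
	if ((c & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
		v = ((c & 0x1F) << 6) | (s[1] & 0x3Fu);
		if (v > 0x7F) {		/* reject non-shortest forms */
			*r = (Rune16)v;
			return 2;
		}
	} else if ((c & 0xF0) == 0xE0 && n >= 3 &&
	    (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		v = ((c & 0x0F) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
		if (v > 0x7FF) {	/* reject non-shortest forms */
			*r = (Rune16)v;
			return 3;
		}
	}
	*r = BadRune;			/* incorrect sequence */
	return 1;
}
```

Under these assumptions, rune hexadecimal 00E9 encodes as the two bytes C3 A9, and the non-shortest sequence C0 80 (an alternative spelling of rune 0) decodes to rune 0080 rather than 0.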
FILES
/lib/unicode table of characters and descriptions, suitable for look(1).
SEE ALSO
ascii(1), tcs(1), rune(2), keyboard(6), The Unicode Standard.
UTF(6)

TCS(1)							      General Commands Manual							    TCS(1)

NAME
tcs - translate character sets
SYNOPSIS
tcs [ -slcv ] [ -f ics ] [ -t ocs ] [ file ... ]
DESCRIPTION
Tcs interprets the named file(s) (standard input default) as a stream of characters from the ics character set or format, converts them to runes, and then converts them into a stream of characters from the ocs character set or format on the standard output. The default value for ics and ocs is utf, the UTF encoding described in utf(6).

The -l option lists the character sets known to tcs. Processing continues in the face of conversion errors (the -s option prevents reporting of these errors). The -c option forces the output to contain only correctly converted characters; otherwise, 0x80 characters will be substituted for UTF encoding errors and 0xFFFD characters will be substituted for unknown characters. The -v option generates various diagnostic and summary information on standard error, or makes the -l output more verbose.

Tcs recognizes an ever-changing list of character sets. In particular, it supports a variety of Russian and Japanese encodings. Some of the supported encodings are:

utf        The Plan 9 UTF encoding, known by ISO as UTF-8
utf1       The deprecated original UTF encoding from ISO 10646
ascii      7-bit ASCII
8859-1     Latin-1 (Central European)
8859-2     Latin-2 (Czech .. Slovak)
8859-3     Latin-3 (Dutch .. Turkish)
8859-4     Latin-4 (Scandinavian)
8859-5     Part 5 (Cyrillic)
8859-6     Part 6 (Arabic)
8859-7     Part 7 (Greek)
8859-8     Part 8 (Hebrew)
8859-9     Latin-5 (Finnish .. Portuguese)
koi8       KOI-8 (GOST 19769-74)
jis-kanji  ISO 2022-JP
ujis       EUC-JX: JIS 0208
ms-kanji   Microsoft, or Shift-JIS
jis        (from only) guesses between ISO 2022-JP, EUC or Shift-Jis
gb         Chinese national standard (GB2312-80)
big5       Big 5 (HKU version)
unicode    Unicode Standard 1.0
tis        Thai character set plus ASCII (TIS 620-1986)
msdos      IBM PC: CP 437
atari      Atari-ST character set
EXAMPLES
tcs -f 8859-1
    Convert 8859-1 (Latin-1) characters into UTF format.
tcs -s -f jis
    Convert characters encoded in one of several shift JIS encodings into UTF format. Unknown Kanji will be converted into 0xFFFD characters.
tcs -lv
    Print an up-to-date list of the supported character sets.
SOURCE
/sys/src/cmd/tcs
SEE ALSO
ascii(1), rune(2), utf(6).
TCS(1)
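
The pipeline that the tcs DESCRIPTION outlines (input character set to runes, runes to output character set) can be pictured with a small C sketch of the 8859-1 to UTF case from EXAMPLES. It is not taken from /sys/src/cmd/tcs, and the function name latin1_to_utf is hypothetical. It relies on the fact that each ISO 8859-1 byte value is also the Unicode value of the character it denotes, so the first step is a plain widening and only the one-byte and two-byte UTF forms are ever needed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert n bytes of ISO 8859-1 text at in to UTF
 * at out and return the number of bytes written.  The caller must
 * provide room for 2*n bytes, the worst case.
 */
size_t
latin1_to_utf(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, w = 0;

	assert(in != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		/* step 1: input byte -> rune (identity mapping for 8859-1) */
		unsigned int rune = in[i];

		/* step 2: rune -> UTF bytes */
		if (rune <= 0x7F) {
			out[w++] = (unsigned char)rune;
		} else {
			out[w++] = (unsigned char)(0xC0 | (rune >> 6));
			out[w++] = (unsigned char)(0x80 | (rune & 0x3F));
		}
	}
	return w;
}
```

For other input sets the first step would need a lookup table or a stateful decoder instead of the identity mapping, and an output set other than utf would need an encoder of its own in place of the second step.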