utf-8(5) [osf1 man page]

Unicode(5)							File Formats Manual							Unicode(5)

NAME
Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32, iso10646 - Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION
The operating system provides locales and codeset converters that support the following standards:

    o   The Unicode Standard, Version 3.0, Unicode, Inc., 1999

    o   Information Technology - Universal Multiple-Octet Coded Character Set, ISO/IEC 10646:1993

        The Basic Multilingual Plane defined by this standard is identical with the main body of Unicode character encoding.

These standards define generalized character encoding rules that can be applied to characters in most native language scripts. The Unicode Standard specifies a universal character set (UCS) that, in Version 3.0, contains definitions for 49,194 characters and also includes a Private Use Area for vendor- or user-defined characters. The following list summarizes the main features of this character set:

    o   All characters are treated as 16-bit units.

    o   Each 16-bit unit has an abstract character identity. Certain sequences of 16-bit characters in a text stream are transformed into other characters, called composed characters.

    o   Characters have properties, such as base, numeric, spacing, combination, and directionality. The Unicode Standard provides rules for ordering characters with different properties so that parsing of character sequences is unambiguous.

    o   The relationship between Unicode characters and the glyphs in the native language script that users see, type, or print is not necessarily one-to-one. A glyph may be mapped to a single abstract character or a composed character. Conversely, more than one glyph can be mapped to a character.

    o   The ISO 8859-1 character set occupies the first 256 code positions (and the ASCII character set the first 128 positions) of the UCS.

The ISO/IEC 10646 standard specifies both 16- and 32-bit units for each abstract character defined in the UCS. The 16-bit character values in Unicode are zero-extended through a second 16-bit unit in the larger encoding format. The second, or low-surrogate, 16-bit unit is reserved for future use in both standards.
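The character properties, composed characters, and code-position layout described above can be explored with a short Python sketch. This is an illustration only and is not part of the operating system: it assumes a Python 3 interpreter, whose unicodedata module follows a later version of the Unicode Standard than Version 3.0. The helper names describe and compose are hypothetical, and NFC normalization is used here as one way to form composed characters.

```python
import unicodedata


def describe(ch):
    """Return a few Unicode properties of a single character."""
    return {
        "code_position": ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),        # e.g. 'Lu' = uppercase letter
        "combining": unicodedata.combining(ch),      # 0 for base characters
        "direction": unicodedata.bidirectional(ch),  # e.g. 'L', 'R', 'NSM'
    }


def compose(text):
    """Combine base + combining-mark sequences into composed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


# ASCII and ISO 8859-1 byte values equal the UCS code positions.
assert "A".encode("ascii")[0] == ord("A") == 0x41
assert "\u00e9".encode("iso8859-1")[0] == ord("\u00e9") == 0xE9

# 'e' followed by COMBINING ACUTE ACCENT composes to a single character.
print(describe("\u0301"))
print(compose("e\u0301") == "\u00e9")
```

The two assert statements check the statement that ASCII and ISO 8859-1 occupy the first 128 and 256 code positions of the UCS.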
The Unicode and ISO/IEC 10646 standards specify a uniform character size and allow character units to be processed for all languages by using the same set of rules. Therefore, system support for the universal character set does not need to include multiple algorithms (one or more per language) for converting between file code and internal process code. However, the two different character sizes (16-bit or 32-bit) that the standards support require different parsing schemes for data input and output:

    o   Universal character encoding that an implementation parses in 16-bit units (2 octets) is known as UCS-2. This is the canonical Unicode encoding in wide use on PC systems.

    o   Universal character encoding that an implementation parses in 32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC 10646 encoding that is in use on systems that can support the larger data unit size.

The operating system supports UCS-2 with codeset converters and UCS-4 with both codeset converters and locales. The locales whose names include the string @ucs4 allow use of UCS-4 for internal process code with proprietary file encoding formats.

The standards define a number of transformation formats for the universal character set. For the most part, the following UCS transformation formats (UTFs) exist to transform UCS values into sequences of bytes for handling by various byte-oriented protocols:

    o   UTF-8, the standard method for transforming UCS-4 process encoding into a sequence of 8-bit bytes and ensuring interchange transparency for characters in C0 code positions (0 to 31), the SPACE (32) character, and the DEL (127) character

        The operating system supports UTF-8 with both codeset converters and locales.

    o   UTF-7, an obsolete interchange format for environments that strip the eighth bit from each byte

        The operating system does not support UTF-7.
    o   UTF-1, an obsolete interchange format that is similar to UTF-8 but also ensures interchange transparency of characters in C1 code positions (128 to 159)

        The operating system does not support UTF-1.

    o   UTF-16, which handles the surrogate character extensions defined by Version 2.0 of the Unicode Standard and represents characters in 2-byte units

        The surrogate character extensions are characters whose values in UCS-4 are outside the range normally allowed by a 16-bit length restriction. When data includes these characters, the UTF-16 transformation format enables data exchange between applications using UCS-4 and applications that require the data to be in UCS-2 (2-byte) format. Although UTF-16 does not support representation of the entire UCS-4 code space, it supports all characters (except those in certain private-use ranges) that are currently defined for the languages covered by both standards.

        Byte orientation in file code can differ and, depending on the platform on which the file was generated, can be little-endian (LE) or big-endian (BE). UTF-16 uses a byte order mark (BOM), which is not part of the file text data, to indicate byte orientation. The code point of the BOM is U+FEFF. The Unicode Standard also defines UTF-16LE and UTF-16BE, which are specific to the little-endian and big-endian orientations, respectively, and do not include a byte order mark.

        The operating system supports UTF-16, UTF-16LE, and UTF-16BE through codeset converters. In terms of codeset converter names, UTF-16* is recognized as an alias for UCS-2 but also enables codeset conversion of surrogate character extensions.

        Note

        By default, the operating system uses UTF-16 rather than UTF-16LE or UTF-16BE. That is, in an input file, the software first looks for a BOM. If a BOM is not found, the converter assumes UTF-16LE. This means that you must explicitly specify UTF-16BE to the converter (convert files manually) when UTF-16BE applies to an input file.
        For an output file, the converter automatically inserts a BOM. This means that you must explicitly specify UTF-16LE or UTF-16BE (convert files manually) when you want conversion output to be UTF-16LE or UTF-16BE rather than UTF-16.

    o   UTF-32, which also supports the surrogate character extensions defined by the Unicode Standard but allows character representation in 4-byte encoding units

        In addition, UTF-32 is restricted in values to the range 0 to 10FFFF (hexadecimal), which precisely matches the range of character values defined in the Unicode Standard. Unlike UTF-16, UTF-32 does not support private-use ranges for character values and therefore promotes interoperability among Unicode encoding formats.

        UTF-32 uses a byte order mark to indicate little-endian or big-endian byte orientation. The Unicode Standard also defines UTF-32LE and UTF-32BE, which are specific to the little-endian and big-endian orientations, respectively, and do not include a byte order mark.

        UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset converters to process UTF-32. However, the UCS-4 converter software has not yet been changed to support UTF-32, UTF-32LE, or UTF-32BE as alias names in the way that the UTF-16* strings are supported by the UCS-2 converters.

Codeset Conversion

Codeset converters are available to convert data in all the major encoding formats that the operating system supports to and from UCS-2, UCS-4, and UTF-8. If the worldwide support subsets are installed on your system, you can enter the following commands to find the names of these converters:

    % cd /usr/lib/nls/loc/iconv
    % ls | grep UTF
    % ls | grep UCS

Among the converters listed, you will find some that handle conversion of data in the code-page format used on PC systems. See the code_page(5) reference page for more information about converting between codeset and code-page formats.

All codeset converters can be used with the iconv command and associated library functions.
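As an illustration only, the following Python sketch mimics what a codeset converter does, including the rule that a UTF-16 input file is checked for a BOM and treated as little-endian when none is found. It uses Python's built-in codecs, not the converters in /usr/lib/nls/loc/iconv. The function names decode_utf16 and convert are hypothetical, and the Python codec names (for example, utf-16-be) are assumptions that do not correspond to the operating system's converter names.

```python
import codecs


def decode_utf16(data):
    """Decode UTF-16 bytes: honor a BOM if present, otherwise assume LE."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-le")  # no BOM: little-endian is assumed


def convert(data, from_codeset, to_codeset):
    """Rough analogue of an iconv-style conversion, using Python codec names."""
    if from_codeset.lower() == "utf-16":
        text = decode_utf16(data)
    else:
        text = data.decode(from_codeset)
    return text.encode(to_codeset)


latin1 = "caf\u00e9".encode("iso8859-1")        # 4 bytes, one per character
utf8 = convert(latin1, "iso8859-1", "utf-8")    # the e-acute becomes 2 bytes
utf16be = convert(utf8, "utf-8", "utf-16-be")   # no BOM, 2 bytes per character
with_bom = codecs.BOM_UTF16_BE + utf16be        # same data behind a U+FEFF mark

print(utf8)
print(utf16be.hex())
print(decode_utf16(with_bom))

# A character beyond U+FFFF is a surrogate pair (two 2-byte units) in UTF-16
# but a single 4-byte unit in UTF-32.
print("\U00010000".encode("utf-16-be").hex())
print("\U00010000".encode("utf-32-be").hex())
```

Because UTF-16BE data without a BOM would be misread under the little-endian default, the sketch, like the converters, needs the caller to name utf-16-be explicitly in that case.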
Note

There was a change in mapping of Korean Hangul characters between Version 1.1 and Version 2.0 of the Unicode Standard. By default, UCS-2, UCS-4, and UTF-8 conversion assumes Version 2.0 character mapping for Hangul characters. Therefore, if data is in Version 1.1 format, the data must first be converted to Version 2.0 format before it is converted from UCS-2, UCS-4, or UTF-8 to an entirely different format.

The format of a codeset converter name is from-codeset_to-codeset. In converter names, the Version 1.1 codeset formats for UCS-2, UCS-4, and UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are represented by UCS-2, UCS-4, and UTF-8. For example, if Korean data is currently in UCS-4 Version 1.1 format, the data must first be processed by the UNICODE-1-1-UCS-4_UCS-4 converter before being processed by the UCS-4_deckorean converter.

See the iconv_intro(5) reference page for general information on codeset conversion.

Locales

The following locales use UCS-4 as internal processing code:

universal.UTF-8
    This locale converts data in UTF-8 file format to UCS-4 process code. The locale can be used to test any UCS-4 character to determine if it is included in one of the following classes defined for the LC_CTYPE category: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, or xdigit. In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those for the POSIX (C) locale.

native_locale_name@ucs4
    These locales (for example, fr_FR.ISO8859-1@ucs4) perform the same function as the universal.UTF-8 locale but are different in the following ways:

    o   The file code is specified by the codeset portion (for example, ISO8859-1) of native_locale_name.

    o   Classification information is not provided for the full set of UCS-4 characters, but only for those in a particular native language (for example, French).
    o   Country-specific data is also available to the application. The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those defined in native_locale_name.

language_territory.UTF-8
    These locales (for example, fr_FR.UTF-8) are similar to the @ucs4 locales in limiting classification information to the characters in a particular native language and making country-specific data available to the application. However, the locales assume file data follows UTF-8 encoding rules and are the only locales that support the euro monetary character (€).

Note

CDE desktop users can select locales by choosing names followed by (Unicode) from the CDE language menu at session startup. In this case, the locale setting applies by default to all applications run during the CDE session.

Unicode Character Database

For the convenience of programmers, the source file for the Unicode character database (Version 3.0.0) is available online. This source file is the one used to build the locales provided in optional software subsets included with the operating system product. If the locales are installed on your system, both the Unicode character database and an associated ReadMe file are also installed in the /usr/share/unidata directory. The ReadMe file discusses the character properties supported by Unicode.

Font Support

The operating system provides the following types of bitmap fonts for UCS characters:

    o   Public domain Unicode fonts:

        -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
        -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
        -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

    o   Composite fonts that the libfr_FGC font renderer creates by combining fonts available for other codesets

These fonts currently cover only a subset of the characters in UCS. Each of the ETL public domain fonts supports about 1000 characters, but does not include any characters for Chinese, Japanese, or Korean.
The composite fonts created by the font renderer are generated only from fonts available for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9) codesets. Refer to iso8859-1(5) and iso8859-15(5) for the names of fonts available for Latin-1 and Latin-9 characters. Note that the Latin-9 fonts, which include glyphs for the euro character, provide the best support for the language_territory.UTF-8 locales, which also support this character.

For information on printer support and converting bitmap font encoding to PostScript, see i18n_printing(5) and wwpsof(8).

SEE ALSO
Commands: locale(1), wwpsof(8)

Others: ascii(5), code_page(5), iso8859-1(5), iso8859-15(5), i18n_intro(5), i18n_printing(5), iconv_intro(5), l10n_intro(5)