utf-8(5) [osf1 man page]

Unicode(5)							File Formats Manual							Unicode(5)

NAME

       Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32, iso10646 - Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION

       The operating system provides locales and codeset converters that support the following standards:

              The Unicode Standard, Version 3.0, Unicode, Inc., 1999

              Information Technology-Universal Multiple-Octet Coded Character Set, ISO/IEC 10646:1993

	      The Basic Multilingual Plane defined by this standard is identical with the main body of Unicode character encoding.

       These standards define generalized character encoding rules that can be applied to characters in most native language scripts. The Unicode
       Standard specifies a universal character set (UCS) that contains definitions in Version 3.0 for 49,194 characters and also includes a
       Private Use Area for vendor- or user-defined characters. The following list summarizes the main features of this character set:

              All characters are treated as 16-bit units.

              Each 16-bit unit has an abstract character identity.

              Certain sequences of 16-bit characters in a text stream are transformed into other characters, called composed characters.

              Characters have properties, such as base, numeric, spacing, combination, and directionality. The Unicode standard provides rules
              for ordering characters with different properties so that parsing of character sequences is unambiguous.

              The relationship between Unicode characters and the glyphs in the native language script that users see, type, or print is not
              necessarily one-to-one. A glyph may be mapped to a single abstract character or a composed character. Conversely, more than one
              glyph can be mapped to a character.

              The ISO 8859-1 character set occupies the first 256 code positions (and the ASCII character set the first 128 positions) of the
              UCS.
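
       As a rough illustration of these features, the following Python sketch uses the standard unicodedata module. Its tables track a later
       Unicode version than 3.0, so treat it as an analogy rather than a description of the locales shipped with the operating system. It
       composes a base character with a combining mark, queries a few properties, and checks that ISO 8859-1 byte values map to the first 256
       code positions.

```python
import unicodedata

# A base letter followed by a combining acute accent (two abstract characters).
decomposed = "e\u0301"

# Canonical composition yields the single composed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A few character properties: general category, combining class, directionality.
properties = {
    "category": unicodedata.category("\u0301"),       # nonspacing mark
    "combining": unicodedata.combining("\u0301"),     # nonzero for combining marks
    "bidirectional": unicodedata.bidirectional("A"),  # left-to-right letter
}

# Every ISO 8859-1 byte value decodes to the code position with the same number.
latin1_matches = all(bytes([i]).decode("iso8859-1") == chr(i) for i in range(256))

if __name__ == "__main__":
    print(len(decomposed), len(composed), hex(ord(composed)))
    print(properties)
    print(latin1_matches)
```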

       The ISO/IEC 10646 standard specifies both 16- and 32-bit units for each abstract character defined in the UCS.  The 16-bit character
       values in Unicode are zero-extended through a second 16-bit unit in the larger encoding format. The second, or low-surrogate,  16-bit  unit
       is reserved for future use in both standards.

       The  Unicode  and  ISO/IEC  10646 standards specify a uniform character size and allow character units to be processed for all languages by
       using the same set of rules. Therefore, system support for the universal character set does not need to include multiple algorithms (one or
       more  per  language)  for  converting  between  file  code and internal process code. However, the two different character sizes (16-bit or
       32-bit) that the standards support require different parsing schemes for data input and output. Universal character encoding that an imple-
       mentation parses in 16-bit units (2 octets) is known as UCS-2.  This is the canonical Unicode encoding in wide use on PC systems. Universal
       character encoding that an implementation parses in 32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC 10646 encoding
       that is in use on systems that can support the larger data unit size.
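
       The difference between the two unit sizes can be sketched in Python as follows. The code_units helper is hypothetical and exists only
       for illustration; it lays out each character value as one big-endian unit of 2 or 4 octets and rejects values that do not fit in a
       single 16-bit unit.

```python
def code_units(text, width):
    """Return each character of text as one big-endian unit of `width` octets (hex)."""
    if width not in (2, 4):
        raise ValueError("width must be 2 (UCS-2 style) or 4 (UCS-4 style)")
    units = []
    for ch in text:
        value = ord(ch)
        if width == 2 and value > 0xFFFF:
            # A single 16-bit unit cannot hold values beyond U+FFFF.
            raise ValueError("character does not fit in one 16-bit unit")
        units.append(value.to_bytes(width, "big").hex())
    return units

if __name__ == "__main__":
    print(code_units("A", 2))  # 2-octet units
    print(code_units("A", 4))  # the same value, zero-extended to 4 octets
```

       A parser must know the unit width in advance: the same octet stream read with the wrong width yields different character values.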

       The  operating  system  supports  UCS-2 with codeset converters and UCS-4 with both codeset converters and locales. The locales whose names
       include the string @ucs4 allow use of UCS-4 for internal process code with proprietary file encoding formats.

       The standards define a number of transformation formats for the universal character set.  For the most part, the following UCS
       transformation formats (UTFs) exist to transform UCS values into sequences of bytes for handling by various byte-oriented protocols:

       UTF-8, the standard method for transforming UCS-4 process encoding into a sequence of 8-bit bytes and ensuring interchange transparency
       for characters in C0 code positions (0 to 31), the SPACE (32) character, and the DEL (127) character

              The operating system supports UTF-8 with both codeset converters and locales.

       UTF-7, an obsolete interchange format for environments that strip the eighth bit from each byte

              The operating system does not support UTF-7.

       UTF-1, an obsolete interchange format that is similar to UTF-8 but also ensures interchange transparency of characters in C1 code
       positions (128 to 159)

              The operating system does not support UTF-1.

       UTF-16, which handles the surrogate character extensions defined by Version 2.0 of the Unicode Standard and represents characters in
       2-byte units

	      The surrogate character extensions are characters whose values in UCS-4 are outside the range normally allowed by  a  16-bit  length
	      restriction.  When data includes these characters, the UTF-16 transformation format enables data exchange between applications using
	      UCS-4 and applications that require the data to be in UCS-2 (2-byte) format. Although UTF-16 does not support representation of  the
	      entire  UCS-4  code  space, it supports all characters (except those in certain private-use ranges) that have been currently defined
	      for the languages covered by both standards.

	      Byte orientation in file code can differ and, depending on the platform on which the file was generated, can be  little-endian  (LE)
	      or big-endian (BE).  UTF-16 uses a byte order mark (BOM), which is not part of the file text data, to indicate byte orientation. The
	      code point of the BOM is U+FEFF. The Unicode Standard also defines UTF-16LE and UTF-16BE, which are specific  to	the  little-endian
	      and big-endian orientations, respectively, and do not include a byte order mark.

	      The  operating  system  supports	UTF-16,  UTF-16LE,  and  UTF-16BE through codeset converters. In terms of codeset converter names,
	      UTF-16* is recognized as an alias for UCS-2 but also enables codeset conversion of surrogate character extensions.

									      Note

              By default, the operating system uses UTF-16 rather than UTF-16LE or UTF-16BE. That is, in an input file, the software first looks
              for a BOM. If a BOM is not found, the converter assumes UTF-16LE. This means that you must explicitly specify UTF-16BE to the
              converter (convert files manually) when UTF-16BE applies to an input file. For an output file, the converter automatically inserts
              a BOM. This means that you must explicitly specify UTF-16LE or UTF-16BE (convert files manually) when you want conversion output
              to be UTF-16LE or UTF-16BE rather than UTF-16.

       UTF-32, which also supports the surrogate character extensions defined by the Unicode Standard but allows character representation in
       4-byte encoding units

	      In  addition, UTF-32 is restricted in values to the range 0 to 10FFFF, which precisely matches the range of character values defined
	      in the Unicode Standard. Unlike UTF-16, UTF-32 does not support private-use ranges  for  character  values  and  therefore  promotes
	      interoperability among Unicode encoding formats.

              UTF-32 uses a byte order mark to indicate little-endian or big-endian byte orientation. The Unicode standard also defines UTF-32LE
              and UTF-32BE, which are specific to the little-endian and big-endian orientations, respectively, and do not include a byte order
              mark.

	      UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset converters to process UTF-32. However, the UCS-4 converter software
	      has not yet been changed to support UTF-32, UTF-32LE, or UTF-32BE as alias names in the way that the UTF-16* strings  are  supported
	      by the UCS-2 converters.
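
       The surrogate mechanism and the byte order mark handling described for UTF-16 can be sketched in Python. Both helpers below are
       hypothetical names chosen for this example and are not part of the operating system's converters. The default of little-endian when no
       BOM is found mirrors the behavior described in the note above; the range check uses the 0 to 10FFFF limit given for UTF-32.

```python
import codecs

def to_surrogate_pair(code_point):
    """Split a value above U+FFFF into a high and a low 16-bit surrogate unit."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("value is outside the range encoded with surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def split_bom(data, default="little"):
    """Return (byte order, payload); use `default` when no BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", data[2:]
    return default, data

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10400)
    print(hex(high), hex(low))
    print(split_bom(b"\xfe\xff\x00A"))  # BOM present: big-endian
    print(split_bom(b"A\x00"))          # no BOM: falls back to the default
```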

   Codeset Conversion
       Codeset converters are available to convert data in all the major encoding formats that the operating system supports to and from UCS-2,
       UCS-4, and UTF-8.  If the worldwide support subsets are installed on your system, you can enter the following commands to find the names
       of these converters:

              % cd /usr/lib/nls/loc/iconv
              % ls | grep UTF
              % ls | grep UCS

       Among  the  converters  listed,	you  will  find  some  that  handle conversion of data in the code-page format used on PC systems. See the
       code_page(5) reference page for more information about converting between codeset and code-page formats.  All  codeset  converters  can	be
       used with the iconv command and associated library functions.
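
       For readers without access to these converters, the general idea of codeset conversion can be approximated with Python's built-in
       codecs. This is only an analogy: the codec names below are Python's, not the converter names installed in /usr/lib/nls/loc/iconv, and
       the convert helper is a hypothetical name.

```python
def convert(data, from_codeset, to_codeset):
    """Decode bytes from one codeset and re-encode them in another."""
    return data.decode(from_codeset).encode(to_codeset)

if __name__ == "__main__":
    latin1 = b"caf\xe9"  # "cafe" with an acute accent, in ISO 8859-1
    print(convert(latin1, "iso8859-1", "utf-8"))      # accented letter becomes two bytes
    print(convert(latin1, "iso8859-1", "utf-32-be"))  # every character becomes four bytes
```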

									  Note

       There was a change in mapping of Korean Hangul characters between Version 1.1 and Version 2.0 of the Unicode Standard. By default, UCS-2,
       UCS-4, and UTF-8 conversion assumes Version 2.0 character mapping for Hangul characters.  Therefore, if data is in Version 1.1 format, the
       data must first be converted to Version 2.0 format before converting from UCS-2, UCS-4, or UTF-8 to an entirely different format.

       The format of a codeset converter name is from-codeset_to-codeset.  In converter names, the Version 1.1 codeset formats for UCS-2, UCS-4,
       and UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are
       represented by UCS-2, UCS-4, and UTF-8. For example, if Korean data is currently in UCS-4 Version 1.1 format, the data must first be
       processed by the UNICODE-1-1-UCS-4_UCS-4 converter before being processed by the UCS-4_deckorean converter.
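
       The naming rule and the two-step Hangul conversion can be sketched in Python. The function names are hypothetical; only the codeset
       names and the from-codeset_to-codeset pattern come from the text above. The sketch builds converter names and does not invoke any
       converter.

```python
# Version 1.1 codeset names, keyed by the corresponding Version 2.0 name.
VERSION_1_1_NAMES = {
    "UCS-2": "UNICODE-1-1",
    "UCS-4": "UNICODE-1-1-UCS-4",
    "UTF-8": "UNICODE-1-1-UTF-8",
}

def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return f"{from_codeset}_{to_codeset}"

def conversion_steps(from_codeset, to_codeset, version_1_1=False):
    """List the converters to apply, in order, for one conversion."""
    steps = []
    if version_1_1:
        # Version 1.1 data is first brought up to the Version 2.0 mapping.
        steps.append(converter_name(VERSION_1_1_NAMES[from_codeset], from_codeset))
    steps.append(converter_name(from_codeset, to_codeset))
    return steps

if __name__ == "__main__":
    for name in conversion_steps("UCS-4", "deckorean", version_1_1=True):
        print(name)
```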

       See the iconv_intro(5) reference page for general information on codeset conversion.

   Locales
       The following locales use UCS-4 as internal processing code:

       universal.UTF-8

	      This  locale converts data in UTF-8 file format to UCS-4 process code.  The locale can be used to test any UCS-4 character to deter-
	      mine if it is included in one of the following classes defined for the LC_CTYPE category: alnum, alpha, blank, cntrl, digit,  graph,
	      lower, print, punct, space, upper, or xdigit.

              In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those for the
              POSIX (C) locale.

       native_locale_name@ucs4

              These locales (for example, fr_FR.ISO8859-1@ucs4) perform the same function as the universal.UTF-8 locale but are different in the
              following ways:

                     The file code is specified by the codeset portion (for example, ISO8859-1) of native_locale_name.

                     Classification information is not provided for the full set of UCS-4 characters, but only for those in a particular
                     native language (for example, French).

                     Country-specific data is also available to the application.

                     The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME category definitions match those defined in
                     native_locale_name.

       language_territory.UTF-8

              These locales (for example, fr_FR.UTF-8) are similar to the @ucs4 locales in limiting classification information to the characters
              in a particular native language and making country-specific data available to the application. However, the locales assume file
              data follows UTF-8 encoding rules and are the only locales that support the euro monetary character (€).
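
       The locale names in this section follow a language_territory.codeset@modifier pattern. The Python sketch below splits such a name into
       its parts. The parse_locale_name helper is hypothetical and only inspects the string; it does not check whether the locale is
       installed.

```python
def parse_locale_name(name):
    """Split a locale name into language, territory, codeset, and modifier."""
    base, _, modifier = name.partition("@")
    language_territory, _, codeset = base.partition(".")
    language, _, territory = language_territory.partition("_")
    return {
        "language": language,
        "territory": territory,  # empty when the name has no territory part
        "codeset": codeset,      # file code, for example ISO8859-1 or UTF-8
        "modifier": modifier,    # for example ucs4; empty when absent
    }

if __name__ == "__main__":
    for name in ("universal.UTF-8", "fr_FR.ISO8859-1@ucs4", "fr_FR.UTF-8"):
        print(name, parse_locale_name(name))
```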

									  Note

       CDE desktop users can select locales by choosing names followed by (Unicode) from the CDE language menu at session startup. In  this  case,
       the locale setting applies by default to all applications run during the CDE session.

   Unicode Character Database
       For  the  convenience  of  programmers, the source file for the Unicode character database (Version 3.0.0) is available online. This source
       file is the one used to build the locales provided in optional software subsets included with the operating system product. If the  locales
       are  installed  on  your  system,  both	the  Unicode  character  database  and	an  associated	ReadMe	file  are  also  installed  in the
       /usr/share/unidata directory.  The ReadMe file discusses the character properties supported by Unicode.
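
       This page does not name the database file or describe its layout. Assuming the file follows the semicolon-separated record layout of
       the UnicodeData file published by the Unicode Consortium (15 fields per line), a record could be read in Python roughly as follows. The
       field labels and the parse_record helper are choices made for this sketch.

```python
# Field labels for one record; the labels themselves are illustrative.
FIELDS = (
    "code", "name", "category", "combining", "bidirectional", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
    "uppercase", "lowercase", "titlecase",
)

def parse_record(line):
    """Split one semicolon-separated record into a dictionary of fields."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of fields in record")
    return dict(zip(FIELDS, values))

if __name__ == "__main__":
    record = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
    print(int(record["code"], 16), record["name"], record["category"])
```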

   Font Support
       The operating system provides the following types of bitmap fonts for UCS characters:

       Public domain Unicode fonts:

              -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
              -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
              -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

       Composite fonts that the libfr_FGC font renderer creates by combining fonts available for other codesets

       These fonts currently cover only a subset of the characters in UCS.  Each of the ETL public domain fonts supports  about  1000  characters,
       but  does  not include any characters for Chinese, Japanese, or Korean. The composite fonts created by the font renderer are generated only
       from fonts available for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9) codesets.

       Refer to iso8859-1(5) and iso8859-15(5) for the names of fonts available for Latin-1 and Latin-9 characters. Note that the  Latin-9  fonts,
       which  include  glyphs  for  the euro character, provide the best support for the language_territory.UTF-8 locales, which also support this
       character.

       For information on printer support and converting bitmap font encoding to PostScript, see i18n_printing(5) and wwpsof(8).

SEE ALSO

       Commands: locale(1), wwpsof(8)

       Others: ascii(5), code_page(5), iso8859-1(5), iso8859-15(5), i18n_intro(5), i18n_printing(5), iconv_intro(5), l10n_intro(5)

																	Unicode(5)