tcl_externaltoutf(3) [opendarwin man page]

Tcl_GetEncoding(3)					      Tcl Library Procedures						Tcl_GetEncoding(3)

__________________________________________________________________________________________________________________________________________________

NAME

       Tcl_GetEncoding,    Tcl_FreeEncoding,	Tcl_ExternalToUtfDString,    Tcl_ExternalToUtf,    Tcl_UtfToExternalDString,	Tcl_UtfToExternal,
       Tcl_WinTCharToUtf, Tcl_WinUtfToTChar, Tcl_GetEncodingName, Tcl_SetSystemEncoding, Tcl_GetEncodingNames, Tcl_CreateEncoding,
       Tcl_GetDefaultEncodingDir, Tcl_SetDefaultEncodingDir - procedures for creating and using encodings.

SYNOPSIS

       #include <tcl.h>

       Tcl_Encoding
       Tcl_GetEncoding(interp, name)

       void
       Tcl_FreeEncoding(encoding)

       char *
       Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)

       int
       Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
	    dstCharsPtr)

       char *
       Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)

       int
       Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
	    dstCharsPtr)

       char *
       Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)

       TCHAR *
       Tcl_WinUtfToTChar(src, srcLen, dstPtr)

       CONST char *
       Tcl_GetEncodingName(encoding)

       int
       Tcl_SetSystemEncoding(interp, name)

       void
       Tcl_GetEncodingNames(interp)

       Tcl_Encoding
       Tcl_CreateEncoding(typePtr)

       CONST char *
       Tcl_GetDefaultEncodingDir(void)

       void
       Tcl_SetDefaultEncodingDir(path)

ARGUMENTS

       Tcl_Interp	   *interp	  (in)	    Interpreter to use for error reporting, or NULL if no error reporting is desired.

       CONST char	   *name	  (in)	    Name of encoding to load.

       Tcl_Encoding	   encoding	  (in)	    The  encoding  to  query,  free, or use for converting text.  If encoding is NULL, the current
						    system encoding is used.

       CONST char	   *src 	  (in)	    For the Tcl_ExternalToUtf functions, an array of bytes in the specified encoding that  are	to
						    be converted to UTF-8.  For the Tcl_UtfToExternal and Tcl_WinUtfToTChar functions, an array of
						    UTF-8 characters to be converted to the specified encoding.

       CONST TCHAR	   *tsrc	  (in)	    An array of Windows TCHAR characters to convert to UTF-8.

       int		   srcLen	  (in)	    Length of src or tsrc in bytes.  If the length is negative, the  encoding-specific	length	of
						    the string is used.

       Tcl_DString	   *dstPtr	  (out)     Pointer to an uninitialized or free Tcl_DString in which the converted result will be stored.

       int		   flags	  (in)	    Various  flag bits OR-ed together.	TCL_ENCODING_START signifies that the source buffer is the
						    first block in a (potentially multi-block) input stream, telling  the  conversion  routine	to
						    reset  to an initial state and perform any initialization that needs to occur before the first
						    byte is converted.	TCL_ENCODING_END signifies that the source buffer is the last block  in  a
						    (potentially  multi-block)	input stream, telling the conversion routine to perform any final-
						    ization that needs to occur after the last byte is converted and then to reset to  an  initial
						    state.   TCL_ENCODING_STOPONERROR  signifies that the conversion routine should return immedi-
						    ately upon reading a source character that doesn't exist in the target encoding;  otherwise  a
						    default fallback character will automatically be substituted.

       Tcl_EncodingState   *statePtr	  (in/out)  Used  when	converting a (generally long or indefinite length) byte stream in a piece by piece
						    fashion.  The conversion routine stores its current state in *statePtr after src  (the  buffer
						    containing	the  current piece) has been converted; that state information must be passed back
						    when converting the next piece of the stream so the conversion routine knows what state it was
						    in when it left off at the end of the last piece.  May be NULL, in which case the value speci-
						    fied for flags is ignored and the source buffer is assumed to contain the complete	string	to
						    convert.

       char		   *dst 	  (out)     Buffer in which the converted result will be stored.  No more than dstLen bytes will be stored
						    in dst.

       int		   dstLen	  (in)	    The maximum length of the output buffer dst in bytes.

       int		   *srcReadPtr	  (out)     Filled with the number of bytes from src that were actually converted.  This may be less  than
						    the  original  source length if there was a problem converting some source characters.  May be
						    NULL.

       int		   *dstWrotePtr   (out)     Filled with the number of bytes that were actually stored in the output buffer as a result	of
						    the conversion.  May be NULL.

       int		   *dstCharsPtr   (out)     Filled with the number of characters that correspond to the number of bytes stored in the out-
						    put buffer.  May be NULL.

       Tcl_EncodingType    *typePtr	  (in)	    Structure that defines a new type of encoding.

       CONST char	   *path	  (in)	    A path to the location of the encoding file.
_________________________________________________________________

INTRODUCTION

       These routines convert between Tcl's internal character representation, UTF-8, and character representations used by various operating
       systems or file systems, such as Unicode, ASCII, or Shift-JIS.  When operating on strings, such as obtaining the names of files or
       displaying characters using international fonts, the strings must be translated into one or possibly multiple formats that the various sys-
       tem  calls  can	use.  For instance, on a Japanese Unix workstation, a user might obtain a filename represented in the EUC-JP file encoding
       and then translate the characters to the jisx0208 font encoding in order to display the filename in a Tk widget.  The purpose of the encod-
       ing  package  is  to help bridge the translation gap.  UTF-8 provides an intermediate staging ground for all the various encodings.  In the
       example above, text would be translated into UTF-8 from whatever file encoding the operating system is using.  Then it would be	translated
       from UTF-8 into whatever font encoding the display routines require.

       Some  basic  encodings  are  compiled into Tcl.	Others can be defined by the user or dynamically loaded from encoding files in a platform-
       independent manner.

DESCRIPTION

       Tcl_GetEncoding finds an encoding given its name.  The name may refer to a builtin Tcl encoding,  a  user-defined  encoding  registered	by
       calling	Tcl_CreateEncoding,  or a dynamically-loadable encoding file.  The return value is a token that represents the encoding and can be
       used in subsequent calls to procedures such as Tcl_GetEncodingName, Tcl_FreeEncoding, and Tcl_UtfToExternal.  If the name did not refer	to
       any known or loadable encoding, NULL is returned and an error message is left in interp.

       The  encoding  package  maintains  a  database  of all encodings currently in use.  The first time name is seen, Tcl_GetEncoding returns an
       encoding with a reference count of 1.  If the same name is requested further times, then the reference count for that  encoding	is  incre-
       mented without the overhead of allocating a new encoding and all its associated data structures.

       When  an  encoding  is  no  longer  needed, Tcl_FreeEncoding should be called to release it.  When an encoding is no longer in use anywhere
       (i.e., it has been freed as many times as it has been gotten) Tcl_FreeEncoding will release all storage the encoding was using  and  delete
       it from the database.
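
       The reference-counting rule described above can be pictured with a small stand-alone model.  This is only a sketch: ToyEncoding,
       ToyGet, and ToyFree are hypothetical names invented for the illustration, and Tcl's real encoding database is internal and more
       elaborate.

```c
#include <assert.h>
#include <string.h>

/* Toy model only; not Tcl's actual data structure. */
typedef struct ToyEncoding {
    char name[32];
    int refCount;               /* 0 means the slot is unused */
} ToyEncoding;

#define TOY_MAX 8
static ToyEncoding toyDb[TOY_MAX];

/* First request for a name creates an entry with refCount 1;
 * later requests for the same name only increment the count. */
static ToyEncoding *ToyGet(const char *name)
{
    ToyEncoding *freeSlot = NULL;
    assert(name != NULL && strlen(name) < sizeof toyDb[0].name);
    for (int i = 0; i < TOY_MAX; i++) {
        if (toyDb[i].refCount > 0 && strcmp(toyDb[i].name, name) == 0) {
            toyDb[i].refCount++;
            return &toyDb[i];
        }
        if (toyDb[i].refCount == 0 && freeSlot == NULL) {
            freeSlot = &toyDb[i];
        }
    }
    if (freeSlot == NULL) {
        return NULL;            /* toy table is full */
    }
    strcpy(freeSlot->name, name);
    freeSlot->refCount = 1;
    return freeSlot;
}

/* Each free undoes one get; the entry is deleted when the count hits 0. */
static void ToyFree(ToyEncoding *enc)
{
    assert(enc != NULL && enc->refCount > 0);
    if (--enc->refCount == 0) {
        enc->name[0] = '\0';
    }
}
```

       In this model two ToyGet calls for the same name must be balanced by two ToyFree calls before the entry disappears, which is the
       same discipline that Tcl_GetEncoding and Tcl_FreeEncoding expect of their callers.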

       Tcl_ExternalToUtfDString  converts  a  source buffer src from the specified encoding into UTF-8.  The converted bytes are stored in dstPtr,
       which is then null-terminated.  The caller should eventually call Tcl_DStringFree to free any information stored in dstPtr.  When  convert-
       ing, if any of the characters in the source buffer cannot be represented in the target encoding, a default fallback character will be used.
       The return value is a pointer to the value stored in the DString.

       Tcl_ExternalToUtf converts a source buffer src from the specified encoding into UTF-8.  Up to srcLen bytes are converted  from  the  source
       buffer and up to dstLen converted bytes are stored in dst.  In all cases, *srcReadPtr is filled with the number of bytes that were success-
       fully converted from src and *dstWrotePtr is filled with the corresponding number of bytes that were stored in dst.  The  return  value	is
       one of the following:

	      TCL_OK			   All bytes of src were converted.

	      TCL_CONVERT_NOSPACE	   The	destination buffer was not large enough for all of the converted data; as many characters as could
					   fit were converted though.

	      TCL_CONVERT_MULTIBYTE	   The last few bytes in the source buffer were the beginning of a multibyte  sequence,  but  more  bytes
					   were  needed to complete this sequence.  A subsequent call to the conversion routine should pass a buf-
					   fer containing the unconverted bytes that remained in src plus  some  further  bytes  from  the  source
					   stream to properly convert the formerly split-up multibyte sequence.

	      TCL_CONVERT_SYNTAX	   The source buffer contained an invalid character sequence.  This may occur if the input stream has been
					   damaged or if the input encoding method was misidentified.

	      TCL_CONVERT_UNKNOWN	   The source buffer contained a character that could not  be  represented  in	the  target  encoding  and
					   TCL_ENCODING_STOPONERROR was specified.
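
       When a conversion stops early (for instance with TCL_CONVERT_MULTIBYTE), *srcReadPtr tells the caller how much of src was consumed,
       and the remaining bytes must be presented again together with further input.  A minimal sketch of that bookkeeping follows.
       CarryLeftover is a hypothetical helper, not part of Tcl, and it assumes the caller keeps its input in a single reusable buffer.

```c
#include <assert.h>
#include <string.h>

/* Moves the bytes the converter did not consume to the front of buf and
 * returns how many there are.  The caller then appends fresh input after
 * them and passes the combined bytes to the next conversion call. */
static int CarryLeftover(char *buf, int len, int srcRead)
{
    assert(buf != NULL && srcRead >= 0 && srcRead <= len);
    int left = len - srcRead;
    memmove(buf, buf + srcRead, (size_t) left);
    return left;
}
```

       For example, if a 5-byte buffer ends with the first two bytes of a three-byte sequence and the converter reports 3 bytes read,
       CarryLeftover(buf, 5, 3) returns 2 and leaves those two bytes at the start of buf.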

       Tcl_UtfToExternalDString  converts  a  source buffer src from UTF-8 into the specified encoding.  The converted bytes are stored in dstPtr,
       which is then terminated with the appropriate encoding-specific null.  The caller should eventually call Tcl_DStringFree to free any infor-
       mation  stored  in  dstPtr.  When converting, if any of the characters in the source buffer cannot be represented in the target encoding, a
       default fallback character will be used.  The return value is a pointer to the value stored in the DString.

       Tcl_UtfToExternal converts a source buffer src from UTF-8 into the specified encoding.  Up to srcLen bytes are converted  from  the  source
       buffer and up to dstLen converted bytes are stored in dst.  In all cases, *srcReadPtr is filled with the number of bytes that were success-
       fully converted from src and *dstWrotePtr is filled with the corresponding number of bytes that were stored in dst.  The return values  are
       the same as the return values for Tcl_ExternalToUtf.

       Tcl_WinUtfToTChar  and  Tcl_WinTCharToUtf are Windows-only convenience functions for converting between UTF-8 and Windows strings.  On Win-
       dows 95 (as with the Macintosh and Unix operating systems), all strings exchanged between Tcl and the operating system  are  "char"  based.
       On Windows NT, some strings exchanged between Tcl and the operating system are "char" oriented while others are in Unicode.  By convention,
       in Windows a TCHAR is a character in the ANSI code page on Windows 95 and a Unicode character on Windows NT.

       If you planned to use the same "char"  based  interfaces  on  both  Windows  95	and  Windows  NT,  you	could  use  Tcl_UtfToExternal  and
       Tcl_ExternalToUtf  (or  their  Tcl_DString  equivalents) with an encoding of NULL (the current system encoding).  On the other hand, if you
       planned to use the Unicode interface when running on Windows NT and the "char" interfaces when running on Windows 95,  you  would  have	to
       perform the following type of test over and over in your program (as represented in pseudo-code):
	      if (running NT) {
		  encoding <- Tcl_GetEncoding("unicode");
		  nativeBuffer <- Tcl_UtfToExternal(encoding, utfBuffer);
		  Tcl_FreeEncoding(encoding);
	      } else {
		  nativeBuffer <- Tcl_UtfToExternal(NULL, utfBuffer);
	      }

       Tcl_WinUtfToTChar  and  Tcl_WinTCharToUtf automatically handle this test and use the proper encoding based on the current operating system.
       Tcl_WinUtfToTChar returns a pointer to a TCHAR string, and Tcl_WinTCharToUtf expects a TCHAR string pointer as the src string.	Otherwise,
       these functions behave identically to Tcl_UtfToExternalDString and Tcl_ExternalToUtfDString.

       Tcl_GetEncodingName  is	roughly the inverse of Tcl_GetEncoding.  Given an encoding, the return value is the name argument that was used to
       create the encoding.  The string returned by Tcl_GetEncodingName is only guaranteed to persist until the encoding is deleted.   The  caller
       must not modify this string.

       Tcl_SetSystemEncoding  sets the default encoding that should be used whenever the user passes a NULL value for the encoding argument to any
       of the other encoding functions.  If name is NULL, the system encoding is reset to the default system encoding, binary.	If  the  name  did
       not  refer  to  any  known  or  loadable encoding, TCL_ERROR is returned and an error message is left in interp.  Otherwise, this procedure
       increments the reference count of the new system encoding, decrements the reference count of the old system encoding, and returns TCL_OK.

       Tcl_GetEncodingNames sets the interp result to a list consisting of the names of all the encodings that are currently  defined  or  can	be
       dynamically  loaded,  searching	the encoding path specified by Tcl_SetDefaultEncodingDir.  This procedure does not ensure that the dynami-
       cally-loadable encoding files contain valid data, but merely that they exist.

       Tcl_CreateEncoding defines a new encoding and registers the C procedures that are called back to convert between the  encoding  and  UTF-8.
       Encodings  created  by Tcl_CreateEncoding are thereafter visible in the database used by Tcl_GetEncoding.  Just as with the Tcl_GetEncoding
       procedure, the return value is a token that represents the encoding and can be used  in	subsequent  calls  to  other  encoding	functions.
       Tcl_CreateEncoding  returns  an encoding with a reference count of 1. If an encoding with the specified name already exists, then its entry
       in the database is replaced with the new encoding; the token for the old encoding will remain valid and continue to behave as  before,  but
       users of the new token will now call the new encoding procedures.

       The  typePtr  argument  to Tcl_CreateEncoding contains information about the name of the encoding and the procedures that will be called to
       convert between this encoding and UTF-8.  It is defined as follows:

	      typedef struct Tcl_EncodingType {
		CONST char *encodingName;
		Tcl_EncodingConvertProc *toUtfProc;
		Tcl_EncodingConvertProc *fromUtfProc;
		Tcl_EncodingFreeProc *freeProc;
		ClientData clientData;
		int nullSize;
	      } Tcl_EncodingType;

       The encodingName provides a string name for the encoding, by which it can be referred to in other procedures such as Tcl_GetEncoding.  The
       toUtfProc  refers  to  a  callback procedure to invoke to convert text from this encoding into UTF-8.  The fromUtfProc refers to a callback
       procedure to invoke to convert text from UTF-8 into this encoding.  The freeProc refers to a callback procedure to invoke when this  encod-
       ing is deleted.	The freeProc field may be NULL.  The clientData contains an arbitrary one-word value passed to toUtfProc, fromUtfProc, and
       freeProc whenever they are called.  Typically, this is a pointer to a data structure containing encoding-specific information that  can	be
       used  by the callback procedures.  For instance, two very similar encodings such as ascii and macRoman may use the same callback procedure,
       but use different values of clientData to control its behavior.	The nullSize specifies the number of zero bytes that signify end-of-string
       in this encoding.  It must be 1 (for single-byte or multi-byte encodings like ASCII or Shift-JIS) or 2 (for double-byte encodings like Uni-
       code).  Constant-sized encodings with 3 or more bytes per character (such as CNS11643) are not accepted.
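
       The nullSize value is what makes an "encoding-specific length" computable when a negative srcLen is given.  The sketch below shows
       one way such a length could be found.  EncodedLength is a hypothetical helper, not a Tcl function, and it assumes that 2-byte units
       start at the beginning of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length in bytes of src, excluding its terminator, where the
 * terminator is nullSize (1 or 2) zero bytes. */
static int EncodedLength(const char *src, int nullSize)
{
    assert(src != NULL && (nullSize == 1 || nullSize == 2));
    int len = 0;
    if (nullSize == 1) {
        while (src[len] != '\0') {
            len++;
        }
    } else {
        /* Advance one 2-byte unit at a time, so that a zero byte ending
         * one unit and a zero byte starting the next are not mistaken
         * for the terminator. */
        while (src[len] != '\0' || src[len + 1] != '\0') {
            len += 2;
        }
    }
    return len;
}
```

       Scanning by whole units matters for double-byte data: the bytes 41 00 00 42 00 00 contain two adjacent zero bytes in the middle,
       yet the string is 4 bytes long.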

       The callback procedures toUtfProc and fromUtfProc should match the type Tcl_EncodingConvertProc:

	      typedef int Tcl_EncodingConvertProc(
		ClientData clientData,
		CONST char *src,
		int srcLen,
		int flags,
		Tcl_EncodingState *statePtr,
		char *dst,
		int dstLen,
		int *srcReadPtr,
		int *dstWrotePtr,
		int *dstCharsPtr);

       The toUtfProc and fromUtfProc procedures are called by the Tcl_ExternalToUtf or Tcl_UtfToExternal family of functions to perform the actual
       conversion.   The  clientData  parameter  to  these procedures is the same as the clientData field specified to Tcl_CreateEncoding when the
       encoding was created.  The remaining arguments to the callback procedures are the  same	as  the  arguments,  documented  at  the  top,	to
       Tcl_ExternalToUtf or Tcl_UtfToExternal, with the following exceptions.  If the srcLen argument to one of those high-level functions is neg-
       ative, the value passed to the callback procedure will be  the  appropriate  encoding-specific  string  length  of  src.   If  any  of  the
       srcReadPtr,  dstWrotePtr,  or dstCharsPtr arguments to one of the high-level functions is NULL, the corresponding value passed to the call-
       back procedure will be a non-NULL location.
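
       To show how the three output counters relate, here is a stand-alone sketch of the core of a converter from ISO 8859-1 to UTF-8.
       It is not a complete Tcl_EncodingConvertProc: the clientData, flags, and statePtr parameters are omitted, and SKETCH_OK and
       SKETCH_NOSPACE are stand-ins for TCL_OK and TCL_CONVERT_NOSPACE so that the example compiles without <tcl.h>.  The function name
       is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Tcl status codes (assumption; see the text above). */
enum { SKETCH_OK = 0, SKETCH_NOSPACE = 1 };

/* Converts srcLen bytes of ISO 8859-1 into UTF-8, writing at most dstLen
 * bytes.  The three counters are always filled in, as the callback
 * contract requires. */
static int Latin1ToUtf8(const char *src, int srcLen, char *dst, int dstLen,
                        int *srcReadPtr, int *dstWrotePtr, int *dstCharsPtr)
{
    assert(src != NULL && dst != NULL);
    assert(srcReadPtr != NULL && dstWrotePtr != NULL && dstCharsPtr != NULL);
    int in = 0, out = 0, result = SKETCH_OK;

    while (in < srcLen) {
        unsigned char ch = (unsigned char) src[in];
        int need = (ch < 0x80) ? 1 : 2;     /* UTF-8 bytes for this char */
        if (out + need > dstLen) {
            result = SKETCH_NOSPACE;        /* stop before a partial char */
            break;
        }
        if (need == 1) {
            dst[out++] = (char) ch;
        } else {
            dst[out++] = (char) (0xC0 | (ch >> 6));
            dst[out++] = (char) (0x80 | (ch & 0x3F));
        }
        in++;
    }
    *srcReadPtr = in;       /* source bytes consumed */
    *dstWrotePtr = out;     /* destination bytes stored */
    *dstCharsPtr = in;      /* one character per ISO 8859-1 byte */
    return result;
}
```

       Because the high-level functions always pass non-NULL counter locations to the callback, a converter written this way can store
       into them unconditionally.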

       The callback procedure freeProc, if non-NULL, should match the type Tcl_EncodingFreeProc:
	      typedef void Tcl_EncodingFreeProc(
		ClientData clientData);

       This freeProc function is called when the encoding is deleted.  The clientData parameter is the same as the clientData field  specified	to
       Tcl_CreateEncoding when the encoding was created.

       Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir get and set the directory to use when locating the default encoding files.  If
       this value is not NULL, the TclpInitLibraryPath routine places the path at the head of the search path, so that it is the first place
       searched when trying to locate the encoding file.

ENCODING FILES

       Space  would prohibit precompiling into Tcl every possible encoding algorithm, so many encodings are stored on disk as dynamically-loadable
       encoding files.	This behavior also allows the user to create additional encoding files that can be loaded using the same mechanism.  These
       encoding  files	contain  information  about  the tables and/or escape sequences used to map between an external encoding and Unicode.  The
       external encoding may consist of single-byte, multi-byte, or double-byte characters.

       Each dynamically-loadable encoding is represented as a text file.  The initial line of the file, beginning with a ``#'' symbol, is  a  com-
       ment  that  provides  a	human-readable description of the file.  The next line identifies the type of encoding file.  It can be one of the
       following letters:

       [1]   S
	      A single-byte encoding, where one character is always one byte long in the encoding.  An example is iso8859-1, used by many European
	      languages.

       [2]   D
	      A double-byte encoding, where one character is always two bytes long in the encoding.  An example is big5, used for Chinese text.

       [3]   M
	      A  multi-byte  encoding,	where  one character may be either one or two bytes long.  Certain bytes are lead bytes, indicating that
	      another byte must follow and that together the two bytes represent one character.  Other bytes are  not  lead  bytes  and  represent
	      themselves.  An example is shiftjis, used by many Japanese computers.

       [4]   E
	      An  escape-sequence encoding, specifying that certain sequences of bytes do not represent characters, but commands that describe how
	      following bytes should be interpreted.

       The rest of the lines in the file depend on the type.
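
       A reader of such a file would typically branch on the type letter before interpreting the remaining lines.  The sketch below shows
       that dispatch in isolation.  The enum and function names are hypothetical and do not correspond to anything in the Tcl sources.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ENC_SINGLE,     /* S: single-byte */
    ENC_DOUBLE,     /* D: double-byte */
    ENC_MULTI,      /* M: multi-byte  */
    ENC_ESCAPE,     /* E: escape-sequence */
    ENC_UNKNOWN
} EncFileType;

/* Classifies the second line of an encoding file by its first letter. */
static EncFileType ClassifyTypeLine(const char *line)
{
    assert(line != NULL);
    switch (line[0]) {
    case 'S': return ENC_SINGLE;
    case 'D': return ENC_DOUBLE;
    case 'M': return ENC_MULTI;
    case 'E': return ENC_ESCAPE;
    default:  return ENC_UNKNOWN;
    }
}
```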

       Cases [1], [2], and [3] are collectively referred to as table-based encoding files.  The lines in a table-based encoding file  are  in  the
       same format as this example taken from the shiftjis encoding (this is not the complete file):
	      # Encoding file: shiftjis, multi-byte
	      M
	      003F 0 40
	      00
	      0000000100020003000400050006000700080009000A000B000C000D000E000F
	      0010001100120013001400150016001700180019001A001B001C001D001E001F
	      0020002100220023002400250026002700280029002A002B002C002D002E002F
	      0030003100320033003400350036003700380039003A003B003C003D003E003F
	      0040004100420043004400450046004700480049004A004B004C004D004E004F
	      0050005100520053005400550056005700580059005A005B005C005D005E005F
	      0060006100620063006400650066006700680069006A006B006C006D006E006F
	      0070007100720073007400750076007700780079007A007B007C007D203E007F
	      0080000000000000000000000000000000000000000000000000000000000000
	      0000000000000000000000000000000000000000000000000000000000000000
	      0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
	      FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
	      FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
	      FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
	      0000000000000000000000000000000000000000000000000000000000000000
	      0000000000000000000000000000000000000000000000000000000000000000
	      81
	      0000000000000000000000000000000000000000000000000000000000000000
	      0000000000000000000000000000000000000000000000000000000000000000
	      0000000000000000000000000000000000000000000000000000000000000000
	      0000000000000000000000000000000000000000000000000000000000000000
	      300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
	      FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
	      301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
	      FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
	      00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
	      FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
	      25A125A025B325B225BD25BC203B301221922190219121933013000000000000
	      000000000000000000000000000000002208220B2286228722822283222A2229
	      000000000000000000000000000000002227222800AC21D221D4220022030000
	      0000000000000000000000000000000000000000222022A52312220222072261
	      2252226A226B221A223D221D2235222B222C0000000000000000000000000000
	      212B2030266F266D266A2020202100B6000000000000000025EF000000000000

       The  third line of the file is three numbers.  The first number is the fallback character (in base 16) to use when converting from UTF-8 to
       this encoding.  The second number is a 1 if this file represents the encoding for a symbol font, or 0 otherwise.  The last number (in  base
       10) is how many pages of data follow.
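
       The mixed number bases on that line are easy to get wrong, so a small parsing sketch may help.  TableHeader and ParseTableHeader are
       hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    unsigned int fallback;  /* fallback character, written in base 16 */
    int symbol;             /* 1 for a symbol font, 0 otherwise */
    int numPages;           /* page count, written in base 10 */
} TableHeader;

/* Parses the third line of a table-based encoding file.
 * Returns 1 on success, 0 if the line does not hold three numbers. */
static int ParseTableHeader(const char *line, TableHeader *hdr)
{
    assert(line != NULL && hdr != NULL);
    return sscanf(line, "%x %d %d",
                  &hdr->fallback, &hdr->symbol, &hdr->numPages) == 3;
}
```

       Applied to the line "003F 0 40" from the shiftjis example, this yields a fallback of 0x3F (the question mark), a symbol flag of 0,
       and 40 pages.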

       Subsequent  lines  in the example above are pages that describe how to map from the encoding into 2-byte Unicode.  The first line in a page
       identifies the page number.  Following it are 256 double-byte numbers, arranged as 16 rows of 16 numbers.  Given a character in the  encod-
       ing,  the high byte of that character is used to select which page, and the low byte of that character is used as an index to select one of
       the double-byte numbers in that page - the value obtained being the corresponding Unicode character.  By examination of the example  above,
       one can see that the characters 0x7E and 0x8163 in shiftjis map to 203E and 2026 in Unicode, respectively.
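
       The lookup just described can be sketched directly on the text form of a page.  PageLookup is a hypothetical helper; it assumes the
       page has already been selected by the high byte and is available as 16 strings of 64 hexadecimal digits each.

```c
#include <assert.h>
#include <stdio.h>

/* rows holds the 16 text rows of one page.  The low byte of the character
 * selects a row (high nibble) and a 4-digit cell in that row (low nibble). */
static unsigned int PageLookup(const char *const rows[], unsigned char lowByte)
{
    unsigned int value = 0;
    const char *cell = rows[lowByte >> 4] + 4 * (lowByte & 0x0F);
    int got = sscanf(cell, "%4x", &value);
    assert(got == 1);       /* each cell must be four hex digits */
    (void) got;
    return value;
}
```

       With row 7 of page 00 from the example, PageLookup(rows, 0x7E) gives 0x203E; with row 6 of page 81, PageLookup(rows, 0x63) gives
       0x2026, matching the two mappings cited above.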

       Following the first page will be all the other pages, each in the same format as the first: one number identifying the page followed by 256
       double-byte Unicode characters.	If a character in the encoding maps to the Unicode character 0000, it means  that  the	character  doesn't
       actually exist.	If all characters on a page would map to 0000, that page can be omitted.

       Case  [4]  is  the  escape-sequence encoding file.  The lines in this type of file are in the same format as this example taken from the
       iso2022-jp encoding:
	      # Encoding file: iso2022-jp, escape-driven
	      E
	      init	     {}
	      final	     {}
	      iso8859-1      \x1b(B
	      jis0201        \x1b(J
	      jis0208        \x1b$@
	      jis0208        \x1b$B
	      jis0212        \x1b$(D
	      gb2312         \x1b$A
	      ksc5601        \x1b$(C

       In the file, the first column represents an option and the second column is the associated value.  init is  a  string  to  emit	or  expect
       before  the  first character is converted, while final is a string to emit or expect after the last character.  All other options are names
       of table-based encodings; the associated value is the escape-sequence that marks that encoding.	Tcl syntax is used for the values; in  the
       above example, for instance, ``{}'' represents the empty string and ``\x1b'' represents character 27.
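
       In Tcl backslash syntax, character 27 (ESC) is written as a backslash followed by x1b.  The sketch below decodes only that one form
       of escape, which is enough for the values in the example.  DecodeHexEscapes is a hypothetical helper, it reads at most two
       hexadecimal digits per escape (an assumption), and it does not handle braces or any other Tcl substitution.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Copies in to out, replacing each "\xH" or "\xHH" escape with the byte it
 * names.  Stops at the end of in or after outLen bytes.  Returns the number
 * of bytes written; the result is not null-terminated. */
static int DecodeHexEscapes(const char *in, char *out, int outLen)
{
    assert(in != NULL && out != NULL);
    int n = 0;
    while (*in != '\0' && n < outLen) {
        if (in[0] == '\\' && in[1] == 'x' && isxdigit((unsigned char) in[2])) {
            char hex[3] = { in[2], '\0', '\0' };
            in += 3;
            if (isxdigit((unsigned char) *in)) {
                hex[1] = *in++;             /* optional second digit */
            }
            out[n++] = (char) strtol(hex, NULL, 16);
        } else {
            out[n++] = *in++;               /* ordinary byte */
        }
    }
    return n;
}
```

       Decoding the value given for iso8859-1 in the example produces the three bytes 27, '(' and 'B'.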

       When  Tcl_GetEncoding  encounters  an encoding name that has not been loaded, it attempts to load an encoding file called name.enc from the
       encoding subdirectory of each directory specified in the library path $tcl_libPath.  If the encoding file  exists,  but	is  malformed,	an
       error message will be left in interp.
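
       The file name that is tried for one directory of the library path can be sketched as follows.  EncodingFilePath is a hypothetical
       helper, and the use of "/" as the separator is an assumption that suits Unix-style paths only.

```c
#include <assert.h>
#include <stdio.h>

/* Builds "<libDir>/encoding/<name>.enc" in buf.  Returns the length of
 * the path, or -1 if it does not fit in bufLen bytes. */
static int EncodingFilePath(char *buf, size_t bufLen,
                            const char *libDir, const char *name)
{
    assert(buf != NULL && libDir != NULL && name != NULL);
    int n = snprintf(buf, bufLen, "%s/encoding/%s.enc", libDir, name);
    if (n < 0 || (size_t) n >= bufLen) {
        return -1;          /* formatting error or truncated */
    }
    return n;
}
```

       A caller would try this for each directory in the library path in turn and stop at the first file that exists.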

KEYWORDS

       utf, encoding, convert

Tcl									8.1							Tcl_GetEncoding(3)