XML::UM(3) User Contributed Perl Documentation XML::UM(3)
NAME
XML::UM - Convert UTF-8 strings to any encoding supported by XML::Encoding
SYNOPSIS
use XML::UM;
# Set the directory holding the .xml files that come with the XML::Encoding distribution
# Always include the trailing slash!
$XML::UM::ENCDIR = '/home1/enno/perlModules/XML-Encoding-1.01/maps/';
# Create the encoding routine
my $encode = XML::UM::get_encode (
Encoding => 'ISO-8859-2',
EncodeUnmapped => \&XML::UM::encode_unmapped_dec);
# Convert a string from UTF-8 to the specified Encoding
my $encoded_str = $encode->($utf8_str);
# Remove circular references for garbage collection
XML::UM::dispose_encoding ('ISO-8859-2');
DESCRIPTION
This module provides methods to convert UTF-8 strings to any XML encoding that XML::Encoding supports. It creates mapping routines from the
.xml files found in the maps/ directory of the XML::Encoding distribution. Note that the XML::Encoding distribution installs the .enc
files in your perl directory, but not the .xml files they were created from. That is why you have to set $XML::UM::ENCDIR as shown in
the SYNOPSIS.
This implementation uses the XML::Encoding class to parse the .xml file and creates a hash that maps UTF-8 characters (each consisting of
up to 4 bytes) to their equivalent byte sequence in the specified encoding. Note that large mappings may consume a lot of memory!
Future implementations may parse the .enc files directly, or do the conversions entirely in XS (i.e. C code.)
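The hash-based approach described above can be sketched in a few lines. This is only an illustration, not the module's actual code: the table %utf8_to_latin2 is a hypothetical, hand-written excerpt (three ISO-8859-2 characters) standing in for what would be built from a .xml map file, and encode_with_table is a made-up helper name. Unmapped sequences are replaced with '?' here purely to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a mapping table: keys are UTF-8 byte sequences,
# values are the corresponding ISO-8859-2 bytes.
my %utf8_to_latin2 = (
    "\xC4\x84" => "\xA1",    # U+0104 LATIN CAPITAL LETTER A WITH OGONEK
    "\xC5\x81" => "\xA3",    # U+0141 LATIN CAPITAL LETTER L WITH STROKE
    "\xC5\xBD" => "\xAE",    # U+017D LATIN CAPITAL LETTER Z WITH CARON
);

# Replace each multi-byte UTF-8 sequence (lead byte plus continuation
# bytes) by its table entry; plain ASCII bytes pass through untouched.
sub encode_with_table {
    my ($bytes, $table) = @_;
    $bytes =~ s/([\xC0-\xF7][\x80-\xBF]+)/exists $table->{$1} ? $table->{$1} : '?'/ge;
    return $bytes;
}

my $latin2 = encode_with_table("\xC5\x81a\xC4\x84", \%utf8_to_latin2);
```

Because every mapped character needs its own hash entry, a table for a large encoding grows quickly, which is the memory cost mentioned above.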
get_encode (Encoding => STRING, EncodeUnmapped => SUB)
The central entry point to this module is the XML::UM::get_encode() method. It forwards the call to the global $XML::UM::FACTORY, which is
defined as an instance of XML::UM::SlowMapperFactory by default. Override this variable to plug in your own mapper factory.
The XML::UM::SlowMapperFactory creates an instance of XML::UM::SlowMapper (and caches it for subsequent use) that reads in the .xml
encoding file and creates a hash that maps UTF-8 characters to encoded characters.
Finally, the get_encode() method of XML::UM::SlowMapper is called, which generates an anonymous subroutine that uses the hash to convert
multi-byte UTF-8 sequences to the proper encoding.
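The factory, cache, and generated subroutine described above might fit together roughly as follows. Everything in this sketch is an assumption made for illustration: the package name My::MapperFactory, the _load_map stand-in (which returns a one-entry table instead of parsing a .xml file), and the choice to pass the raw UTF-8 byte sequence to the EncodeUnmapped callback. The interface that XML::UM actually expects from an object assigned to $XML::UM::FACTORY is not spelled out here, so check XML/UM.pm before plugging in a factory of your own.

```perl
use strict;
use warnings;

package My::MapperFactory;    # hypothetical name, not part of XML::UM

sub new {
    my ($class) = @_;
    return bless { cache => {}, loads => 0 }, $class;
}

# Stand-in for parsing a maps/*.xml file; returns a tiny hard-coded table.
sub _load_map {
    my ($self, $name) = @_;
    $self->{loads}++;
    return { "\xC5\x81" => "\xA3" };    # U+0141 in ISO-8859-2
}

sub get_encode {
    my ($self, %args) = @_;
    my $name     = $args{Encoding};
    my $unmapped = $args{EncodeUnmapped} || sub { '?' };

    # Build the table once per encoding, then reuse the cached copy.
    my $map = $self->{cache}{$name} ||= $self->_load_map($name);

    # The returned closure keeps $map alive for as long as the caller holds it.
    return sub {
        my ($bytes) = @_;
        $bytes =~ s/([\xC0-\xF7][\x80-\xBF]+)/exists $map->{$1} ? $map->{$1} : $unmapped->($1)/ge;
        return $bytes;
    };
}

package main;

my $factory = My::MapperFactory->new;
my $encode  = $factory->get_encode(Encoding => 'ISO-8859-2');
my $again   = $factory->get_encode(Encoding => 'ISO-8859-2');    # served from the cache
```

The second get_encode call does not rebuild the table; both closures share the same cached hash.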
dispose_encoding ($encoding_name)
Call this to free the memory used by the SlowMapper for a specific encoding. Note that in order to free the big conversion hash, the user
should no longer have references to the subroutines generated by get_encode().
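Why the generated subroutines must be dropped as well can be shown with plain Perl reference counting. The sketch below does not use XML::UM at all: %cache, make_encoder, and the one-entry table are hypothetical stand-ins, and deleting the cache entry merely plays the role of dispose_encoding. A weak reference from the core Scalar::Util module is used only to observe when the table is really freed.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical cache holding one conversion table.
my %cache = ( 'ISO-8859-2' => { "\xC5\x81" => "\xA3" } );

# Returns a closure that captures a strong reference to the table.
sub make_encoder {
    my ($name) = @_;
    my $map = $cache{$name};
    return sub { my ($seq) = @_; return $map->{$seq} };
}

my $encode = make_encoder('ISO-8859-2');

# Weak reference: it does not keep the table alive, it only lets us watch it.
my $probe = $cache{'ISO-8859-2'};
weaken($probe);

delete $cache{'ISO-8859-2'};                         # plays the role of dispose_encoding
my $alive_after_dispose = defined $probe ? 1 : 0;    # closure still holds the table

undef $encode;                                       # drop the last strong reference
my $alive_after_undef = defined $probe ? 1 : 0;      # table is gone now
```

In this sketch the table survives the delete and disappears only once $encode is undefined, which mirrors the advice above.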
The parameters to the get_encode() method (defined as name/value pairs) are:
o Encoding
The name of the desired encoding, e.g. 'ISO-8859-2'
o EncodeUnmapped (Default: \&XML::UM::encode_unmapped_dec)
Defines how Unicode characters not found in the mapping file (of the specified encoding) are printed. By default, they are converted
to decimal character references, like '&#123;'
Use \&XML::UM::encode_unmapped_hex for hexadecimal character references, like '&#xAB;'
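The two output styles for unmapped characters can be illustrated with a pair of small formatting routines. These are hypothetical stand-ins, not the module's own subroutines: the names unmapped_dec, unmapped_hex, and utf8_to_codepoint are invented here, and the sketch assumes the formatter is handed the Unicode code point as a number. Check XML/UM.pm for the arguments the real EncodeUnmapped callback receives before writing a custom one.

```perl
use strict;
use warnings;

# Decimal character reference for a code point, e.g. 123 -> '&#123;'
sub unmapped_dec {
    my ($cp) = @_;
    return sprintf('&#%d;', $cp);
}

# Hexadecimal character reference for a code point, e.g. 171 -> '&#xAB;'
sub unmapped_hex {
    my ($cp) = @_;
    return sprintf('&#x%X;', $cp);
}

# Helper (an assumption of this sketch): turn one UTF-8 byte sequence
# into its code point using the built-in utf8::decode.
sub utf8_to_codepoint {
    my ($bytes) = @_;
    my $char = $bytes;
    utf8::decode($char);
    return ord($char);
}

my $dec = unmapped_dec(utf8_to_codepoint("{"));           # U+007B
my $hex = unmapped_hex(utf8_to_codepoint("\xC2\xAB"));    # U+00AB
```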
CAVEATS
I'm not exactly sure about which Unicode characters in the range (0 .. 127) should be mapped to themselves. See comments in XML/UM.pm near
%DEFAULT_ASCII_MAPPINGS.
The encodings that expat supports by default (e.g. UTF-16, ISO-8859-1) are currently not supported, because there are no .enc files
available for these encodings. This module needs some more work. If you have the time, please help!
AUTHOR
Send bug reports, hints, tips, suggestions to Enno Derksen at <enno@att.com>.
perl v5.8.0 2000-02-17 XML::UM(3)