Encoding Conversion

Data encoding compatibility problems are among the most common difficulties encountered by programmers new to XML in general and libxml in particular. Thinking through the design of your application in light of this issue will help avoid difficulties later. Internally, libxml stores and manipulates data in UTF-8. Data that your program holds in other encodings, such as the commonly used ISO-8859-1, must be converted to UTF-8 before it is passed to libxml functions. If you want your program's output in an encoding other than UTF-8, you must convert that as well.

Libxml uses iconv, if it is available, to convert data. Without iconv, only UTF-8, UTF-16 and ISO-8859-1 can be used as external formats. With iconv, any format can be used provided iconv is able to convert it to and from UTF-8. Currently iconv supports about 150 different character formats and can convert between any pair of them. The exact set of supported formats varies between implementations, but nearly every implementation covers the encodings in common use.
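
As a rough illustration of the kind of work iconv performs, the following sketch calls the POSIX iconv interface directly to convert an ISO-8859-1 string to UTF-8. The helper name latin1_to_utf8_iconv is hypothetical and is not part of libxml or iconv, and the encoding names accepted by iconv_open may differ between implementations.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to UTF-8
 * by calling iconv directly. Returns the number of bytes written to out
 * (not counting the terminating NUL), or -1 on failure. */
long latin1_to_utf8_iconv(const char *in, char *out, size_t out_cap)
{
    iconv_t cd;
    char *src = (char *)in;     /* iconv takes non-const pointers */
    char *dst = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_cap == 0)
        return -1;
    out_left = out_cap - 1;     /* reserve room for the terminating NUL */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return -1;              /* this conversion is not supported here */

    rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;              /* for example, output buffer too small */

    *dst = '\0';
    return (long)(dst - out);
}
```

A caller would pass a buffer of at least twice the input length plus one byte, since each ISO-8859-1 byte occupies at most two bytes in UTF-8.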

Warning

A common mistake is to use different formats for the internal data in different parts of one's code. The most common case is an application that assumes ISO-8859-1 to be the internal data format, combined with libxml, which assumes UTF-8 to be the internal data format. The result is an application that treats internal data differently depending on which section of code is executing. One part of the code or the other will then inevitably misinterpret the data.
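
One way to catch this kind of mix-up early is to check, at the point where data is handed to libxml, that it is at least structurally valid UTF-8. The sketch below is a simplified check written for illustration; the function name looks_like_utf8 is hypothetical, and the check does not reject overlong forms or surrogate code points.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are structurally valid UTF-8, else 0.
 * Simplified: only lead-byte and continuation-byte structure is checked. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    size_t k;

    while (i < len) {
        unsigned char c = s[i];
        size_t follow;          /* number of continuation bytes expected */

        if (c < 0x80)
            follow = 0;         /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            follow = 1;         /* two-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            follow = 2;         /* three-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            follow = 3;         /* four-byte sequence */
        else
            return 0;           /* stray continuation or invalid lead byte */

        if (follow > len - i - 1)
            return 0;           /* sequence is cut off by the end of input */
        for (k = 1; k <= follow; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* expected a continuation byte */
        i += follow + 1;
    }
    return 1;
}
```

Typical ISO-8859-1 text containing accented characters fails this check, which makes the mistake visible. Some ISO-8859-1 byte sequences do happen to be valid UTF-8, so passing the check does not prove that the data was UTF-8 to begin with.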

This example constructs a simple document, then adds content provided at the command line to the document's root element and outputs the results to stdout in the proper encoding. For this example, we use ISO-8859-1 encoding. The encoding of the string input at the command line is converted from ISO-8859-1 to UTF-8. Full code: Appendix H, Code for Encoding Conversion Example

The conversion, encapsulated in the example code in the convert function, uses libxml's xmlFindCharEncodingHandler function:

	xmlCharEncodingHandlerPtr handler;              /* 1 */
	size = (int)strlen(in)+1;                       /* 2 */
	out_size = size*2-1;
	out = malloc((size_t)out_size);

…
	handler = xmlFindCharEncodingHandler(encoding); /* 3 */
…
	handler->input(out, &out_size, in, &temp);      /* 4 */
…
	xmlSaveFormatFileEnc("-", doc, encoding, 1);    /* 5 */
      

1

handler is declared as a pointer to an xmlCharEncodingHandler structure, which holds the conversion functions for one encoding.

2

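The handler's conversion function needs to be given the sizes of the input and output buffers, which are calculated here for the strings in and out. Because each ISO-8859-1 byte becomes at most two bytes in UTF-8, an output buffer of size*2-1 bytes is always large enough for the converted string and its terminator.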

3

xmlFindCharEncodingHandler takes as its argument the name of the data's initial encoding and searches libxml's set of registered conversion handlers, returning a pointer to the matching handler or NULL if none is found.

4

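The conversion function referenced by handler->input takes as its arguments pointers to the output and input buffers, each followed by a pointer to its length. The lengths must be determined separately by the application. On return, they are updated to the number of bytes written and the number of bytes consumed.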

5

To output in a specified encoding rather than UTF-8, we use xmlSaveFormatFileEnc, specifying the encoding. The file name "-" sends the output to stdout.
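
The buffer sizing and length handling described in the notes above can be seen in isolation in the following sketch. It is a hypothetical stand-in written for illustration, not libxml's implementation: the function name latin1_to_utf8 and its return convention are assumptions, although the argument order mirrors the handler->input call shown earlier.

```c
#include <stddef.h>

/* Hypothetical stand-in for an ISO-8859-1 to UTF-8 conversion routine.
 * On entry, *outlen is the capacity of out and *inlen is the number of
 * input bytes. On return, *outlen is the number of bytes written and
 * *inlen is the number of input bytes consumed. Returns 0 if all input
 * was converted, or -1 if out filled up first. */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int i = 0;                  /* input bytes consumed so far */
    int o = 0;                  /* output bytes written so far */
    int done;

    while (i < *inlen) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;

        if (o + need > *outlen)
            break;              /* not enough room for this character */
        if (c < 0x80) {
            out[o++] = c;       /* ASCII is copied unchanged */
        } else {
            /* 0x80..0xFF become a two-byte UTF-8 sequence */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }

    done = (i == *inlen);
    *inlen = i;
    *outlen = o;
    return done ? 0 : -1;
}
```

With size and out_size computed as in the listing, an input of size-1 characters produces at most (size-1)*2 bytes, which leaves one byte of the size*2-1 buffer free for the terminating NUL.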