Writing Java to work all over the world can be a minefield for someone used to plain old English characters, so here is a quick start guide to Unicode and character encoding in Java.

Unicode, XML & Java

Using XML

The subject of Unicode and Java character encoding is extensive, so rather than ramble on about internationalisation (i18n) in general, this article focuses on the XML part of the subject.

Imagine that you are developing a system that needs to manipulate XML (perhaps you don't need to imagine). You need to read and write elements in Java, with the added complication that you must not break the document's encoding. It could be something as simple as a £ currency symbol that trips your system up, so how do you avoid the character mangling? Read on...


Basic Encoding

This article is written in XML and then processed through a Java template engine to produce the page you see now. Writing the above paragraph required a number of changes to the original document and to the template being used. In a standards-compliant browser you should have seen the UK currency symbol, the pound sign. The required changes are summarised in the following points.

The basics are:

  • <?xml version="1.0" encoding="your encoding" ?>
  • Put it as the first line in every XML document you produce. Most XML utility libraries will provide you with a way to add the encoding to your document. It is your responsibility as the author of a document to set the encoding correctly. This page uses UTF-16 because it might contain characters from a variety of sets.
  • Use the character codes for literals.
  • Your text editor can probably cope with Unicode characters, but don't rely on that to get the right literal into your hand-written documents. For example, writing this in Eclipse I can set the editor charset to ISO-8859-1 (General Western Europe) and quite happily type £10 into the editor. Using vi I might not be able to (depending on flavour). To avoid situations where an editor accidentally re-encodes your special characters, use XML character references like &#163;.
    Similarly, for literals outside XML documents, use \uXXXX escapes (for example \u00A3 for the pound sign) to make it clear that the character is Unicode. This is particularly true if you need to use resource bundles with special characters in them.
  • Use the facilities of the Character class.
  • It provides an effective set of character classification methods. Do you know what constitutes a capital letter in every character set?
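As a rough illustration of the last two points, the sketch below writes the pound sign as a Unicode escape and counts capital letters with the Character class. The class name CharacterBasics and the helper countUpperCase are made up for this example; only the Character methods come from the JDK.

```java
public class CharacterBasics {
    // The pound sign written as a Unicode escape, so the editor charset cannot change it.
    static final String PRICE = "\u00A310";

    // Counts upper-case letters by code point, so characters outside the
    // Basic Multilingual Plane are classified correctly as well.
    static int countUpperCase(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isUpperCase(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        // One pound sign plus two digits.
        System.out.println(PRICE.length()); // prints 3

        // Accented Latin and Greek capitals count, as well as plain A, B, C.
        System.out.println(countUpperCase("\u00C9cole \u0394elta ABC")); // prints 5
    }
}
```

Letting Character decide what is upper case avoids hand-written range checks such as 'A' to 'Z', which only hold for ASCII.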

Most XML processing utilities handle Unicode properly. Provided that your documents declare their encoding correctly, there is normally no issue.
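A minimal sketch of that behaviour, assuming the standard JAXP parser that ships with the JDK: the parser is handed raw bytes and works out the encoding from the XML declaration itself. The class XmlEncodingDemo and the helper readText are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    // Parses raw XML bytes and returns the text of the root element.
    // The parser reads the encoding from the XML declaration, not from us.
    static String readText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Kept deliberately simple for a sketch.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Document encoded as ISO-8859-1 and declared as such.
        String latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>\u00A310</price>";
        byte[] latin1Bytes = latin1Xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(latin1Bytes).equals("\u00A310")); // prints true

        // Document that uses a character reference, so the bytes are plain ASCII.
        String refXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><price>&#163;10</price>";
        byte[] refBytes = refXml.getBytes(StandardCharsets.UTF_8);
        System.out.println(readText(refBytes).equals("\u00A310")); // prints true
    }
}
```

Passing an InputStream rather than a Reader leaves encoding detection to the parser, which is why the declaration has to match the bytes actually written.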


The Golden Rule

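Don't get bitten by using byte-based streams/readers/arrays.
A byte holds only eight bits, too few for most Unicode characters, so every conversion between characters and bytes goes through a character encoding. If you don't name one, Java uses the default encoding, which may not match the encoding your data is actually in.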
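To make the point about bytes concrete, here is a small sketch showing that the same pound sign turns into different bytes depending on the encoding chosen. ByteLengthDemo and encodedLength are illustrative names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    // Returns how many bytes the text occupies in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String pound = "\u00A3";

        // The pound sign needs two bytes in UTF-8 ...
        System.out.println(encodedLength(pound, StandardCharsets.UTF_8));      // prints 2
        // ... but only one byte in ISO-8859-1.
        System.out.println(encodedLength(pound, StandardCharsets.ISO_8859_1)); // prints 1

        // US-ASCII cannot represent it at all, so getBytes substitutes '?'.
        byte[] ascii = pound.getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) ascii[0]); // prints ?
    }
}
```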

The most common coding errors are things like this:

// DO NOT COPY THIS BAD EXAMPLE
String myStringWithUnicode = "Please donate £10";
OutputStream out = new FileOutputStream(file);
// getBytes() with no argument silently uses the default encoding
out.write(myStringWithUnicode.getBytes());
To understand what is wrong with this you need to understand what Unicode is. Unicode is a way of representing a large number of different characters in a consistent manner, and you can't fit all those character numbers into a single byte as we are used to with ASCII. So if you manipulate a Unicode string as above, its characters are chopped up into bytes and may later be re-assembled under a different encoding, producing something like:
Please donate Â£10

In this case the author should have used a class like CharArrayWriter to direct the stream.
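The sketch below shows one way this could look. CharArrayWriter keeps the text as characters throughout; where bytes are unavoidable, wrapping the stream in an OutputStreamWriter with an explicitly named charset is a common alternative (an addition here, not something the example above prescribes). SafeWriteDemo and encode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SafeWriteDemo {
    // Encodes text to bytes through a Writer with an explicitly named charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String message = "Please donate \u00A310";

        // Staying in characters: nothing is encoded, so nothing can be mangled.
        CharArrayWriter chars = new CharArrayWriter();
        chars.append(message);
        System.out.println(chars.toString().equals(message)); // prints true

        // Going to bytes: name the charset on the way out and on the way back in.
        byte[] bytes = encode(message, StandardCharsets.UTF_8);
        String roundTrip = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(message)); // prints true

        // Decoding the same bytes with the wrong charset reproduces the mangling:
        // the two UTF-8 bytes of the pound sign become two separate characters.
        String mangled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.equals("Please donate \u00C2\u00A310")); // prints true
    }
}
```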


Copyright © Hugh Reid, Creative Commons License
This work is licensed under a Creative Commons License.