The subject of
Imagine you are developing a system that needs to manipulate XML; perhaps you don't need to imagine. You need to read and write elements in Java, with one added complication: you must not break the document's encoding. Something as simple as a £ currency symbol can trip your system up, so how do you avoid the character mangling? Read on...
This article is written in XML; it is then processed through a
Java
The basics are:
Most XML processing utilities handle Unicode properly. Provided that your documents correctly declare their encoding, there is normally no issue.
Don't get bitten by byte-based streams and byte arrays. A byte holds only 8 bits, so it cannot represent every Unicode character on its own. Bytes only become text through an encoding, and if you rely on the platform default encoding instead of naming one, characters outside ASCII can be corrupted.
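To illustrate the first point, here is a minimal sketch, assuming the JDK's built-in JAXP parser. The class name `DeclaredEncodingDemo` and the `<donation>` element are hypothetical, made up for this example. The document is encoded with the same charset that its XML declaration names, and the parser is handed raw bytes so that it can pick the decoder itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the names are illustrative only.
final class DeclaredEncodingDemo {

    // Builds a tiny XML document whose declaration names the given charset,
    // encodes it with that same charset, and parses it back from raw bytes.
    static String roundTrip(Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<donation>\u00A310</donation>"; // \u00A3 is the pound sign
        byte[] bytes = xml.getBytes(charset);
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Given an InputStream, the parser reads the declaration
            // and chooses the matching decoder on its own.
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Both round trips should recover the same text.
        System.out.println(roundTrip(StandardCharsets.UTF_8).equals("\u00A310"));
        System.out.println(roundTrip(StandardCharsets.ISO_8859_1).equals("\u00A310"));
    }
}
```

The pound sign is written as the escape `\u00A3` so that the example does not depend on the encoding of the source file itself.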
The most common coding errors are things like this:
    // DO NOT COPY THIS BAD EXAMPLE
    String myStringWithUnicode = "Please donate £10";
    OutputStream out = new ByteArrayOutputStream();
    out.write(myStringWithUnicode.getBytes()); // encodes with the platform default charset

To understand what is wrong with this you need to understand what Unicode is. Unicode is a way of representing a large number of different characters in a consistent manner, and you can't fit all those character numbers into a single byte as we are used to with ASCII. Calling getBytes() with no argument encodes the string using whatever the platform default encoding happens to be, so a single character such as £ may become several bytes. If those bytes are later re-assembled into characters under a different encoding, the Unicode characters are chopped up and put back together wrongly, producing something like:
Please donate Â£10
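One way this kind of mangling can arise is sketched below. The sketch assumes the bytes were written as UTF-8 and later read back as ISO-8859-1; the actual pair of encodings on any given system may differ. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class reproducing one common form of character mangling.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then wrongly decodes those bytes as ISO-8859-1.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Please donate \u00A310"; // \u00A3 is the pound sign
        String garbled = garble(original);
        // The pound sign took two bytes in UTF-8, and each byte was then
        // decoded as a separate character, so the string grows by one.
        System.out.println(original.length() + " -> " + garbled.length());
        // Print code points instead of glyphs to stay independent of the console encoding.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

Under these assumptions the single character U+00A3 comes back as the two characters U+00C2 and U+00A3, which is the extra "Â" that appears in front of the pound sign.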
In this case the author should have used a character-based class such as CharArrayWriter to receive the output, so that the text is never converted to bytes under an implicit encoding.
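A minimal sketch of that safer approach follows. The class and method names are hypothetical, and the choice of UTF-8 for the byte-producing variant is an assumption; use whichever encoding your XML declaration names. The first method keeps the text as characters throughout. The second shows what to do when bytes are unavoidable: wrap the stream in a Writer with an explicitly named charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing two ways to avoid an implicit encoding.
final class SafeWritingDemo {

    // Keeps the text as characters end to end: no byte conversion takes place.
    static String viaCharArrayWriter(String text) {
        CharArrayWriter out = new CharArrayWriter();
        out.write(text, 0, text.length());
        return out.toString();
    }

    // When bytes are required, name the charset instead of relying on the default.
    static byte[] toUtf8Bytes(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "Please donate \u00A310"; // \u00A3 is the pound sign
        System.out.println(viaCharArrayWriter(text).equals(text));
        // Decoding with the same named charset recovers the original text.
        String decoded = new String(toUtf8Bytes(text), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text));
    }
}
```

The same principle applies to `String.getBytes`: the overload that takes a `Charset` argument removes the dependence on the platform default.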