Monday, October 19, 2015

A collection of fun everyday software internationalization issues

The following is a collection of snappy headlines with examples of common software internationalization problems.

The past twenty years in two paragraphs
On the positive side, programming languages, libraries and frameworks have become better at doing internationalization. Translators are better trained and use better tools. For some time, I was responsible for some of the latter, including the translation memory tool you know as SDL X. And our friends in England published a short article I wrote entitled “A DIY translation manager”, which described how to emulate the MS Word user interface of TRADOS with a simple MS Access database backend. XML has come along and taken much of the bite out of code pages and character sets. And Unicode is finally unifying the world.

On the negative side, XML has come along and, together with greater use of scripting and the internet, added to what I call the “layer problem”. More than ever, user interface strings can and will be used across different software components that do not handle strings in the same way. You may end up with a broken character in a database, you may see a square box or question marks on a web page or in a command window. You may get insidious runtime errors because of a bad use of single or double quotes in javascript. And, as much as I hate to say this, documentation has gotten worse even as it has improved. Automation of documentation for C# and Java, for example, is now largely in the hands of individual developers, and that means very different quality even within small components. Last but not least, there is the proliferation of devices, from mobile phones to medical devices large and small, various pads to point of sale terminals and radio frequency identification chips.

While many people continue to argue about the correct definitions of internationalization and localization or expect erudite explanations of character set, encoded character set and font, I personally focus on “transition”. Every single time a piece of text (a string) is handed off to some other component, there is the potential for a problem (a bug). Once you think in these terms, much of the work becomes easier. The concept that much of internationalization is about transition makes the need for good pseudo-translation pretty obvious. Just as it makes good QA obvious.

This way or that way (bidi)
The good news is that ADF and other packages will take care of bi-directional display for you. The bad news is: challenges remain.
But there is an excellent tutorial by the W3C folks at Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

you can quote me on this
It is probably fair to say that the cumulative cost of single and double ASCII quotes over the past 30 years or so is on the order of millions of dollars across the software industry.
From \'escapes\' that do not survive the translation process, to a French year 2000 (l'an 2000) that becomes lan 2000 in the ui, and on to javascript errors that surface when a customer hits a page - ASCII quotes (U+00xx) continue to surprise.
There is a very simple solution for web pages: use smart quotes (U+2018 through U+201D). See ASCII and Unicode quotation marks.
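As a sketch of what that replacement could look like in ui code (the class and method names here are made up, and the pairing heuristic is deliberately simple):

```java
// Sketch: replace ASCII quotes (U+0027, U+0022) with typographic quotes
// (U+2019, U+201C/U+201D) before a string reaches a web page.
public class SmartQuotes {
    public static String smartQuotes(String s) {
        // Apostrophes inside words: l'an 2000 -> l’an 2000
        String result = s.replaceAll("(?<=\\p{L})'(?=\\p{L})", "\u2019");
        // Pair remaining double quotes as “...”
        StringBuilder sb = new StringBuilder();
        boolean open = true;
        for (char c : result.toCharArray()) {
            if (c == '"') {
                sb.append(open ? '\u201C' : '\u201D');
                open = !open;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(smartQuotes("l'an 2000"));
    }
}
```

Real text is messier than this heuristic (nested quotes, apostrophes at word boundaries), but even a simple pass like this removes the escaping hazards described above.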

when good strings go bad: Mojibake
Instead of me laboriously trying to explain and find examples of character corruption, Wikipedia will do the explaining: Mojibake around the world.

garbled download filenames
A variation on a still common theme, common in Firefox if you do not use a base64-encoded filename in the CONTENT-DISPOSITION header. The MIME-encoded string has the form =?charset?B?...?=, where B stands for base64. So, it should look more or less like this: CONTENT-DISPOSITION: inline; filename==?utf-8?B?44GC44GE44GG44GI44GK44GL44GN44GP44GR44GTLnhscw==?=
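Producing that encoded word is a one-liner with the JDK's Base64 class. A minimal sketch (the class and method names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: build the RFC 2047 "B" (base64) encoded-word for a download
// filename, as used in the CONTENT-DISPOSITION header above.
public class FilenameEncoder {
    public static String encodeWord(String filename) {
        String b64 = Base64.getEncoder()
                .encodeToString(filename.getBytes(StandardCharsets.UTF_8));
        return "=?utf-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        System.out.println("CONTENT-DISPOSITION: inline; filename="
                + encodeWord("あいうえおかきくけこ.xls"));
    }
}
```

Running this on the Japanese filename above reproduces exactly the header shown in the text.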

for entities' sake
Do not use entity references unless you absolutely have to. Which means, use entity references only to escape markup inside xml, for example: <mystring>This is a &lt;b&gt; bold &lt;/b&gt; statement</mystring>

Why you should never use them to output text to a web page:
  • The page source is no longer human-readable for many languages.
  • They can easily mask a bad "content-type". If you do not set the charset and the server defaults to iso-8859-1, your entities will display correctly, but:
  • Form submission may be bad.
  • Other software that consumes this page may run into errors.
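A minimal escaper that does only what the rule above allows, namely escaping markup-significant characters and nothing else (the class and method names are made up):

```java
// Sketch: escape only the markup-significant characters when embedding text
// in xml, instead of converting every non-ASCII character to an entity.
public class XmlEscape {
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c); // non-ASCII passes through untouched
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("<mystring>"
                + escape("This is a <b> bold </b> statement") + "</mystring>");
    }
}
```

Note that accented or Asian characters pass through unchanged; only &, < and > become entity references.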

to sort or to collate
Displaying sorted lists of text can be done in different ways. If your entries come from a database, you can rely on whatever collation is set for it, show the list in its binary sort order, or try to sort in accordance with the language of the user. You can download a simple web application for text sorting right here.
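The difference between binary sort and language-aware collation is easy to show in Java with java.text.Collator; the German word list here is just an illustration:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Sketch: binary sort versus language-aware collation for a German word list.
public class SortDemo {
    public static void main(String[] args) {
        String[] words = { "Zebra", "Äpfel", "Apfel" };

        String[] binary = words.clone();
        Arrays.sort(binary); // code-point order: Ä (U+00C4) sorts after Z

        String[] collated = words.clone();
        Arrays.sort(collated, Collator.getInstance(Locale.GERMAN)); // Ä sorts with A

        System.out.println(Arrays.toString(binary));   // [Apfel, Zebra, Äpfel]
        System.out.println(Arrays.toString(collated)); // [Apfel, Äpfel, Zebra]
    }
}
```

A German user expects Äpfel next to Apfel; the binary sort puts it after Zebra, which is exactly the kind of bug this section is about.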

duplicate translations
Issue: files coming back from translation have duplicate translations. In other words, you have two different English strings but they have the same translation.
Solution: The generic advice is to never make your ui dependent on uniqueness of strings.

There are instances where this is correct behavior. An over-simplified example: en1=Goto and en2=Go to, with translation1=Gehe nach and translation2=Gehe nach.

There are other instances where the English is artificially "unique", for example when you deal with currencies.

And then there are instances where it is simply a translation error.

The problem is compounded when a data loading utility is used and uniqueness constraints are enforced. The solution for uniqueness constraints is:
You need to write or provide a utility that is file based and that can check a file for uniqueness violations before import. It is critical that all violations in a given file are found in a single run. Many tools, for example, xml parsers, will stop at the first error, and this can lead to numerous, time consuming iterations.
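A sketch of such a file-based check, reduced to simple key=value lines (the class and method names are made up); the point is that it collects every violation before reporting, rather than stopping at the first:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: scan key=value lines for uniqueness violations and report ALL of
// them in one pass, instead of stopping at the first error the way many
// parsers do.
public class UniquenessCheck {
    public static List<String> findDuplicateKeys(List<String> lines) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq <= 0) continue; // skip blanks, comments, malformed lines
            String key = line.substring(0, eq).trim();
            seen.merge(key, 1, Integer::sum);
        }
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Integer> e : seen.entrySet()) {
            if (e.getValue() > 1) violations.add(e.getKey());
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a=1", "b=2", "a=3", "c=4", "b=5");
        System.out.println(findDuplicateKeys(lines)); // reports both a and b
    }
}
```

Run before import, this turns many time-consuming iterations into a single one.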

(ir)regular expressions
Issue: Regular expressions sometimes fail with non-ASCII.

Solution: While they may have their use, a slightly more verbose approach to coding may well be better in the long run for shipped products. "Multi-byte" regular expression bugs need particularly serious testing.
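In Java, one concrete trap is that the predefined class \w is ASCII-only by default, so "word" tests quietly fail on accented text unless you ask for Unicode semantics:

```java
import java.util.regex.Pattern;

// Sketch: Java's \w is ASCII-only by default, so "word" tests quietly
// fail on non-ASCII input.
public class RegexDemo {
    public static void main(String[] args) {
        System.out.println("über".matches("\\w+")); // false: ü is not in ASCII \w

        Pattern unicodeWord = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println(unicodeWord.matcher("über").matches()); // true
    }
}
```

Tests like this one, with non-ASCII input baked in, are exactly the "particularly serious testing" such regular expressions need.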

string theory and math
In languages such as Java, C# and JavaScript, you can expect a "char" to hold one 16-bit UTF-16 code unit. Since Unicode supplementary characters need 2 of these, make sure your string math is right.
.NET: A single Char object usually represents a single code point; that is, the numeric value of the Char equals the code point. However, a code point might require more than one encoded element. For example, a Unicode supplementary code point (a surrogate pair) is encoded with two Char objects.
To work with each Unicode character instead of each Char object, use the System.Globalization.StringInfo class.
Java: "char" and code point are the respective equivalents.
JavaScript: see once upon a time (surrogates) at the end of this post
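The Java version of the string math, using the musical G clef U+1D11E as a handy supplementary character:

```java
// Sketch: length() counts 16-bit char units, not characters; supplementary
// characters such as U+1D11E occupy two chars (a surrogate pair).
public class CodePointMath {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1E"; // "a" + U+1D11E (musical G clef)
        System.out.println(s.length());                      // 3 char units
        System.out.println(s.codePointCount(0, s.length())); // 2 characters

        // Iterate by code point, not by char:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

Any code that slices strings by char index without this kind of care can cut a surrogate pair in half.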

How about some Oracle PL/SQL math:

CREATE OR REPLACE PROCEDURE mgmt$mask_sendMsg (msg IN VARCHAR2) IS
  msg1 VARCHAR2(1020);
  len  INTEGER := length(msg);
  i    INTEGER := 1;
BEGIN
  dbms_output.enable (1000000);
  LOOP
    msg1 := SUBSTR (msg, i, 255);
    dbms_output.put_line (msg1);
    len := len - 255;
    i := i + 255;
    EXIT WHEN len <= 0;
  END LOOP;
END mgmt$mask_sendMsg;
/

Issue: Until at least Oracle 11, this procedure does not work properly if the input string contains multi-byte characters.

We get a "string buffer size is too small" error from the call to dbms_output.put_line(). It works if you change the substr length to 100.

Solution: Find the Oracle reference "SQL Functions for Different Length Semantics".

databases
Issue: Text stored in a MS SQL table using "N" datatypes fails to import into a DB2 utf-8 table.
Solution: Here is a rule of thumb for using databases to store text in "char/varchar" columns:

If you only support Oracle and/or DB2: use a Unicode database (set the character set for the whole database to utf-8)

If you support Oracle and MS SQL (and DB2): use N datatypes (including clobs) and "graphic" types for DB2. This will make the RDBMS behavior more consistent.

Since every commercial RDBMS comes with its own set of challenges and documentation, go and check there for details you may never need.

pseudo translation
This involves adding language specific text (something Japanese, French, etc.) to extracted English strings and then testing "Japanese", "French" and the like. One of the most widely known benefits is the detection of hardcoded strings. But, equally important are:

detection of encoding problems (see "when good strings go bad" above)

detection of layout problems

detection of bad programming practices (for example, using an extracted string as a variable name)

Inadequate pseudo translation: never add just ASCII characters to a string. For example, 'myenstring=enter a number' would be badly "pseudo translated" as something like this:
myenstring=[PS] enter a number
myenstring=$$ enter a number $$
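A minimal sketch of an adequate pseudo-translation (the class and method names are made up, and the 30% expansion figure is just an assumption for illustration): non-ASCII text forces the string through every encoding transition, and the padding simulates translation growth.

```java
// Sketch: pseudo-translate by adding non-ASCII text and simulating expansion,
// so encoding and layout problems become visible during "Japanese" testing.
public class PseudoTranslate {
    public static String pseudo(String en) {
        // ~30% padding simulates translation expansion (assumed ratio).
        int pad = Math.max(2, en.length() * 3 / 10);
        StringBuilder sb = new StringBuilder("[\u65E5\u672C\u8A9E "); // 日本語
        sb.append(en);
        for (int i = 0; i < pad; i++) sb.append('\u00FC'); // ü padding
        sb.append(']');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pseudo("enter a number"));
    }
}
```

Keeping the English text visible inside the pseudo string lets testers still operate the ui while the surrounding characters flush out encoding, layout and hardcoding bugs.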

creative placeholders
In general, placeholders should be simple, numeric tokens of the type {0}, {1}, {2} and the like. Unfortunately, there are "standards" with a long history (such as the %s, %d from C/C++) as well as various well-intentioned, more recent approaches. Do not get creative, for example: The Accounting Pay Period yyy / zzz for the aaa entry is not open.
If your product has old or proprietary placeholders, be prepared to mitigate bugs. One way to do this would be to split one message into two in order to get around the need to re-order placeholders in a translation.
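Numbered placeholders are exactly what java.text.MessageFormat provides, and they let a translation re-order arguments freely, which positional %s-style formats cannot do without extra syntax. The German pattern below is a hypothetical translation:

```java
import java.text.MessageFormat;

// Sketch: numbered placeholders let a translation re-order arguments freely.
public class Placeholders {
    public static void main(String[] args) {
        String en = "{0} created the report {1}.";
        String de = "Der Bericht {1} wurde von {0} erstellt."; // hypothetical translation

        System.out.println(MessageFormat.format(en, "Alice", "Q3"));
        System.out.println(MessageFormat.format(de, "Alice", "Q3"));
    }
}
```

The same two arguments land in different positions in the two languages, with no change to the calling code.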

SBCS, MBCS, DBCS
In a Unicode world, you should use these terms only in a very narrow sense. For example, Microsoft MSDN says under "Support for Multibyte Character Sets (MBCS)": Multibyte character sets (MBCS) are an alternative to Unicode for supporting character sets, like Japanese and Chinese, that cannot be represented in a single byte. If you are programming for an international market, consider using either Unicode or MBCS, or enabling your program so you can build it for either by changing a switch. The most common MBCS implementation is double-byte character sets (DBCS). Visual C++ in general, and MFC in particular, is fully enabled for DBCS.
If you can spend a minute on Wikipedia Multi-byte_character_set, you are on your way to saying "Chinese, Japanese, Korean" or "CJK" instead of MBCS when you really mean CJK. Also, please note that modern "MBCS" encodings also contain Western European, Russian, Greek, etc. characters - just not at the code points where you would expect them.

fonts: squares
Issue: Users see squares instead of the desired characters or glyphs.
Solution: Check what fonts the application is calling. If possible, use the "generic" (aka. logical) name "sansSerif" or "serif". You can find a font test tool in the Downloads section of this web site. Browsers are generally smarter than desktop applications when it comes to displaying content.

fonts: question marks
Issue: Users see question marks instead of the desired characters or glyphs.
Solution: Check that encoding conversion is done correctly and encoding is specified at all transition points.

american == english?
The Oracle NLSGDK (at least as of 2012) allows developers to map languages and locales (Oracle traditionally uses different names than Java). The result of the mapping may not always be as expected:

Locale locale = new Locale( "en" );
String oraLang = LocaleMapper.getOraLanguage( locale );
String oraLangFromJavaLang = LocaleMapper.getOraLangFromJavaLang( locale.toString() );
System.out.println("oraLang = " + oraLang);
System.out.println("oraLangFromJavaLang = " + oraLangFromJavaLang);
The output is:
oraLang = AMERICAN
oraLangFromJavaLang = ENGLISH

Outputting xml without an xml parser
We have all done it at some point, especially in the earlier days when parser implementations were often unwieldy and sucked. We built “xml” as strings in memory and wrote them out using simple file output with an OutputStream. The XML never saw a parser or supporting classes (like a DocumentBuilder).
The problem is that this approach works until the stream is no longer just ASCII, at which point everything falls apart.
So, do yourself a favor and tell the scrum masters that they do want controlled input and output. I, for one, will be forever grateful to the folks who wrote the JDOM package because it was more than adequate for many use cases.
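For completeness, the same controlled output is available in the plain JDK as well. A minimal sketch (the class and method names are made up): the DOM and Transformer classes handle escaping and the encoding declaration for you.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: let the XML APIs that ship with the JDK do the escaping and the
// encoding declaration, instead of concatenating "xml" strings by hand.
public class XmlOut {
    public static String write() throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.newDocument();
        Element root = doc.createElement("mystring");
        root.setTextContent("This is a <b> bold </b> statement"); // escaped for us
        doc.appendChild(root);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(write());
    }
}
```

The markup in the text content comes out escaped, and the declaration carries the encoding, with no hand-written string concatenation anywhere.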

Naming your files
Another seemingly trivial aspect which merits enforcement is paying attention to the naming of files as they are created. Unfortunately, IDEs provide default names that are too easily accepted, for example “resource.properties” or “labels.java”. That is enough for the IDE to keep track of strings and to know where it can find them. But just imagine how one team felt when they saw that their great new product had all the ui strings and messages in some 30 files that all had the name “labels.properties”. If you run into such a problem and your project manager or dev manager has other things to do than make coders change the filenames and the calling code, it will be you, the internationalization or globalization people, who will suffer for years to come. You will soon be sick and tired of having to dig through files because a bug is filed against one “labels.properties”, and you will have to find out which one.
CSS, a matter of style
For any browser based rendering, use external css stylesheets. They can be easily adapted to internationalization needs. For example, the display font size for most Asian languages needs to be slightly larger than for Western languages; usually an increase of 1pt is enough. And since you have the styles defined externally, it is easy to change that, too.
Issue: I don't know how many stylesheets I need, one per locale?
Solution: One per language is enough. In a more colloquial way of saying it: Austrians and Swiss will be happy with the same font size. Depending on your specific ui (how cluttered, how busy), you may very well be able to use a single stylesheet for most or all of your supported languages.

XLIFF resources as replacement for Java .properties and ListResourceBundles
For several years now, Java has had a ResourceBundle.Control class, which provides you with so much flexibility that I sometimes shake my head at the reluctance to adopt it. The XLIFF resources format, originally intended as an interchange format, has been adopted quite widely and can even be used as a drop-in replacement for Java Properties and ListResourceBundle (LRB).
For the sake of full disclosure: Yes, I am aware that no "real objects", as the javadoc of ListResourceBundle calls them, can be in the LRB if you want to use XLIFF as a replacement. But then, finding “real objects” in resource bundles has been a pet peeve of mine for 20 years and is one of the low points of internationalization architecture. And yes, the much touted ability to catch potential runtime errors through the fact that you have to compile LRBs before use is another pet peeve of mine.
But how do you use an XLIFF or any other xml as a replacement?
Play with it and thank the Java folks for a very useful feature. There have been two main objections with regard to this class. The first one was that the fallback mechanism did not work or was hard to use. That is a misconception; fallback works. The second, more justified one is performance and footprint. For very small applications in the mobile or embedded areas, yes, you will want to consider this. But for anything else, especially those resource-hogging enterprise suites, I have only one comment. Let us assume, hypothetically, that we have a large application and that loading a resource file takes an additional few milliseconds. Now, if someone tells me that this is an unacceptable drop in performance, my answer will be “please find the true performance issues in the base code before you dismiss the use of ResourceBundle.Control”.
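A minimal sketch of the mechanics: subclass ResourceBundle.Control, declare a custom "xml" format, and load the file yourself in newBundle. For brevity this sketch reads the java.util.Properties XML format from a directory rather than parsing real XLIFF, and the class name is made up; the Control plumbing is the point.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.Properties;
import java.util.ResourceBundle;

// Sketch: serve XML resource files as ResourceBundles via ResourceBundle.Control.
public class XmlBundleControl extends ResourceBundle.Control {
    private final File dir; // directory holding <baseName>_<locale>.xml files

    public XmlBundleControl(File dir) { this.dir = dir; }

    @Override public List<String> getFormats(String baseName) {
        return Collections.singletonList("xml");
    }

    @Override public ResourceBundle newBundle(String baseName, Locale locale,
            String format, ClassLoader loader, boolean reload) throws IOException {
        if (!"xml".equals(format)) return null;
        File f = new File(dir, toBundleName(baseName, locale) + ".xml");
        if (!f.isFile()) return null; // lets the normal fallback chain continue
        final Properties props = new Properties();
        try (InputStream in = new FileInputStream(f)) {
            props.loadFromXML(in); // stand-in for a real XLIFF parser
        }
        return new ResourceBundle() {
            protected Object handleGetObject(String key) { return props.getProperty(key); }
            public Enumeration<String> getKeys() {
                return Collections.enumeration(props.stringPropertyNames());
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Demo: write a sample bundle file, then load it through the Control.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        Properties sample = new Properties();
        sample.setProperty("ABORTED", "Job aborted");
        try (OutputStream out = new FileOutputStream(new File(dir, "DemoMsg.xml"))) {
            sample.storeToXML(out, null);
        }
        ResourceBundle rb = ResourceBundle.getBundle("DemoMsg", Locale.ROOT,
                ClassLoader.getSystemClassLoader(), new XmlBundleControl(dir));
        System.out.println(rb.getString("ABORTED"));
    }
}
```

Swapping in an XLIFF parser only changes the body of newBundle; the lookup, caching and fallback all stay with ResourceBundle.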

Using an interface to call (text) resources
Standardization of resource strings and ease of use in source code continue to be big issues. Some improvements, such as using a central database and having gatekeepers who must approve new strings, have been more or less successful. One contribution to a solution for big projects is the use of interfaces to declare constants that can then be called easily.
How to do this:
1. Create an Interface and declare strings as public static final, for example, public static final String NO_USER_LOGIN = "NO_USER_LOGIN";
2. If you use LRBs, add an entry like this:
    { BaseConfigMsgID.NO_USER_LOGIN, "No valid login found for user {0}" }, where  BaseConfigMsgID is your Interface.
   If you use a .properties bundle, add your message like this:
    NO_USER_LOGIN="No valid login found for user {0}"  

The code that calls the message is identical and goes like this: bundle.getString(BaseConfigMsgID.NO_USER_LOGIN);

The beauty of this, in my view, is that developers can easily reference the correct message through autocomplete in modern development environments and even more so the fact that you can switch between LRBs and Properties bundles so easily.
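Putting the two steps above together in one compilable sketch (the interface, class and key names follow the example in the text but are otherwise hypothetical):

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

// Step 1: an interface declaring the message keys as constants.
interface BaseConfigMsgID {
    String NO_USER_LOGIN = "NO_USER_LOGIN";
}

// Step 2: an LRB whose entries are keyed by those constants.
public class BaseConfigMsgs extends ListResourceBundle {
    protected Object[][] getContents() {
        return new Object[][] {
            { BaseConfigMsgID.NO_USER_LOGIN, "No valid login found for user {0}" },
        };
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new BaseConfigMsgs();
        // The calling code references the constant, so autocomplete works
        // and typos become compile errors instead of runtime surprises.
        String msg = MessageFormat.format(
                bundle.getString(BaseConfigMsgID.NO_USER_LOGIN), "scott");
        System.out.println(msg);
    }
}
```

The same constant works unchanged if the bundle is later switched from an LRB to a .properties file.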

Location, location, location
The standard real estate phrase also applies to Java software real estate.  An astounding number of programs insist on packaging resources in the same .jar/.war as the code.  This is partly because of IDE defaults, partly because of inadequate research into lookup and loading.
Yet, there is absolutely no need to do this. On the contrary, if you separate them out, you have already made a huge step to facilitate adding new languages to an existing program.
Java web applications will easily load resources from jars in the lib directory. And since jars, wars and ears are "virtual file systems", you can really put them anywhere. Stick the resources into the lib folder, and then all you need in the ui code is a short snippet, shown here for jsf:

<f:view  locale="ja">
   <f:loadBundle basename="myapp.mycomponent.msg.IntroMsg" var="msg"/>
      <h1><h:outputText value="#{msg.ABORTED}" /></h1>
</f:view>
It does not get much better or, big development dream, much more elegant.

Plurals and gender issues
Attempts have been made to provide ways to handle more complex formatting operations, for instance, plural forms, ranges of numbers, etc. You should avoid these because of some inherent issues with translation. Stick to the simplest possible way of doing resources with placeholders. Mozilla has examples for using PluralForms, but plurals are not the only aspect of a language. You will find gender questions, questions of case and other choices, such as agreement between different parts of speech.
What is wrong with the following?
  • seconds=second;seconds
  • seconds=sekunda;sekundy;sekund
Issue: You cannot go from the English with its two forms to the Polish with three by translation. Translators will not add or remove items; they translate what is there.
Issue: The approach above turns a properties file into a nested format by adding a secondary delimiter. Asian translators can and will frequently replace the ASCII semicolon with an ideographic one (outside of the ASCII range, showing up in the translation as \uxxxx).
Solution: Use more than one key=value pair. Never ever chain values together as shown above.
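A sketch of the one-pair-per-form approach for the Polish example above. The key suffixes (one/few/many) and the Polish selection rule follow CLDR conventions as I understand them; a real product would use a library such as ICU4J's PluralRules rather than hand-coding the rule.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: one key=value pair per plural form, with the language-specific
// selection rule kept in code, never chained inside one value.
public class Plurals {
    static Map<String, String> pl = new HashMap<>();
    static {
        pl.put("seconds.one",  "sekunda");
        pl.put("seconds.few",  "sekundy");
        pl.put("seconds.many", "sekund");
    }

    // Polish: 1 -> one; 2-4 (but not 12-14) -> few; everything else -> many
    public static String polishSeconds(int n) {
        String form;
        if (n == 1) form = "one";
        else if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 12 || n % 100 > 14)) form = "few";
        else form = "many";
        return n + " " + pl.get("seconds." + form);
    }

    public static void main(String[] args) {
        System.out.println(polishSeconds(1));  // 1 sekunda
        System.out.println(polishSeconds(3));  // 3 sekundy
        System.out.println(polishSeconds(5));  // 5 sekund
        System.out.println(polishSeconds(13)); // 13 sekund
    }
}
```

Each form is a separate key that a translator can fill in independently; no semicolon-delimited list ever reaches the translation files.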

A usage based view of xml
Those of you who had the pleasure of working with SGML markup long before html and xml came around can skip this section.

While many discussions of xml over the years have focused on issues of “the spec” versus “the implementation”,  I have found it useful to distinguish between two usage scenarios.

Scenario 1: data exchange between applications, no transformation. If xml is used exclusively to transfer data, for example in SOAP or as an export/import format, few constraints apply to parsing and formatting. For example, you can very well ignore "pretty" settings that produce nice indentation. And while I am wary of certain implementations when it comes to the usage described in Scenario 2, in the pure data exchange scenario you can safely use standards-compliant but weird implementations like the Oracle XMLPrintDriver class in the v2 parser.

Scenario 2: human interaction or transformation. If there is any chance that humans will interact with the xml source, or if some transformation is performed, you should add the niceties and some hints that are not strictly required for parsing or processing xml. The general guideline for the treatment of XML that is sometimes (or all the time) destined for direct human consumption or further processing is: no gratuitous changes.

Common examples include:
  • Add the encoding statement even for utf-8 content. XML defaults to utf-8, and some purists decry having the declaration with the encoding statement. But I have seen so many bugs that could have been avoided, or made easier to detect, had the developers added the encoding statement. Do yourself and your fellow humans a favor and always add it.
  • No escape from the quotes question. The spec does not really care whether you use single quotes or double quotes for attributes, etc. But what about quotes in text content, for example in embedded html markup? The "spec focus" would say: they do not matter there either, so I can change them in any way I like.
    But if you use xml to store ui resources, you may not know what happens to these strings once they have been loaded into the application. Or worse, your code and ui strings will be consumed by third party components, leaving those developers with the chore of figuring out what you did in your xml. What if quotes in embedded html are now double quotes and break javascript? Leave existing quotes unchanged if you can.
  • "The spec" allows an embedded greater-than sign (>) to be either escaped or not. About 99.99 percent of standard implementations are aware that the > will frequently be paired with a < and treat both the same way. But some implementations do not. For example, the Oracle XMLPrintDriver will not escape the >. Always escape both embedded < and >.
Default display language
It is a simple question: what do you show on a screen if you are asked to display Chinese, or Spanish, or English? It has to do with the different versions of language and script. An easy way to try and settle this is to look at English: we have come to pretty much accept seeing the American version of this language in software. Since politics has a way of making life hard, I argue with numbers. There are some 1.2 billion users of “Simplified Chinese” versus some 150 million of “Traditional Chinese”. So, in terms of the base language, I will fight to display “Simplified Chinese” if all that matters of a locale is the language part (in this case “zh”). And my argument for Portuguese is the same: “pt” will have Brazilian text. If you need the version used in Portugal, you use pt-PT.
This argument will generally win unless:
  • Someone has a politico-cultural war to fight. I have personally experienced a bad fight over “Chinese” because the internationalization chiefs all came from “Traditional” backgrounds. To this day, it is the dirtiest fight over two letters I have seen.
  • Sales numbers are held against you. If the sales folks say “we have no customers in mainland China, so we want Chinese to mean Traditional”, your best hope is that you can convince development to keep the design open for the future.

once upon a time (surrogates)
Once upon a time, the notion of more than 64K characters, glyphs or whatever you want to call them, seemed frivolous. And then someone decided to combine some unused code units, and surrogates were born. And with them, parsing strings by/into chars became a little bit more complicated.

size matters
Be nice to your translators and tell them they can use abbreviations if your code does not allow text to expand enough. While expansion is preferred, there will be times when it cannot be done. Let them use abbreviations and, on your side, make sure that punctuation marks in your strings do not have any functional meaning in your code.


(Published with permission of the original author) 


