Localization pitfalls

Posted by waldner on 24 April 2010, 1:42 pm

I'm pretty sure this has happened to you at least once (it surely happens a lot, at least judging by the number of people who have trouble with this):

$ echo 'abcd' | sed 's/[A-C]/X/g'
aXXd     # WTF?
$ echo 'ABCD' | awk '/[a-z]/{print "found a match"}'
found a match     # WTF^2 ?

If it hasn't happened to you yet, then you've been lucky. This is one of those "features" that might go unnoticed for a long time, and bite you when you least expect it.

So what's going on here? Welcome to the weird and wonderful world of localization.

Once upon a time everything was ASCII, and peace and harmony reigned in UNIX tools. Sorting was done based on ASCII codes, and bracket expressions in regular expressions like [A-Z] matched the obvious set [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. Things were predictable, and scripts worked as expected.

But gradually all this changed. People started using different alphabets and character sets on their computers, and in these new environments the good old rules began to lose most or all of their validity (think for example of languages like Chinese or Russian). An immediate consequence of this was the proliferation of all sorts of character encodings, which ultimately led to the introduction of Unicode, which is supposed to be a single, huge character set that includes every conceivable character in use in any human script or language, with many rules that dictate how each of these characters is encoded when used electronically (the UTF-8 encoding, which is the most used, is variable-length, and the length of an encoded character depends on "where" in the Unicode set the character is; if you're curious, see this page to get more information on Unicode and UTF-8 than you'd ever want to know. UTF-8 is backwards-compatible with ASCII, so a valid ASCII file is also a valid UTF-8 file; this was a wise decision, to avoid breaking tons of existing tools and scripts).

But a new language needs more than just an encoding for its characters; every country has its own rules regarding for example date and time format, or numeric and monetary representation (think commas vs. dots to separate decimals and so on), or collation order, which is the definition of how characters and symbols sort, and their relative order. Even if this last one may not seem to be an issue, it definitely is. To begin with, for languages that use symbols that are not part of ASCII/ISO-8859-1, it should be decided where those symbols fit in the big picture; secondly, different languages may have different relative ordering for the same symbols; then, some of those languages may not use some of the Latin characters at all, and may also have a completely different concept of vowels or consonants, or of uppercase and lowercase letters. As an example of the first problem, consider common symbols found in European languages like accented letters ("à", or "é"), or letters bearing umlaut or circumflex like "ä" or "ê", or other symbols used in northern languages like "ø" or "å" and so on (and we're deliberately ignoring totally different character sets like those used in Japanese). Where do they fit? Does "ü" sort before or after the normal "u"? It turns out that there's no single answer to this question, which leads us to the second problem mentioned above. Different languages sort the same symbols differently; to stay with the same letter, in German (phone-book ordering) "ü" sorts like its corresponding expansion "ue", while in Estonian all the umlauted letters ("ä", "ö" and "ü") sort after "w". On top of that, in many cases within the same language there are special cases and/or exceptions. Like it or not, that's the way things have evolved over hundreds of years. This Wikipedia page provides a good overview of collation for languages that use Latin-derived symbols.

The set of settings that defines how to handle language-dependent matters like those described above is called a locale. In the POSIX definition:

A locale is the definition of the subset of a user's environment that depends on language and cultural conventions. It is made up from one or more categories. Each category is identified by its name and controls specific aspects of the behavior of components of the system. Category names correspond to the following environment variable names:

LC_CTYPE
Character classification and case conversion.
LC_COLLATE
Collation order.
LC_MONETARY
Monetary formatting.
LC_NUMERIC
Numeric, non-monetary formatting.
LC_TIME
Date and time formats.
LC_MESSAGES
Formats of informative and diagnostic messages and interactive responses.

Having categories within the locale means that it is possible to change a category without affecting the rest, so one could have a French date and time format, but use an English collation order (this is just an example of course). Each category is set using an environment variable of the same name. Additionally, there's another variable, LC_ALL, that can be set and overrides all the other specific values (it's like simultaneously setting all the categories to the same value).
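The precedence between these variables can be sketched as a small shell function. This is only an illustration: the name effective_locale is made up here, and the fallback to LANG (used when neither LC_ALL nor the category variable is set) comes from POSIX rather than from the paragraph above. What happens when nothing at all is set is implementation-defined; the sketch simply assumes "C".

```shell
# effective_locale CATEGORY
# Hypothetical helper: print the locale name that would apply to CATEGORY
# (for example LC_COLLATE), following the precedence LC_ALL > LC_xxx > LANG.
# Empty variables are treated as if they were unset.
effective_locale() {
    # Indirectly read the variable whose name is in $1.
    eval "category_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the real default is implementation-defined; assume C.
        printf 'C\n'
    fi
}

# Show which locale currently governs sorting in this shell.
effective_locale LC_COLLATE
```

The function only inspects variables; it does not check whether the named locale is actually installed.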

Typically, locales installed in a UNIX system have names like en_GB, fr_CA or de_DE.utf8. Roughly speaking, that defines the language (e.g. fr for French), the country (e.g. CA for Canada), and optionally an encoding (for example, ISO-8859-1 or UTF-8). To see what locale you're currently using, just type locale (don't worry, we're slowly getting back to the problem we described at the beginning):

$ locale
LANG=en_GB.utf8
LC_CTYPE="en_GB.utf8"
LC_NUMERIC="en_GB.utf8"
LC_TIME="en_GB.utf8"
LC_COLLATE="en_GB.utf8"
LC_MONETARY="en_GB.utf8"
LC_MESSAGES="en_GB.utf8"
LC_PAPER="en_GB.utf8"
LC_NAME="en_GB.utf8"
LC_ADDRESS="en_GB.utf8"
LC_TELEPHONE="en_GB.utf8"
LC_MEASUREMENT="en_GB.utf8"
LC_IDENTIFICATION="en_GB.utf8"
LC_ALL=en_GB.utf8

(Linux implements some more categories besides those dictated by the standard). Here we're using the same locale, namely en_GB.utf8, for all the categories (which is what happens normally). If we don't say otherwise, the values of $LC_ALL, $LC_CTYPE etc. seen by the programs are those shown above. We can temporarily override them by doing for example

$ LC_ALL=anotherlocale command

and that's what we're going to do in the following examples.
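As a quick check of how far such a temporary override reaches, here is a small sketch. It assumes nothing beyond a POSIX shell and the C locale; the variable names before, inside and after are just for illustration.

```shell
# Remember what LC_ALL looks like now (the word "unset" if it is not set).
before=${LC_ALL-unset}

# The assignment applies only to the child process started on this line.
inside=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')

# Back in the calling shell, LC_ALL is whatever it was before.
after=${LC_ALL-unset}

printf 'inside=%s before=%s after=%s\n' "$inside" "$before" "$after"

# In a pipeline, the assignment goes in front of the command that should
# see it (here sort, not printf).
printf 'b\nA\na\nB\n' | LC_ALL=C sort
```

The last command prints A, B, a, b (one per line), because sort is the command that runs under the C locale.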

Now, all this lengthy introduction is important because virtually all the common UNIX shell tools and utilities are affected by the locale in use. An obvious example would be sort, but even seemingly innocuous commands like ls are very locale-dependent (ls has to sort the files it displays, and with the -l option it also has to write out date and time information).

To play around a bit, you can change the locale and see how the behavior of the common tools changes. First, let's check what locales are available on the system using locale -a:

$ locale -a
C
en_GB
en_GB.iso88591
en_GB.utf8
fr_FR
fr_FR@euro
es_ES.iso88591
es_ES.iso885915@euro
it_IT.utf8
POSIX

Your system may have many more than those. So let's start easy and take a blatantly locale-dependent command like date:

$ LC_ALL=fr_FR date
sam avr 24 10:49:22 BST 2010
$ LC_ALL=en_GB date
Sat Apr 24 10:49:30 BST 2010
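Since a locale name that is not installed may simply be ignored, it can be worth checking the output of locale -a before relying on an override. A minimal sketch follows; have_locale is a made-up helper name, and the comparison is literal, so en_GB.UTF-8 would not match a listed en_GB.utf8 even on systems that accept both spellings.

```shell
# have_locale NAME
# Hypothetical helper: succeed if NAME appears verbatim as a full line in
# the output of "locale -a"; fail otherwise (or if locale is unavailable).
have_locale() {
    locale -a 2>/dev/null | grep -F -x -q -- "$1"
}

wanted=fr_FR
if have_locale "$wanted"; then
    LC_ALL=$wanted date
else
    printf 'locale %s is not installed, using the current one\n' "$wanted" >&2
    date
fi
```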

Also the encoding used is important for commands like wc that count characters, because in certain locales some characters can be longer than one byte (UTF-8 encodes characters with a variable number of bytes, up to 4):

$ printf 'abc€' | LC_ALL=en_GB wc -m
6
$ printf 'abc€' | LC_ALL=en_GB.utf8 wc -m
4

So with a non-UTF-8 locale, wc assumes that each character is one byte, and reports 6 characters (the euro symbol is encoded using 3 bytes); with a UTF-8 locale, wc correctly decodes the three bytes that encode the euro symbol into a single character, and reports a total of 4 characters.
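The same comparison can be scripted without hard-coding a locale name. The sketch below writes the euro sign as explicit UTF-8 bytes and picks whatever UTF-8 locale locale -a lists first; which locale that is (if any) is an assumption about your system, and the variable names are illustrative.

```shell
# "abc" followed by the euro sign as explicit UTF-8 bytes (E2 82 AC), so the
# example does not depend on the encoding of this file.
sample='abc\342\202\254'

# Byte count: independent of the locale.
bytes=$(printf "$sample" | wc -c | tr -d '[:space:]')
printf 'bytes: %s\n' "$bytes"

# Pick the first UTF-8 locale listed by "locale -a", if there is one.
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1) || true

if [ -n "$utf8_locale" ]; then
    # Character count: wc -m decodes the input according to the locale.
    chars=$(printf "$sample" | LC_ALL="$utf8_locale" wc -m | tr -d '[:space:]')
    printf 'characters under %s: %s\n' "$utf8_locale" "$chars"
else
    printf 'no UTF-8 locale found; only the byte count is meaningful here\n'
fi
```

The byte count is 6; with a UTF-8 locale the character count should come out as 4, matching the transcript above.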

Now let's take a step further. For our purposes, two important locale-related variables are LC_CTYPE and LC_COLLATE. LC_CTYPE, roughly, defines which characters belong to the character classes used especially in regular expressions, such as [:upper:] or [:digit:]. LC_COLLATE defines sort ordering. Here's an example of LC_COLLATE in action (closer and closer to the original weirdness):

$ LC_ALL= LC_COLLATE=en_GB.utf8 ls -1
Afile.doc
anotherfile.txt
Bigfile.txt
Cartoons.png
count.sh

(We need to unset LC_ALL, otherwise it overrides whatever value we give to LC_COLLATE. You could of course have got the same effect by just setting LC_ALL=en_GB.utf8, but the example is meant to show that it's LC_COLLATE that affects sorting order.) So, anything strange in the above output? It seems that the sorting order is not right. Well, that's the sorting order defined for the en_GB.utf8 locale. Is it weird? Probably, but nonetheless it's so. What happens in most locales is that the good old ASCII sorting order has been abandoned for the reasons described earlier, and instead characters are sorted based on different criteria. One of these criteria is, roughly, that characters that are basically the same letter (like "a", "A", "ä", "Å", "â" etc.) sort together (dictionary sort), and, within the group, there are secondary, tertiary etc. criteria (like uppercase-ness, accents etc.) that break the tie and define a sort order for symbols that would tie in the upper-level criteria. Now the problem is that these rules are quite complex, and again in most cases obviously locale-dependent. What's more important, their focus is on relative ordering rather than absolute; that is, given two symbols A and B, the rules make it possible to determine unambiguously which sorts first (so it's determined whether A < B or B < A). For more (scary) information, see the Unicode Collation Algorithm and the ISO/IEC 14651 document on "International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering" (and good luck).

Now this is all nice and dandy, but unfortunately the focus on relative ordering makes it very hard to work out the actual absolute ordering of symbols in the current locale. In fact, I haven't been able to find a way to obtain such a sequence; if somebody has, more information on how to do that would be more than welcome.
However, we can do something to get an idea of what that ordering is. The following Perl code, found in "perldoc perllocale", generates the first 256 characters (codes 0 to 255) in the current locale (you can extend that), takes the so-called "word" characters, and sorts them. The resulting output gives a good approximation of how the current locale sorts:

$ LC_ALL=en_GB.utf8 perl -e 'use locale; print +(sort grep /\w/, map { chr } 0..255), "\n";'
_0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
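A rough shell counterpart, for when Perl is not at hand, is to hand a few characters to sort and see in which order they come back. This is only a sketch: collation_order is a made-up helper, and it shows the relative order of just the characters passed to it.

```shell
# collation_order LOCALE CHAR...
# Hypothetical helper: sort the given one-character strings under LOCALE and
# print them on a single line, in the order that locale's collation gives.
collation_order() {
    loc=$1
    shift
    printf '%s\n' "$@" | LC_ALL="$loc" sort | tr -d '\n'
    printf '\n'
}

collation_order C a A b B c C
# In the C locale this prints: ABCabc

# Try the same with a locale listed by "locale -a", for example:
# collation_order en_GB.utf8 a A b B c C
```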

Hopefully, now we're starting to see where our initial problem comes from. If you look carefully, you'll see that range expressions like [A-C] or [a-z] that we were using include many more characters than we were expecting! For example, in the en_GB.utf8 locale, according to the above output, [A-C] includes at least "A", "b", "B", "c" and "C" (that's why the lowercase "a" wasn't replaced by sed). Similarly, it's easy to see that [a-z] matches much more than lowercase letters.
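To see what a range such as [A-C] actually covers on a given system, one can test candidate characters one at a time. The helper below is a hypothetical sketch that uses grep for the test; sed, awk and other tools may interpret ranges differently, so treat the result as an indication only.

```shell
# range_members LOCALE BRACKET_EXPR CHAR...
# Hypothetical helper: print, on one line, those CHARs that match
# BRACKET_EXPR when grep runs under LOCALE.
range_members() {
    loc=$1
    pattern=$2
    shift 2
    for ch in "$@"; do
        if printf '%s\n' "$ch" | LC_ALL="$loc" grep -q -- "$pattern"; then
            printf '%s' "$ch"
        fi
    done
    printf '\n'
}

range_members C '[A-C]' a b c d A B C D
# In the C locale this prints: ABC

# With another locale the result may differ, for example:
# range_members en_GB.utf8 '[A-C]' a b c d A B C D
```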

And now we're finally ready to see how to handle modern locales as safely as possible in scripts and command line programs. There are a few tips that can make our life easier, although each has some issues.

There is a special locale, the "C" locale (also known as the "POSIX" locale), which should hopefully be available on all modern systems. It emulates the good old ASCII days: under it, collation and character classes behave as they used to. Here is a demonstration:

$ LC_ALL=C perl -e 'use locale; print +(sort grep /\w/, map { chr } 0..255), "\n";'
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
$ LC_ALL= LC_COLLATE=C ls -1
Afile.doc
Bigfile.txt
Cartoons.png
anotherfile.txt
count.sh
$ echo 'abcd' | LC_ALL=C sed 's/[A-C]/X/g'
abcd
$ echo 'ABCD' | LC_ALL=C awk '/[a-z]/{print "found a match"}'
$

So, using the C locale would seem to be the solution to all our problems. Just start all your scripts with

#!/bin/sh
export LC_ALL=C

and live happy. Perhaps you may think of setting it as your default locale, so the command line programs would work fine too.
Unfortunately, things are not so easy. The C locale is very bare-bones, and some locale-dependent commands whose behavior you've got used to (maybe without knowing) may work differently (try ls -l with different locales and look at the date and time format in the output). Probably the biggest problem with using the C locale is that no UTF-8 support is present, so all the programs that should count characters default to counting bytes instead, producing odd results when the text contains multibyte-encoded characters (which is common with UTF-8 data), like the wc example we saw earlier. But it can get worse than that; see for example this sed program:

$ echo 'foo€bar' | LC_ALL=C sed 's/o.b/X/'
foo€bar
$ echo 'foo€bar' | LC_ALL=en_GB.utf8 sed 's/o.b/X/'
foXar

Here sed completely misses the match under the C locale, as it thinks that the character matched by "." should be one byte, which here is not true. It's easy to imagine similar or even weirder results if for example a multibyte character is only partially matched and replaced:

$ echo 'foo€bar' | LC_ALL=C sed 's/o../X/'
fX��bar
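Because the C locale works byte by byte, it can help to check whether some data contains any non-ASCII bytes before forcing LC_ALL=C on the tools that process it. A minimal sketch, where non_ascii_bytes is a made-up helper name:

```shell
# non_ascii_bytes
# Hypothetical helper: read standard input and print how many bytes fall
# outside the ASCII range (octal 000-177). tr runs under LC_ALL=C so that
# it deletes bytes, not decoded characters.
non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d '[:space:]'
    printf '\n'
}

# "foo", the euro sign as UTF-8 bytes (E2 82 AC), then "bar"
printf 'foo\342\202\254bar\n' | non_ascii_bytes
# prints 3

printf 'plain ascii\n' | non_ascii_bytes
# prints 0
```

A script could use a count of 0 as the condition for running a command under the C locale, and fall back to the user's locale otherwise.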

In short, the C locale may work for you if you exclusively deal with ASCII data. This is less and less the case these days. You could try a hybrid approach, where you unset LC_ALL, set LC_COLLATE and LC_CTYPE to "C", and all the other locale variables to a locale of your choice. This may or may not work well for you; I don't know how having LC_COLLATE set to "C" is going to interact with esoteric character sets. Also, the documentation for GNU sort has a footnote, which you should probably read carefully:

(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to. In that case, set the `LC_ALL' environment
variable to `C'. Note that setting only `LC_COLLATE' has two problems.
First, it is ineffective if `LC_ALL' is also set. Second, it has
undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
set to an incompatible value. For example, you get undefined behavior
if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
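For reference, the hybrid setup could look like the sketch below at the top of a script. It sets LC_COLLATE and LC_CTYPE together, so the two stay compatible with each other, and leaves the remaining categories to whatever the environment provides.

```shell
#!/bin/sh
# Sketch of the hybrid setup: collation and character classification use the
# C locale, everything else keeps the locale inherited from the environment.

# A non-empty LC_ALL would override the two assignments below.
unset LC_ALL

LC_COLLATE=C
LC_CTYPE=C
export LC_COLLATE LC_CTYPE

# Sorting now follows byte values, whatever LANG says.
printf 'b\nA\na\nB\n' | sort
# prints A, B, a, b (one per line)
```

Keep in mind that with LC_CTYPE set to C the tools are back to treating each byte as a character, so the multibyte problems shown earlier still apply.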

Some systems ship a special C.utf8 locale, which should offer UTF-8 support while retaining the familiar behavior of the C locale regarding sorting etc. However, this locale seems to not be universally available, and even where it is, its behavior may not be what you want. For example, it apparently sorts based purely on Unicode character value (as if it were just a bigger ASCII), so all the like-characters do not sort together anymore, and "a" and "ä" (or whatever character you may want to sort together with "a") may be very far apart in the collation sequence. Again, this may not be what you want.
But all in all, if the data you work with has certain characteristics and the C.utf8 locale is available, it may be good enough for you.

Finally, let's see what we can do if we do not wish to change the existing locale. This is something that should probably be considered the best thing to do, partly because localization isn't going away, and also because you may need to write scripts that will need to run under different locales, and their users are likely to expect locale-dependent outcomes from them.
The answer, at least as far as scripts are concerned, is something like: "don't depend on sorting order and on a specific expansion of character classes". This sounds easy, but it's hard to put into practice. The first rule is to be explicit, so rather than writing [A-D] write [ABCD]. Conversely, if you want to include the characters that are considered uppercase in your locale, even if you don't know them all, you can use the special POSIX character class [[:upper:]], which includes all the characters that your locale considers uppercase. This may be a good replacement for the old [A-Z], and should be portable to different locales without change (under the C locale, [A-Z] and [[:upper:]] match the same characters). A number of similar character classes are defined, like [[:lower:]], [[:digit:]], [[:alpha:]], [[:alnum:]], [[:blank:]], [[:space:]], and others. So use those when possible, and you should have a locale-independent way of matching the characters that you want:

$ echo 'abcd' | sed 's/[ABC]/X/g'
abcd
$ echo 'ABCD' | awk '/[[:lower:]]/{print "found a match"}'
$
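A few more uses of the same idea, written as a script fragment rather than a transcript. The is_word helper is hypothetical and only meant to show a POSIX class inside an anchored pattern.

```shell
# Replace every uppercase letter, whatever the current locale calls uppercase.
printf 'aBcD\n' | sed 's/[[:upper:]]/X/g'
# prints aXcX

# Spell out small sets explicitly instead of using a range.
printf 'abcd\n' | sed 's/[ABC]/X/g'
# prints abcd

# is_word STRING
# Hypothetical helper: succeed if STRING consists only of letters, digits
# and underscores, as defined by the current locale.
is_word() {
    printf '%s\n' "$1" | grep -q '^[[:alnum:]_][[:alnum:]_]*$'
}

is_word 'file_01' && echo 'ok'
# prints ok
```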

Greg's wiki is another very good source of information on dealing with locales in shell programs.

Filed under faq, shell, tips Tagged localization


12 Comments

arjun says:

April 29, 2010 at 19:53

zoug@orange:~$ locale -a
C
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
POSIX
zoug@orange:~$

i have sed version 4.2.1
- waldner says:
  
  April 29, 2010 at 19:57
  
  $ locale -a | grep en_IN
  en_IN
  en_IN.utf8
  $ echo 'abcd' | LC_ALL=en_IN sed 's/[A-C]/X/g'
  aXXd
  
  Sorry, I can't help you further :-) Obviously there must be something else on your system that produces those results.
  - arjun says:
    
    April 29, 2010 at 20:00
    
    yeah, its fine, never mind! thanks for the article! :)
    
    but its weird, i dont have a en_IN.utf8
    - waldner says:
      
      April 29, 2010 at 21:28
      
      That's normal, you don't have to have it. It just means that it wasn't built when libc was installed.
      If you are on a recent Linux, you should find the list of locales that were built in the file /etc/locale.gen. You can edit the file to add a line like en_IN.UTF-8 UTF-8 and then run locale-gen (as root) to generate all the locales. That should build all the listed locales including the en_IN.utf8 locale and then you should see it when you do locale -a.
arjun says:

April 29, 2010 at 04:08

why cant i reproduce that error? i have a LANG=en_IN
- waldner says:
  
  April 29, 2010 at 09:18
  
  Sorry, what is "that error"? There are a number of them. Also, LC_ALL has precedence over LANG.
  
  In any case, keep in mind that every locale is different, and the behavior shown in the article may not be reproducible in all locales. In the same way, some locales may exhibit behavior that is not described in the article.
  The point is not to illustrate the behavior of a specific locale, but rather to point out that people should not depend on any locale-dependent particular behavior they see.
  - arjun says:
    
    April 29, 2010 at 19:06
    
    im getting the output as it was expected. though i have the same char set as en_GB. take a look here.
    
    zoug@orange:~$ echo 'abcd' | sed 's/[A-C]/X/g'
    abcd
    zoug@orange:~$ echo 'ABCD' | awk '/[a-z]/{print "found a match"}'
    zoug@orange:~$
    zoug@orange:~$ LC_ALL=en_GB.utf8 perl -e 'use locale; print +(sort grep /\w/, map { chr } 0..255), "\n";'
    _0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
    zoug@orange:~$ LC_ALL=en_IN perl -e 'use locale; print +(sort grep /\w/, map { chr } 0..255), "\n";'
    _0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
    zoug@orange:~$
    
    but never mind, its fine!
    - waldner says:
      
      April 29, 2010 at 19:26
      
      You don't show what your default locale is (ie, the one in effect when you run the sed and awk commands). That's probably a locale that expands ranges differently from en_GB or en_IN. For what it's worth, with en_IN I get this:
      
      $ echo 'abcd' | LC_ALL=en_IN sed 's/[A-C]/X/g'
      aXXd
      
      But I want to stress again that it may perfectly be possible that you see different results, depending on your operating system, tools and environment.
      - arjun says:
        
        April 29, 2010 at 19:31
        
        yeah, there is something else, but never mind, its working as it should. :) probably some different bashrc files i guess.my default is en_IN
        
        zoug@orange:~$ LC_ALL=en_IN echo 'abcd' | sed 's/[A-C]/X/g'
        abcd
        
        BTW, thanks for this wonderful article!
        
        waldner says:
        
        April 29, 2010 at 19:36
        
        Sorry, maybe I wasn't clear. If you do
        
        LC_ALL=en_IN echo 'abcd' | sed 's/[A-C]/X/g'
        
        the effect of LC_ALL=en_IN is limited to the "echo" command. If you look at the examples in the article, you'll see that the assignment to LC_ALL always is before the command that should be influenced by the locale. So if you want to check what LC_ALL=en_IN does to sed, you have to do
        
        echo 'abcd' | LC_ALL=en_IN sed 's/[A-C]/X/g'
        
        I hope this is clearer.
        
        arjun says:
        
        April 29, 2010 at 19:42
        
        oh, yes, exactly. but something wrong on my side :D
        
        zoug@orange:~$ LC_ALL=en_IN echo 'abcd' | sed 's/[A-C]/X/g'
        abcd
        zoug@orange:~$ echo 'abcd' | LC_ALL=en_IN sed 's/[A-C]/X/g'
        abcd
        zoug@orange:~$
        
        waldner says:
        
        April 29, 2010 at 19:48
        
        Ok, as I said it may be that your system implements en_IN differently. Or it may also be that the en_IN locale isn't available on your system, so it's ignored and the default used. To see if the en_IN locale is available, you have to check if it appears in the list output by locale -a. It may also be that your sed is old and does not implement localization.
        Part of the weirdness of localization is exactly this imho.
