Notes on UTF-8 and locales

Notes on using UTF-8 and locales on Linux systems.

Paul Heinlein | March 23, 2005

For some time now, the default shell environments shipped with many Linux distributions have used UTF-8 locales (UTF-8 is an encoding of Unicode, and the two names are often used interchangeably). This can be a bit confusing, especially for those accustomed to the old-style ASCII sorting order.

The starting point for documentation on the issue is the locale(1) man page. The locale command issued without arguments will provide a summary of your current environment.

$ locale

Two of these environment variables have more impact on my day-to-day work than the others. LC_CTYPE specifies “character classification and case conversion,” which in practice also determines the character encoding programs assume for their input and output. LC_COLLATE influences sorting order.
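To see what LC_CTYPE implies in practice, it can help to ask the locale command which character encoding goes with a given locale. The following is a minimal sketch: the function name charmap_for is made up for illustration, and en_US.UTF-8 is only an assumed example that may not be installed on your system. LC_ALL is used because it overrides every individual LC_* variable.

```shell
# Hypothetical helper: print the character encoding (charmap) that a
# given locale name selects. LC_ALL overrides LC_CTYPE and LANG.
charmap_for() {
  LC_ALL="$1" locale charmap
}

charmap_for C             # an ASCII charmap; the exact name varies by system
charmap_for en_US.UTF-8   # UTF-8, assuming this locale is installed
```

If the second call complains or prints the same value as the first, the locale is probably not installed; locale -a lists the ones that are.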

Old-time sorting

If you’re accustomed to ASCII sorting, then the results of ls or sort might initially be confusing in a modern locale. Take a look at the difference the locale makes in these two directory listings. The first uses the old-time raw ASCII sort order. In the second, however, the locale “knows” that ‘C’ and ‘c’ are the same letter and that leading dots shouldn’t influence the sorting order.

$ LC_COLLATE="C" ls -a
.  ..  .CCC  .ccc  AAA  BBB  aaa  bbb
$ LC_COLLATE="en_US" ls -a
.  ..  aaa  AAA  bbb  BBB  .ccc  .CCC
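The same effect can be reproduced with sort, without creating any files. This is a small sketch rather than a definitive test: sort_names is a hypothetical helper, and en_US.UTF-8 is assumed to be installed (if it is not, the second call typically falls back to the C ordering). It sets LC_ALL instead of LC_COLLATE because LC_ALL, when present in the environment, takes precedence over LC_COLLATE.

```shell
# Hypothetical helper: sort four sample names under the given locale.
sort_names() {
  printf '%s\n' bbb AAA aaa BBB | LC_ALL="$1" sort
}

sort_names C             # → AAA BBB aaa bbb (one per line): raw byte order
sort_names en_US.UTF-8   # locale-aware order, if the locale is installed
```

If LC_COLLATE="C" appears to have no effect on your system, check whether LC_ALL is set; it silently wins.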

Since I prefer the old sorting order, the first item of business was to alter LC_COLLATE in my shell environment. It appears I could achieve my desired results by setting it to “POSIX” or “C.” (A null value only helps if LANG is unset as well, since an empty LC_COLLATE falls back to LANG.) I use “C” because that’s what the Fedora init scripts use.

# in my .profile script
export LC_COLLATE="C"

UTF-8 fonts in xterms

The full explanation of getting Unicode characters to display correctly in xterm windows is somewhat lengthy, but a quick-start recipe is pretty easy.
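One possible quick start, sketched under assumptions: start xterm in UTF-8 mode with its -u8 option and give it a font whose name ends in iso10646-1 (the Unicode registry) via -fn. The wrapper name utf8_xterm, the en_US.UTF-8 locale, and the particular misc-fixed font are illustrative choices, not requirements; substitute whatever locale and Unicode font your system provides.

```shell
# Hypothetical wrapper: launch xterm in UTF-8 mode with a Unicode
# (iso10646-1) version of the classic fixed font. The locale and font
# names are assumptions; adjust them to what is installed locally.
utf8_xterm() {
  LC_CTYPE=en_US.UTF-8 xterm -u8 \
    -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1' \
    "$@"
}
```

Running utf8_xterm from an existing X session opens the new window; any extra arguments are passed through to xterm.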

Testing UTF-8 support with GNU date

Here’s a little script that’ll print the locale-specific names of all the days and months for all the UTF-8 locales available on your system. It’ll allow you to see the locales for which you do and don’t have local font support.

for loc in $(locale -a | grep utf8 | sort); do
  echo "Locale: $loc"
  # Aug 1, 2004 was a Sunday, Aug 7 a Saturday
  for n in $(seq 1 7); do
    LANG="$loc" date +"%A (%a)" -d 2004/8/${n}
  done
  # the first day of each month in 2004
  for n in $(seq 1 12); do
    LANG="$loc" date +"%B (%b)" -d 2004/${n}/1
  done
done

You might also try saving the script’s output to a file and then viewing that file with a web browser. On many of my systems, the browsers have better UTF-8 support than xterm and its system font.
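When viewing the saved output in a browser, it may help to wrap it in a small HTML page that declares its encoding, so the browser does not have to guess. The sketch below is one way to do that; the function name to_html and the sample input are hypothetical.

```shell
# Hypothetical helper: wrap text from standard input in a minimal HTML
# page that declares UTF-8 as its encoding.
to_html() {
  printf '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
  printf '<title>Locale test</title></head>\n<body><pre>\n'
  # Escape the characters that are special in HTML.
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  printf '</pre></body></html>\n'
}

# Sample input; in practice, pipe the script's output in instead.
printf 'dimanche (dim.)\n' | to_html
```

Redirecting the result to a file, for example with > locale-test.html, gives something a browser can open directly.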

Useful links

A great starting place for UTF-8/Linux information is Markus Kuhn’s UTF-8 and Unicode FAQ for Unix/Linux. Markus is also the author of the helpful unicode(7) and utf-8(7) man pages that are found on many Linux systems.

Other helpful pages include Using UTF-8 with Gentoo, The Unicode HOWTO at the Linux Documentation Project, and Jan Stumpel’s UTF-8 on Linux.