Is Unicode case-insensitive?

Code:
$ echo $LANG

$ echo $LC_ALL

$ echo $LC_CTYPE
de_DE.UTF-8
$ echo $LC_COLLATE

( echo A ; echo b ; echo C ) | LC_COLLATE=de_DE.UTF-8 sort
A
b
C
$ ( echo A ; echo b ; echo C ) | sort
A
C
b

This is not very funny, and is one of the reasons why I do not use these normally.

But, the problem goes deeper. This is postgres, and here LC_COLLATE cannot be changed (because it builds all the indexes on it):
Code:
flowm=> select 'b' < 'A';
?column?
----------
f
(1 row)

flowm=> select 'b' < 'C';
?column?
----------
t
(1 row)
flowm=> show LC_COLLATE;
lc_collate 
-------------
de_DE.UTF-8
(1 row)

 
Ha, found it. It is simply de_De versus de_DE (or en_Us versus en_US). All the babble on the Internet just makes one stupid...

This was probably bogus. It seems to just not have found the name, and fallen back to whatever 'C' and then given the traditional result... investigating further...
 
  • Thanks
Reactions: a6h
I have
Code:
export LANG="en_US.UTF-8"
export LC_ALL=$LANG
export LC_COLLATE=$LANG
export LC_CTYPE=$LANG
export LC_MESSAGES=$LANG
export LC_MONETARY=$LANG
export LC_NUMERIC=$LANG
And
Code:
postgres:\
    :lang=en_US.UTF-8:\
    :setenv=LC_COLLATE=C:\
    :tc=default:
Note, with postgresql you can create databases without "conversion" and they are faster.
Why is default LC_COLLATE not "en_US.UTF-8" ?
A link,
Why not run,
Code:
initdb -E UTF8
If you know beforehand you won't need funny characters.

You probably know this but databases have Encoding,Collate & Ctype.
These are 3 settings.
Defaults are UTF8,C,C-UTF-8
 
I have
Code:
export LANG="en_US.UTF-8"
export LC_ALL=$LANG
export LC_COLLATE=$LANG
export LC_CTYPE=$LANG
export LC_MESSAGES=$LANG
export LC_MONETARY=$LANG
export LC_NUMERIC=$LANG
And
Code:
postgres:\
    :lang=en_US.UTF-8:\
    :setenv=LC_COLLATE=C:\
    :tc=default:
Yes, but the question is: do you have case-sensitive or case-insensitive sort?

Here, LC_COLLATE=en_US.UTF-8 gives case-insensitive sort (upper- and lowercase mixed), while LC_CTYPE=C (that is what we are all used to) gives case-sensitive sort, i.e. the uppercase filenames are on top of the ls listing.

And while I am not yet sure inhowfar that is a real problem, it certainly is a serious problem to suddenly have an essentially different sort order in data that may not even contain any "funny characters".

I do not find suitable documentation, but it seems Unicode would support case-sensitive as well as case-insensitive collations. So the most straightforward question is: how can one decide on which to use?

Note, with postgresql you can create databases without "conversion" and they are faster.
I know that I can do that.

Why is default LC_COLLATE not "en_US.UTF-8" ?A link,
Because I have this in /etc/rc.conf:
Code:
postgresql_initdb_flags="--data-checksums --encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8"

Why not run,
Code:
initdb -E UTF8
If you know beforehand you won't need funny characters.
Now that's exactly the issue. I do not yet know what might be the best approach. Applications can be internationalized (if one would bother to do that pile of work), so it would finally be the user who decides on the language. For now I know that some of the payload is certainly german, so I decided to give that a try.
 
Read this and weep or laugh: https://unicode.org/reports/tr10/
The answer is: It's complicated.
No, this is quite suitable.

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:
  1. alphabetic ordering
  2. diacritic ordering
  3. case ordering
A final level may be used for tie-breaking between strings not otherwise distinguished.

So, by standard, the case ordering is only the last level of sorting, i.e. uppercase and lowercase versions of a letter are sorted near each other like in a dictionary, and not all uppercase before all lowercase as in ASCII.

Furthermore, this page about Collation describes the ASCII/'C' sort order as something unusual and unconvenient:
So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters).
So, while unix folks are used to that kind of sorting almost since the teletype learned case distinction, normal people seem not to like it, and Unicode does probably not provide for it. And then, the apparently case-insensitive sorting where 'A' appears beneath 'a' would indeed be case sensitive (but only in the third grade), and appears to be the one that is considered correct nowadays.

The problem with this is, machines normally working in the 'C' locale produce seriously different directory listings than those in an Unicode locale, one cannot compare these quickly with a glance. Given in addition the possible alterability of (inter)national sorting conventions, it might be best to leave LC_COLLATE=C and only change this for applications which do concern.
The database is a more difficult matter, especially if it carries system data (like filesystem backups).
 
What does surprise me is that there are no comments on the matter to find. When doing a simple ls /usr/ports, one can see how much conventions have adapted to the ASCII collation: all payload is lowercase, and all things that a sysop might want to notice are uppercase. Makefile README Changes UPDATING.

For a database it would depend on what it carries:
Code:
                                  List of databases
   Name    |  Owner   | Encoding  |   Collate   |    Ctype    |   Access privileges   
-----------+----------+-----------+-------------+-------------+-----------------------
 bareos    | postgres | SQL_ASCII | C           | C           | 
 postgres  | postgres | UTF8      | de_DE.UTF-8 | de_DE.UTF-8 | 
 template0 | postgres | UTF8      | de_DE.UTF-8 | de_DE.UTF-8 | =c/postgres          +
           |          |           |             |             | postgres=CTc/postgres

Bacula/bareos does switch Unicode off entirely.
For blog-like applications where one might want to implement e.g. full-text-search I thought it best to enable Ctype and Collate for the mostly-used language, while Numeric and Currency can be left to the application to handle (and localized system Messages are always horrifying).
 
Back
Top