Is Unicode case-insensitive?

PMc · Jun 25, 2021

Code:

$ echo $LANG

$ echo $LC_ALL

$ echo $LC_CTYPE
de_DE.UTF-8
$ echo $LC_COLLATE

( echo A ; echo b ; echo C ) | LC_COLLATE=de_DE.UTF-8 sort
A
b
C
$ ( echo A ; echo b ; echo C ) | sort
A
C
b

This is not very funny, and is one of the reasons why I do not use these normally.

But, the problem goes deeper. This is postgres, and here LC_COLLATE cannot be changed (because it builds all the indexes on it):

Code:

flowm=> select 'b' < 'A';
?column?
----------
f
(1 row)

flowm=> select 'b' < 'C';
?column?
----------
t
(1 row)
flowm=> show LC_COLLATE;
lc_collate 
-------------
de_DE.UTF-8
(1 row)

Why is my PostgreSQL ORDER BY case-insensitive?

I have Postgres 9.4.4 running on Debian and I get the following ORDER BY behavior: veure_test=# show LC_COLLATE; lc_collate ------------- en_US.UTF-8 (1 row) veure_test=# SELECT

dba.stackexchange.com

Problems with sort order (UTF8 locales don't work) · Issue #216 · PostgresApp/PostgresApp

Hello, The pg_config points to a local directory that does not exists LOCALEDIR = /Applications/Postgres.app/Contents/Versions/9.3/share/locale Can you include it on further versions? ((enjoy)) cr

github.com

PMc · Jun 25, 2021

~~Ha, found it. It is simply de_De versus de_DE (or en_Us versus en_US). All the babble on the Internet just makes one stupid...~~

This was probably bogus. It seems to just not have found the name, and fallen back to whatever 'C' and then given the traditional result... investigating further...

Alain De Vos · Jun 25, 2021

I have

Code:

export LANG="en_US.UTF-8"
export LC_ALL=$LANG
export LC_COLLATE=$LANG
export LC_CTYPE=$LANG
export LC_MESSAGES=$LANG
export LC_MONETARY=$LANG
export LC_NUMERIC=$LANG

And

Code:

postgres:\
    :lang=en_US.UTF-8:\
    :setenv=LC_COLLATE=C:\
    :tc=default:

Note, with postgresql you can create databases without "conversion" and they are faster.
Why is default LC_COLLATE not "en_US.UTF-8" ?
A link,

23.3. Character Set Support

23.3. Character Set Support 23.3.1. Supported Character Sets 23.3.2. Setting the Character Set 23.3.3. Automatic Character Set Conversion Between Server and …

www.postgresql.org

Why not run,

Code:

initdb -E UTF8

If you know beforehand you won't need funny characters.

You probably know this but databases have Encoding,Collate & Ctype.
These are 3 settings.
Defaults are UTF8,C,C-UTF-8

ralphbsz · Jun 25, 2021

Read this and weep or laugh: https://unicode.org/reports/tr10/
The answer is: It's complicated.

PMc · Jun 25, 2021

Alain De Vos said:

I have

Code:

export LANG="en_US.UTF-8"
export LC_ALL=$LANG
export LC_COLLATE=$LANG
export LC_CTYPE=$LANG
export LC_MESSAGES=$LANG
export LC_MONETARY=$LANG
export LC_NUMERIC=$LANG

And

Code:

postgres:\
    :lang=en_US.UTF-8:\
    :setenv=LC_COLLATE=C:\
    :tc=default:

Yes, but the question is: do you have case-sensitive or case-insensitive sort?

Here, LC_COLLATE=en_US.UTF-8 gives case-insensitive sort (upper- and lowercase mixed), while LC_CTYPE=C (that is what we are all used to) gives case-sensitive sort, i.e. the uppercase filenames are on top of the ls listing.

And while I am not yet sure inhowfar that is a real problem, it certainly is a serious problem to suddenly have an essentially different sort order in data that may not even contain any "funny characters".

I do not find suitable documentation, but it seems Unicode would support case-sensitive as well as case-insensitive collations. So the most straightforward question is: how can one decide on which to use?

Alain De Vos said:
Note, with postgresql you can create databases without "conversion" and they are faster.

I know that I can do that.

Alain De Vos said:
Why is default LC_COLLATE not "en_US.UTF-8" ?A link,

Because I have this in /etc/rc.conf:

Code:

postgresql_initdb_flags="--data-checksums --encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8"

Alain De Vos said:
23.3. Character Set Support

23.3. Character Set Support 23.3.1. Supported Character Sets 23.3.2. Setting the Character Set 23.3.3. Automatic Character Set Conversion Between Server and …

www.postgresql.org

Why not run,

Code:

initdb -E UTF8

If you know beforehand you won't need funny characters.

Now that's exactly the issue. I do not yet know what might be the best approach. Applications can be internationalized (if one would bother to do that pile of work), so it would finally be the user who decides on the language. For now I know that some of the payload is certainly german, so I decided to give that a try.

Alain De Vos · Jun 25, 2021

Relativation time,

View: https://www.youtube.com/watch?v=UmCs12aUO20

PMc · Jun 25, 2021

ralphbsz said:
Read this and weep or laugh: https://unicode.org/reports/tr10/
The answer is: It's complicated.

No, this is quite suitable.

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:

alphabetic ordering

diacritic ordering

case ordering

A final level may be used for tie-breaking between strings not otherwise distinguished.

So, by standard, the case ordering is only the last level of sorting, i.e. uppercase and lowercase versions of a letter are sorted near each other like in a dictionary, and not all uppercase before all lowercase as in ASCII.

Furthermore, this page about Collation describes the ASCII/'C' sort order as something unusual and unconvenient:

So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters).

So, while unix folks are used to that kind of sorting almost since the teletype learned case distinction, normal people seem not to like it, and Unicode does probably not provide for it. And then, the apparently case-insensitive sorting where 'A' appears beneath 'a' would indeed be case sensitive (but only in the third grade), and appears to be the one that is considered correct nowadays.

The problem with this is, machines normally working in the 'C' locale produce seriously different directory listings than those in an Unicode locale, one cannot compare these quickly with a glance. Given in addition the possible alterability of (inter)national sorting conventions, it might be best to leave LC_COLLATE=C and only change this for applications which do concern.
The database is a more difficult matter, especially if it carries system data (like filesystem backups).

Alain De Vos · Jun 25, 2021

What do you think of:

Code:

CREATE DATABASE a3
WITH
  ENCODING='UTF8'
  LC_COLLATE='POSIX'
  LC_CTYPE='POSIX'
  TEMPLATE='template0';

PMc · Jun 25, 2021

What does surprise me is that there are no comments on the matter to find. When doing a simple ls /usr/ports, one can see how much conventions have adapted to the ASCII collation: all payload is lowercase, and all things that a sysop might want to notice are uppercase. Makefile README Changes UPDATING.

For a database it would depend on what it carries:

Code:

                                  List of databases
   Name    |  Owner   | Encoding  |   Collate   |    Ctype    |   Access privileges   
-----------+----------+-----------+-------------+-------------+-----------------------
 bareos    | postgres | SQL_ASCII | C           | C           | 
 postgres  | postgres | UTF8      | de_DE.UTF-8 | de_DE.UTF-8 | 
 template0 | postgres | UTF8      | de_DE.UTF-8 | de_DE.UTF-8 | =c/postgres          +
           |          |           |             |             | postgres=CTc/postgres

Bacula/bareos does switch Unicode off entirely.
For blog-like applications where one might want to implement e.g. full-text-search I thought it best to enable Ctype and Collate for the mostly-used language, while Numeric and Currency can be left to the application to handle (and localized system Messages are always horrifying).