Strange sort behavior compared to Linux

Hi everyone,

We found some strange behavior that we can't explain at the moment. On FreeBSD 10.2 we get this using sort:

Code:
[root@FBSD /]# setenv LC_ALL de_DE.UTF-8
[root@FBSD /]# sort test.list
Apfel
Orange
Zelt
Äpfel
Österreich

which, as far as we can tell, is simply the wrong order.

On Linux (an older Red Hat release), sorting the same list gives this:

Code:
[root@LINUX /]# export LC_ALL=de_DE.utf8
[root@LINUX /]# sort test.list
Apfel
Äpfel
Orange
Österreich
Zelt

which looks correct.
The same happens with the sort functions in, e.g., PHP.

Any help would be much appreciated, as this seems to be a serious problem if sorting fails at the OS level. But maybe we are just blind and someone can shed some light on this!

Many thanks in advance!
Jimmy
 
Can you point to any reference for what the proper sort order for the German language is? To me it looks like the FreeBSD implementation simply compares character code values, and both 'Ö' and 'Ä' have code values greater than any ASCII character.
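To illustrate the "character code values" guess, here is a minimal Python sketch. It is not what sort(1) does internally; it only shows that a plain code point comparison reproduces the FreeBSD output from the first post. The word list is typed in by hand to mirror test.list.

```python
# Word list mirroring test.list from the first post (typed in by hand).
words = ["Zelt", "Österreich", "Apfel", "Orange", "Äpfel"]

# sorted() on str compares Unicode code points; no locale rules are applied.
by_code_point = sorted(words)
print(by_code_point)

# Sorting the UTF-8 encoded bytes gives the same order for these words,
# because UTF-8 preserves code point order.
by_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
print(by_bytes)
```

Both lists come out as Apfel, Orange, Zelt, Äpfel, Österreich, since 'Ä' (U+00C4) and 'Ö' (U+00D6) lie above 'Z' (U+005A).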

Edit: I believe the difference may lie in the decomposition of the 'Ö' and 'Ä' characters. There are two ways to represent them in Unicode: either as the single precomposed characters 'Ö' and 'Ä', or as the composites 'O' + '¨' and 'A' + '¨' (base letter followed by a combining diaeresis).
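Whether decomposition plays any role here is only a guess at this point. As a small sketch of what the two representations look like, the following Python snippet builds both forms of "Äpfel" and converts between them with the standard unicodedata module.

```python
import unicodedata

precomposed = "\u00c4pfel"   # 'Ä' as a single code point (U+00C4)
decomposed = "A\u0308pfel"   # 'A' followed by combining diaeresis (U+0308)

# The two strings look the same when rendered but are not equal.
print(precomposed == decomposed)

# NFC composes, NFD decomposes; each maps one form onto the other.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", precomposed)
print(as_nfc == precomposed, as_nfd == decomposed)

# The UTF-8 byte lengths differ as well (6 versus 7 bytes).
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Running a file through NFC before sorting would rule this difference out as the cause.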
 
Sorting works fine here (and gives the order the OP considers correct); I just checked it.
What I see is that if a locale does not exist, csh silently does what the OP describes, whereas bash complains about the missing locale.
icecoke, have you checked whether that locale exists in /usr/share/locale/?
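Besides looking in /usr/share/locale/, one can ask the C library directly whether it accepts a locale name. A minimal Python sketch, with locale_available being a made-up helper name; note that it only tells you the name is accepted, not that the collation table behind it is meaningful.

```python
import locale

def locale_available(name):
    """Return True if setlocale() accepts `name` for LC_COLLATE."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)

print(locale_available("C"))
print(locale_available("de_DE.UTF-8"))
```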
 
kpa

Quoting Wikipedia (https://en.wikipedia.org/wiki/Alphabetical_order):

  • In German letters with umlaut (Ä, Ö, Ü) are treated generally just like their non-umlauted versions; ß is always sorted as ss. This makes the alphabetic order Arg, Ärgerlich, Arm, Assistent, Aßlar, Assoziation. For phone directories and similar lists of names, the umlauts are to be collated like the letter combinations "ae", "oe", "ue" because a number of German surnames appear both with umlaut and in the non-umlauted form with "e" (Müller/Mueller). This makes the alphabetic order Udet, Übelacker, Uell, Ülle, Ueve, Üxküll, Uffenbach.
There is also DIN 5007 variant 2, intended specifically for lists of names, which is described at https://de.wikipedia.org/wiki/Alphabetische_Sortierung; it corresponds to the second ordering example in the English Wikipedia quote above.

So my Linux example from the first post seems to be wrong, too. That kind of sorting seems to be more common in Austria and does not follow any DIN standard.

EDIT: After trying it on Linux with different words, it turns out to be DIN 5007 variant 1, not the Austrian order.
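The two DIN 5007 variants quoted above can be sketched as Python sort keys. This is a simplified illustration that covers only ä, ö, ü and ß, not a full implementation of the standard; the function names are made up, and ties between equal keys are broken by plain code point order, which is my own assumption.

```python
import unicodedata

_UMLAUT_TO_BASE = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
_UMLAUT_TO_E = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def _fold(word):
    # NFC makes sure umlauts are single code points; casefold() lowercases
    # and also turns 'ß' into 'ss'.
    return unicodedata.normalize("NFC", word).casefold()

def din5007_1_key(word):
    # Variant 1: umlauts sort like their base letters.
    return (_fold(word).translate(_UMLAUT_TO_BASE), word)

def din5007_2_key(word):
    # Variant 2 (name lists): umlauts sort like "ae", "oe", "ue".
    return (_fold(word).translate(_UMLAUT_TO_E), word)

words = ["Zelt", "Österreich", "Apfel", "Orange", "Äpfel"]
print(sorted(words, key=din5007_1_key))

names = ["Uffenbach", "Üxküll", "Ueve", "Ülle", "Uell", "Übelacker", "Udet"]
print(sorted(names, key=din5007_2_key))
```

With the word list from the first post, the variant 1 key gives Apfel, Äpfel, Orange, Österreich, Zelt, and the variant 2 key reproduces the Udet ... Uffenbach order from the Wikipedia quote.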

aragats

Could you post your output here? And which environment did you use?

The locale files exist on the server. The output is the same with bash as in the earlier test.

Is there a way to modify this ordering? Maybe by creating our own locale directory?
 
Here is my output (I'm on FreeBSD 11-CURRENT, though):
Code:
root@dendrobates:~ # setenv LC_ALL de_DE.UTF-8
root@dendrobates:~ # sort /tmp/test.list
Apfel
Äpfel
Orange
Österreich
Zelt
 
aragats Thanks! I just installed a vanilla 10.2-RELEASE and it failed again, so it was not some later configuration change that broke this. I will set up 10.3 and 11 to check whether this changed between those releases, and I will follow up here.
 
I confirm it does not work properly in FreeBSD 10.3.
icecoke, see the difference between the locales in the different builds:
FreeBSD 10.3:
Code:
root@freebsd10:~ # ls -l /usr/share/locale/de_DE.UTF-8/
total 8
lrwxr-xr-x  1 root  wheel  28 25 Mär 02:18 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel  17 25 Mär 02:18 LC_CTYPE -> ../UTF-8/LC_CTYPE
lrwxr-xr-x  1 root  wheel  30 25 Mär 02:18 LC_MESSAGES -> ../de_DE.ISO8859-1/LC_MESSAGES
-r--r--r--  1 root  wheel  36 25 Mär 02:18 LC_MONETARY
lrwxr-xr-x  1 root  wheel  29 25 Mär 02:18 LC_NUMERIC -> ../de_DE.ISO8859-1/LC_NUMERIC
-r--r--r--  1 root  wheel  370 25 Mär 02:18 LC_TIME
FreeBSD 11:
Code:
root@dendrobates:~ # ls -l /usr/share/locale/de_DE.UTF-8/
total 11
lrwxr-xr-x  1 root  wheel  25 13 Mai  07:07 LC_COLLATE -> ../en_US.UTF-8/LC_COLLATE
lrwxr-xr-x  1 root  wheel  23 13 Mai  07:07 LC_CTYPE -> ../en_US.UTF-8/LC_CTYPE
-r--r--r--  1 root  wheel  148 13 Mai  07:07 LC_MESSAGES
lrwxr-xr-x  1 root  wheel  26 13 Mai  07:07 LC_MONETARY -> ../sl_SI.UTF-8/LC_MONETARY
lrwxr-xr-x  1 root  wheel  25 13 Mai  07:07 LC_NUMERIC -> ../tr_TR.UTF-8/LC_NUMERIC
-r--r--r--  1 root  wheel  409 13 Mai  07:07 LC_TIME
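The same check can be scripted. A minimal Python sketch (collate_target is a made-up helper name) that reports where a locale's LC_COLLATE symlink points and which file it finally resolves to:

```python
import os

def collate_target(locale_dir):
    """Return (symlink text or None, fully resolved path) for LC_COLLATE."""
    path = os.path.join(locale_dir, "LC_COLLATE")
    # readlink() gives the literal link text, e.g. "../la_LN.US-ASCII/LC_COLLATE".
    link = os.readlink(path) if os.path.islink(path) else None
    # realpath() follows the whole chain of symlinks.
    return link, os.path.realpath(path)

print(collate_target("/usr/share/locale/de_DE.UTF-8"))
```

On the 10.3 listing above this would report ../la_LN.US-ASCII/LC_COLLATE as the link text, i.e. a plain ASCII collation table behind a UTF-8 locale.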
 
aragats Yep, I found the same here. And if you run head on the files, you can see that they have different formats:

FreeBSD 10.2/10.3:
Code:
[root@tulobu ~]# head /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
1.2



 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™šžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ðñòóôõö÷øùúûüýþÿ


FreeBSD 11:
Code:
root@fbsd11:~ # head /usr/share/locale/en_US.UTF-8/LC_COLLATE
BSD 1.0
"…ÿÿ-Iÿÿÿ      !


"
 (,8&B97+-.4 <  $*5-I.J/K0L1M2N3O4Q5R6S
                                       '
@@v@@™N"@@c@@5c@@3c@@9c@@E˜@@þ˜@@ü˜@@˜@@W@@@@@@@@@@@÷5@@(D@@MD@@KD@@QD@@Wp@@ˆLp@@77@7c@7@7D@7Z@7p@7W@Ww@@W~@y7
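Judging only from the first line of each head output above ("1.2" versus "BSD 1.0"), the two formats can be told apart by their leading bytes. A rough Python sketch under that assumption; this is not based on the format specification, and collate_format is a made-up name.

```python
def collate_format(path):
    """Guess the LC_COLLATE format from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"BSD 1.0"):
        return "new"      # as seen on FreeBSD 11 in this thread
    if head.startswith(b"1.2"):
        return "old"      # as seen on FreeBSD 10.2/10.3 in this thread
    return "unknown"

# Example call (path as used in this thread):
# print(collate_format("/usr/share/locale/en_US.UTF-8/LC_COLLATE"))
```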

So, does anyone have an idea HOW these files are created? Then I might be in a position to convert or create a single LC_COLLATE file that gives the correct sort order on releases before 11.0.
 
Well, since LC_COLLATE is a symlink, instead of creating another one, you can simply change it to point to the same original file as in FreeBSD 11. i.e. ../en_US.UTF-8/LC_COLLATE.
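Before changing the symlink, it may be worth checking that the proposed target really resolves to a different, existing file; otherwise re-pointing changes nothing. A small Python sketch of that check, with repoint_would_help being a made-up helper name:

```python
import os

def repoint_would_help(link_path, new_target):
    """True if new_target resolves to an existing file other than the one
    link_path already resolves to. new_target is relative to the link's
    directory, e.g. "../en_US.UTF-8/LC_COLLATE"."""
    base = os.path.dirname(link_path)
    current = os.path.realpath(link_path)
    candidate = os.path.realpath(os.path.join(base, new_target))
    return os.path.exists(candidate) and candidate != current

print(repoint_would_help("/usr/share/locale/de_DE.UTF-8/LC_COLLATE",
                         "../en_US.UTF-8/LC_COLLATE"))
```

If this prints False, the new target either does not exist or ends up at the same file as before.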
 
aragats I would, if it existed. On 10.2/10.3 it is just a symlink to the same target, so there is no correct LC_COLLATE to be found:

Code:
[root@tulobu ~]# ll /usr/share/locale/en_US.UTF-8/
total 10
drwxr-xr-x    2 root  wheel    8 26 Jan  2014 ./
drwxr-xr-x  176 root  wheel  176 26 Jan  2014 ../
lrwxr-xr-x    1 root  wheel   28 26 Jan  2014 LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE

In the meantime I found the tool that seems to be used to create LC_COLLATE files: colldef. So I guess I need to find the .src files somewhere in the base system, too ;)
I'm not sure how much effort this will take.
 
So, in the meantime I experimented a lot and found out how the LC_COLLATE files in FreeBSD <11.0 are built and which sources are used for them.

The problem is that the 'old' colldef does not fully support multibyte characters. For example, the substitution needed for DIN5007v2 is not possible with multibyte characters.

What I did is create a colldef source file based on the ISO8859-15 character set that also handles UTF-8 input, so that it can order such characters. The result is 'nearly' a DIN5007v1 order, except that in UTF-8 input ß is not substituted with 'ss' but is just treated as a single small 's'. That is only a minor issue for the German language.
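To make the single-byte versus multibyte point concrete, here is a short Python sketch that prints both encodings of a few characters. It only illustrates why each letter needs an ISO8859-15 byte and a UTF-8 byte sequence in such a source file; byte_forms is a made-up helper name.

```python
def byte_forms(ch):
    """Return (ISO8859-15 bytes, UTF-8 bytes) for a single character."""
    return ch.encode("iso8859-15"), ch.encode("utf-8")

for ch in "ÄÖÜäöüß€":
    iso, utf8 = byte_forms(ch)
    # Print in the \xNN notation that colldef source files use.
    print(ch,
          "".join("\\x%02x" % b for b in iso),
          "".join("\\x%02x" % b for b in utf8))
```

For example, 'Ä' is the single byte \xc4 in ISO8859-15 but the two bytes \xc3\x84 in UTF-8, and '€' is \xa4 versus the three bytes \xe2\x82\xac.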

So if someone on FreeBSD 10.x wants to get close to the collation of FreeBSD 11.x and DIN5007v1, use this source to create your own UTF-8 LC_COLLATE file:

Code:
# ISO8859-15/UTF-8 (ISO8859-15 character set)
#
# order (DIN5007v1 like) for characters from the ISO8859-15 set in unicode to be used in FreeBSD <11.0
# no map used
#
# $FreeBSD: colldef/de_DE.UTF-8.pre11.src james t. koerting $
#
#
# substitution for ISO8859-15 input only - DIN5007 v1 + v2
substitute \xdf with "ss"
#
# v2 Order - ISO8859-15 input only. So disable it.
#substitute \xc4 with "Ae"
#substitute \xd6 with "Oe"
#substitute \xdc with "Ue"
#substitute \xe4 with "ae"
#substitute \xf6 with "oe"
#substitute \xfc with "ue"


order \
# controls
#        <NU>;...;<US>;<PA>;...;<AC>;\
         \x00;...;\x1f;\x80;...;\x9f;\
         \xc2\x80;\xc2\x81;\xc2\x82;\xc2\x83;\xc2\x84;\xc2\x85;\xc2\x86;\xc2\x87;\xc2\x88;\xc2\x89;\xc2\x8a;\xc2\x8b;\xc2\x8c;\xc2\x8d;\xc2\x8e;\xc2\x8f;\
         \xc2\x90;\xc2\x91;\xc2\x92;\xc2\x93;\xc2\x94;\xc2\x95;\xc2\x96;\xc2\x97;\xc2\x98;\xc2\x99;\xc2\x9a;\xc2\x9b;\xc2\x9c;\xc2\x9d;\xc2\x9e;\xc2\x9f;\
#
#        <NS>;           <SP>;!;   <!I>;           \";  <<<>;           </>/>>;         <Nb>;\
         (\xa0,\xc2\xa0);\x20;\x21;(\xa1,\xc2\xa1);\x22;(\xab,\xc2\xab);(\xbb,\xc2\xbb);\x23;\
#        EUR;            <Ct>;           <DO>;<Pd>;           <Ye>;\
         (\xa4,\xe2\x82\xac);(\xa2,\xc2\xa2);\x24;(\xa3,\xc2\xa3);(\xa5,\xc2\xa5);\
#        %;&;';\(;\);*;+;<+->;         <-:>;           <*X>;           \,;  <-->;           -;   .;   /;\
         \x25;...;\x2b;(\xb1,\xc2\xb1);(\xf7,\xc3\xb7);(\xd7,\xc3\x97);\x2c;(\xad,\xc2\xad);\x2d;\x2e;\x2f;\
# digits
#        0;(1,<1S>);         (2,<2S>);         (3,<3S>);         4;...;9;\
         0;(1,\xb9,\xc2\xb9);(2,\xb2,\xc2\xb2);(3,\xb3,\xc2\xb3);4;...;9;\
#
#        :;\;;\<;=;>;?;<?I>;           <SE>;           <PI>;           <Co>;           <Rg>;           <At>;\
         \x3a;...;\x3f;(\xbf,\xc2\xbf);(\xa7,\xc2\xa7);(\xb6,\xc2\xb6);(\xa9,\xc2\xa9);(\xae,\xc2\xae);\x40;\
# capital
#        (A,   <A'>,<A!>,<A/>>,<AA>,<A?>);                                            (<A:>,<AE>);\
         (\x41,\xc1,\xc0,\xc2,\xc5,\xc3,\xc3\x80,\xc3\x81,\xc3\x82,\xc3\x83,\xc3\x85,\xc4,\xc6,\xc3\x86,\xc3\x84);\
#        B;   (C,   <C,>);         (D,   <D->);         (E,   <E'>,<E!>,<E/>>,<E:>);\
         \x42;(\x43,\xc7,\xc3\x87);(\x44,\xd0,\xc3\x90);(\x45,\xc9,\xc8,\xca,\xcb,\xc3\x88,\xc3\x89,\xc3\x8a,\xc3\x8b);\
#        F;   G;   H;   (I,   <I'>,<I!>,<I/>>,<I:>);\
         \x46;\x47;\x48;(\x49,\xcd,\xcc,\xce,\xcf,\xc3\x8d,\xc3\x8c,\xc3\x8e,\xc3\x8f);\
#        J;   ...;M;   (N,<N?>);            (O,   <O'>,<O!>,<O/>>,<O?>,<O//>,<O:>,<OE>);\
         \x4a;...;\x4d;(\x4e,\xd1,\xc3\x91);(\x4f,\xd3,\xd2,\xd4,\xd5,\xd8,\xc3\x93,\xc3\x92,\xc3\x94,\xc3\x95,\xc3\x98,\xd6,\xbc,\xc3\x96,\xc5\x92);\
#        P;   ...; R;   (S,   <S<>);         T;   (U,<U'>,<U!>,<U/>>,<U:>);\
         \x50;\x51;\x52;(\x53,\xa6,\xc5\xa0);\x54;(\x55,\xda,\xd9,\xdb,\xc3\x9a,\xc3\x99,\xc3\x9b,\xdc,\xc3\x9c);\
#        V;   W;   X;   (Y,   <Y'>,<Y:>);                  (Z,   <Z<>);\
         \x56;\x57;\x58;(\x59,\xdd,\xbe,\xc3\x9d,\xc5\xb8);(\x5a,\xb4,\xc5\xbd);\
#        <TH>;\
         (\xde,\xc3\x9e);\
#
#        [;\\;];^;_;   <'m>;`;\
         \x5b;...;\x5f;(\xaf,\xc2\xaf);\x60;\
# small
#        (a,   <a'>,<a!>,<a/>>,<aa>,<a?>,<a:>,<ae>)\
         (\x61,\xe1,\xe0,\xe2,\xe5,\xe3,\xc3\xa1,\xc3\xa0,\xc3\xa2,\xc3\xa5,\xc3\xa3,\xe4,\xe6,\xc3\xa4,\xc3\xa6);\
#        b;   (c,   <c,>);         (d,   <d->);         (e,   <e'>,<e!>,<e/>>,<e:>);\
         \x62;(\x63,\xe7,\xc3\xa7);(\x64,\xf0,\xc3\xb0);(\x65,\xe9,\xe8,\xea,\xeb,\xc3\xa9,\xc3\xa8,\xc3\xaa,\xc3\xab);\
#        f;   g;   h;   (i,   <i'>,<i!>,<i/>>,<i:>);\
         \x66;\x67;\x68;(\x69,\xed,\xec,\xee,\xef,\xc3\xad,\xc3\xac,\xc3\xae,\xc3\xaf);\
#        j;   ...;m;   (n,   <n?>);         (o,   <o'>,<o!>,<o/>>,<o?>,<o//>,<o:>,<oe>);\
         \x6a;...;\x6d;(\x6e,\xf1,\xc3\xb1);(\x6f,\xf3,\xf2,\xf4,\xf5,\xf8,\xc3\xb3,\xc3\xb2,\xc3\xb4,\xc3\xb5,\xc3\xb8,\xf6,\xbd,\xc3\xb6,\xc5\x93);\
#        p;   ...; r;   (s,   <s<>);                  t;   (u,   <u'>,<u!>,<u/>>,<u:>);\
         \x70;\x71;\x72;(\x73,\xa8,\xc5\xa1,\xc3\x9f);\x74;(\x75,\xfa,\xf9,\xfb,\xc3\xba,\xc3\xb9,\xc3\xbb,\xfc,\xc3\xbc);\
#        v;   w;   x;   (y,   <y'>,<y:>);(z,   <z<>);\
         \x76;\x77;\x78;(\x79,\xfd,\xff,\xc3\xbd,\xc3\xbf);(\x7a,\xb8,\xc5\xbe);\
#        <th>;\
         (\xfe,\xc3\xbe);\
#
#        \{;  <NO>;           |;   \};  ~;   <.M>;           <DG>;           <My>;           <DT>;\
         \x7b;(\xac,\xc2\xac);\x7c;\x7d;\x7e;(\xb7,\xc2\xb7);(\xb0,\xc2\xb0);(\xb5,\xc2\xb5);\x7F;\
# remains
#        <-a>;<-o>
         (\xaa,\xc2\xaa);(\xba,\xc2\xba)

I'm sure I have missed something here or made a mistake in trying to create the DIN5007v1 order, so just send me a diff or similar if you find an error.
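For anyone wanting to try the source above, a rough Python sketch of how it could be compiled. It assumes the colldef(1) from FreeBSD 10.x with its -o option for the output file; the file names are placeholders of mine, and where the resulting file should be installed is left to you.

```python
import subprocess

def colldef_command(src, out):
    """Build the argument list for compiling a colldef source file."""
    # -o names the output file; the source file is the last argument.
    return ["colldef", "-o", out, src]

def build(src, out):
    # Raises CalledProcessError if colldef reports a syntax error.
    subprocess.run(colldef_command(src, out), check=True)

# Hypothetical usage on a FreeBSD 10.x machine:
# build("de_DE.UTF-8.pre11.src", "LC_COLLATE")
print(" ".join(colldef_command("de_DE.UTF-8.pre11.src", "LC_COLLATE")))
```

Keeping a copy of the original LC_COLLATE symlink before replacing anything makes it easy to go back.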
 