Collating element in RE doesn't work

From re_format(7)
Within a bracket expression, a collating element (a character, a multi-
character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in `[.' and `.]' stands for
the sequence of characters of that collating element. The sequence is a
single element of the bracket expression's list. A bracket expression
containing a multi-character collating element can thus match more than
one character, e.g. if the collating sequence includes a `ch' collating
element, then the RE `[[.ch.]]*c' matches the first five characters of
`chchcc'.
Code:
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[.chte.]]*'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[.chte.]+'
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[chte]+'
    Both give the SAME output and the same (colored) matches.

The same goes for an equivalence class:
Code:
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[=chte=]]+'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[=chte=]+'

All the same result! Is this a bug?
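
One thing that can be checked directly: without the outer pair of brackets, `[.chte.]` and `[=chte=]` are ordinary bracket expressions whose lists happen to contain `.` or `=` next to `c`, `h`, `t` and `e`. The sketch below assumes a grep that supports `-E` and `-o` (GNU and BSD grep both do) and forces the C locale so the result does not depend on the local collation setup. The variable names are only illustrative.

```shell
#!/bin/sh
# Assumption: grep supports -E and -o; LC_ALL=C keeps the run locale-independent.
input='lhhhocate chchccccclulzc dlen'

with_dots=$(printf '%s\n' "$input" | LC_ALL=C grep -Eo '[.chte.]+')
with_equals=$(printf '%s\n' "$input" | LC_ALL=C grep -Eo '[=chte=]+')
plain=$(printf '%s\n' "$input" | LC_ALL=C grep -Eo '[chte]+')

# The input contains no '.' or '=', so all three find the same matches.
[ "$with_dots" = "$plain" ] && [ "$with_equals" = "$plain" ] && echo "same matches"

# A literal dot or equals sign is matched as well, which shows that '.' and '='
# are plain members of the list when the outer brackets are missing.
dot_match=$(printf 'a.b\n' | LC_ALL=C grep -Eo '[.chte.]+')
eq_match=$(printf 'a=b\n' | LC_ALL=C grep -Eo '[=chte=]+')
printf 'dot: %s, equals: %s\n' "$dot_match" "$eq_match"
```

So the identical output of the single-bracket forms is expected and says nothing about collating elements. Only the double-bracket forms `[[. .]]` and `[[= =]]` ask for them.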
 
Please note that the word "collating" must not be taken literally here. It's a language/encoding thing: in some languages certain letter combinations are part of the alphabet and are treated as single letters (which is important for sorting, etc.). So the current set of these "collating elements" is language-specific. It can be changed via the LC_COLLATE environment variable.
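
The effect of the collation locale on sorting can be seen with a small sketch. It assumes a POSIX sort(1). The "C" locale always exists and compares byte values. `en_US.UTF-8` is only an example name and may not be installed on a given system (`locale -a` lists what is available). LC_ALL is used here because, when set, it overrides LC_COLLATE.

```shell
#!/bin/sh
# Assumption: POSIX sort(1); the sample words are made up for illustration.
words=$(printf 'b\nA\na\nB\n')

# In the C locale, uppercase ASCII letters sort before lowercase ones.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort | tr '\n' ' ')
printf 'C locale: %s\n' "$c_sorted"    # A B a b

# With a language-specific locale the order may differ (upper and lower case
# are often interleaved). An unavailable locale name usually falls back to the
# C behaviour, possibly with a warning.
printf '%s\n' "$words" | LC_ALL=en_US.UTF-8 sort
```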

re_format(7) contains at least two errors:
  • It fails to mention that the 'ch' example is for Spanish. It requires LC_COLLATE to be set to a specific value (like "es_ES.UTF-8"), and
  • It is outdated because the Spanish alphabet was redefined and 'ch' is not considered a "collating element" anymore.

That said, neither collating elements nor equivalence classes seem to be taken into account in regular expressions:
Code:
% (export LC_ALL=de_DE.ISO8859-1; echo Motörhead | sed -E 's/[[=o=]]/_/g')
M_törhead
% (export LC_ALL=hu_HU.ISO8859-2; echo ty | grep -E '^[s-u]*$')
%
 
Thank you for the clarification.
I really had no idea that collating had to do with language/encoding.
 