Collating element in RE doesn't work

Seeker · Jan 9, 2014

Within a bracket expression, a collating element (a character, a multi-
character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in `[.' and `.]' stands for
the sequence of characters of that collating element. The sequence is a
single element of the bracket expression's list. A bracket expression
containing a multi-character collating element can thus match more than
one character, e.g. if the collating sequence includes a `ch' collating
element, then the RE `[[.ch.]]*c' matches the first five characters of
`chchcc'.

Code:

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[.chte.]]*'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[.chte.]+'
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[chte]+'
    Both commands give the same output and the same (colored) matches.

The same goes for an equivalence class:

Code:

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[=chte=]]+'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[=chte=]+'

All the same result! Is this a bug?
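
A possible explanation for the single-bracket forms, offered as a sketch rather than a confirmed diagnosis: `[.` and `[=` are only special inside an enclosing bracket expression. Written on their own, `[.chte.]` and `[=chte=]` are ordinary bracket expressions whose lists also contain `.` or `=`. The inputs 'a.b' and 'a=b' below are made up for illustration.

```shell
# '[.chte.]' on its own is a plain bracket expression: the set { . c h t e }.
dot_match=$(echo 'a.b' | grep -Eo '[.chte.]+')

# '[=chte=]' on its own is likewise the set { = c h t e }.
eq_match=$(echo 'a=b' | grep -Eo '[=chte=]+')

# The plain set has no '.', so no line of 'a.b' matches and the count is 0.
# grep exits non-zero when nothing matches, hence the '|| true'.
plain_count=$(echo 'a.b' | grep -Ec '[chte]+' || true)

echo "dot_match=$dot_match eq_match=$eq_match plain_count=$plain_count"
```

So the outputs only look identical because the test string contains no `.` or `=`.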

worldi · Jan 11, 2014

Please note that the word collating must not be taken literally here. It's a language/encoding thing: in some languages combinations of certain letters are part of the alphabet and are treated as single letters (which is important for sorting, etc.). So the current set of these "collating elements" is language specific. It can be changed via the LC_COLLATE environment variable.
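
Sorting is probably the easiest way to see collation at work outside of regular expressions. A minimal sketch, assuming a POSIX shell with sort and tr; the four-letter input is made up. Note that LC_ALL, when set, overrides LC_COLLATE.

```shell
# The C/POSIX locale collates by byte value, so uppercase sorts before lowercase.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr '\n' ' ')
echo "C locale order: $c_sorted"

# Show which collation settings the current shell has (if any).
echo "LC_COLLATE=${LC_COLLATE:-unset} LC_ALL=${LC_ALL:-unset}"
```

Running the same sort with LC_ALL set to a language locale installed on your system will typically give a different order; which one depends on that locale's definition.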

re_format() contains at least two errors:

It fails to mention that the 'ch' example is for Spanish. It requires LC_COLLATE to be set to a specific value (like "es_ES.UTF-8"), and
It is outdated because the Spanish alphabet was redefined and 'ch' is not considered a "collating element" anymore.

That said, neither collating elements nor equivalence classes seem to be taken into account for regular expressions:

Code:

% (export LC_ALL=de_DE.ISO8859-1; echo Motörhead | sed -E 's/[[=o=]]/_/g')
M_törhead
% (export LC_ALL=hu_HU.ISO8859-2; echo ty | grep -E '^[s-u]*$')
%
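
To repeat this kind of check with other locales, a small helper can be handy. This is only a sketch: try_match is a hypothetical name, and the calls below use the C locale because it is the only one guaranteed to exist everywhere. If the named locale is not installed, grep will usually fall back to C behaviour without complaining, so check which locales you actually have before drawing conclusions.

```shell
# Hypothetical helper: report whether grep -E matches a string under a locale.
# $1 = locale name, $2 = pattern, $3 = input text
try_match() {
    if printf '%s\n' "$3" | LC_ALL="$1" grep -Eq -- "$2" 2>/dev/null; then
        echo "match"
    else
        echo "no match"
    fi
}

# In the C locale the range s-u is just s, t, u, so 'y' falls outside it.
c_ty=$(try_match C '^[s-u]*$' 'ty')
c_tu=$(try_match C '^[s-u]*$' 'tu')
echo "ty: $c_ty, tu: $c_tu"
```

Replace C with hu_HU.ISO8859-2 (or whatever your system calls it) to reproduce the test above.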

Seeker · Jan 18, 2014

Thank you for the clarification.
I really had no idea that collating had to do with language/encoding.
