C Low level str to uppercase?

aragats · Nov 7, 2018

olli@ said:
the cases that you mentioned cannot happen with the wide character API as defined by ISO C

So, what's going to happen with ß when used through that API?
We have the same problem in Eastern Armenian: և becomes ԵՎ when upper-cased.

olli@ · Nov 7, 2018

aragats said:
So, what's going to happen with ß when used through that API?
We have the same problem in Eastern Armenian: և becomes ԵՎ when upper-cased.

Good question. I haven't actually tried it, but the manual page says (the important part is the final clause): “If the argument is a lower-case letter, the towupper(3) function returns the corresponding upper-case letter if there is one; otherwise the argument is returned unchanged.”
So, if the system supports the newest Unicode version that has the upper-case “ẞ”, that one will be returned. Otherwise, the lower-case “ß” is returned unchanged. There is no way it can return two characters (“SS”). I guess you'll have to use a third-party library if you need to perform conversions that can change the number of characters.
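
To make the one-in, one-out point concrete, here is a minimal sketch of upper-casing a wide string through that API. The name wcs_upper_inplace is made up for this example and is not part of any standard library. What towupper() returns for ß depends on the locale and the libc, so the sketch makes no claim about that character; it only shows that the length cannot change.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-case a wide string in place, one wide character at a time.
 * towupper() maps one wint_t to one wint_t, so the string length is
 * the same before and after. Returns that length.
 * (wcs_upper_inplace is a hypothetical helper, not a libc function.) */
size_t wcs_upper_inplace(wchar_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; s[n] != L'\0'; n++)
        s[n] = (wchar_t)towupper((wint_t)s[n]);
    return n;
}
```

Calling it on a buffer holding "straße" leaves six wide characters in the buffer whatever the locale does with ß; getting "STRASSE" (seven characters) out of it is impossible by construction.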

yuripv · Nov 7, 2018

aragats said:
So, what's going to happen with ß when used through that API?
We have the same problem in Eastern Armenian: և becomes ԵՎ when upper-cased.

Looks like it's the following entry in UnicodeData.txt (which we will hopefully use as a source for our UTF-8 ctype maps soon):

Code:

0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So there's no *simple* upper/lower mapping for it, and the simple mappings are what towupper/towlower use.

Same for ß:

Code:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
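
For anyone who wants to check such records themselves, here is a rough sketch of reading the simple upper-case field. It assumes the usual UnicodeData.txt layout of 15 semicolon-separated fields, with the simple upper-case mapping in field 12 (counting from 0). The names ucd_field and ucd_simple_upper are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer to the start of field `index` (0-based) of a
 * semicolon-separated record, or NULL if the record is too short.
 * strtok() is avoided on purpose: it would skip the empty fields. */
static const char *ucd_field(const char *line, int index)
{
    while (index-- > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return NULL;
        line++;                 /* step past the separator */
    }
    return line;
}

/* Simple upper-case mapping for one UnicodeData.txt record.
 * Returns the mapped code point, the record's own code point if
 * field 12 is empty, or -1 if the record cannot be parsed. */
long ucd_simple_upper(const char *line)
{
    assert(line != NULL);
    char *end;
    long self = strtol(line, &end, 16);
    if (end == line || *end != ';')
        return -1;              /* no leading hex code point */

    const char *f = ucd_field(line, 12);
    if (f == NULL)
        return -1;              /* fewer than 13 fields */

    long up = strtol(f, &end, 16);
    return (end == f) ? self : up;   /* empty field: unchanged */
}
```

Fed the two records above, this returns 0x0587 and 0x00DF, i.e. the characters themselves, which matches "returned unchanged" in the towupper(3) description.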

aragats · Nov 7, 2018

There is a separate file for special cases (SpecialCasing.txt)!

Code:

The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings
Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc).

yuripv · Nov 7, 2018

aragats said:

There is a separate file for special cases (SpecialCasing.txt)!

Code:

The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings
Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc).

Indeed, but as olli@ mentioned, towupper/towlower can only return a single character (per the POSIX description), so we have to use the simple mappings there. I guess there are libraries (ICU?) that already provide the means for proper case conversion, but I didn't really look into it.
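
Just to illustrate what a length-changing conversion has to look like, here is a toy sketch, not a substitute for a real library. It works on arrays of code points, uses a two-entry table that follows the SpecialCasing.txt entries for U+00DF and U+0587 as I remember them, and falls back to ASCII-only upper-casing where a real implementation would use the simple mappings. The names special_upper, SPECIAL_UPPER and full_upper are all invented here. Note that the Unicode default for և appears to be U+0535 U+0552, so the ԵՎ form mentioned earlier in the thread would presumably need a language-specific override.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One full-case entry: a code point that upper-cases to several. */
struct special_upper {
    uint32_t from;
    uint32_t to[3];
    size_t   len;
};

/* Tiny illustrative table; a real one would be generated from
 * SpecialCasing.txt. Values are quoted from memory. */
static const struct special_upper SPECIAL_UPPER[] = {
    { 0x00DF, { 0x0053, 0x0053 }, 2 },  /* sharp s -> "SS" */
    { 0x0587, { 0x0535, 0x0552 }, 2 },  /* ech-yiwn ligature -> ECH YIWN */
};

/* Upper-case n code points from src into dst (capacity cap).
 * Returns the number of code points written, which may exceed n,
 * or (size_t)-1 if dst is too small. */
size_t full_upper(const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    const size_t entries = sizeof SPECIAL_UPPER / sizeof SPECIAL_UPPER[0];
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        const struct special_upper *sp = NULL;
        for (size_t k = 0; k < entries; k++) {
            if (SPECIAL_UPPER[k].from == src[i]) {
                sp = &SPECIAL_UPPER[k];
                break;
            }
        }
        if (sp != NULL) {
            if (out + sp->len > cap)
                return (size_t)-1;
            for (size_t j = 0; j < sp->len; j++)
                dst[out++] = sp->to[j];
        } else {
            if (out + 1 > cap)
                return (size_t)-1;
            uint32_t c = src[i];
            /* Stand-in for the simple mapping: ASCII a-z only. */
            if (c >= 0x61u && c <= 0x7Au)
                c -= 0x20u;
            dst[out++] = c;
        }
    }
    return out;
}
```

The point of the signature is that the caller cannot assume the output is as long as the input: six code points for "straße" come back as seven for "STRASSE", so the destination needs spare room or a length query first.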

ralphbsz · Nov 8, 2018

Let me attempt to summarize this discussion: Uppercasing a string is not always the same as uppercasing a single character. To uppercase a string, you have to do more than just uppercase every character in the string.

From this I conclude that I never ever want to work on a project that requires i18n; and if I have to, I'll have to buy lots of alcohol.

ikbendeman · Nov 8, 2018

Use the ASCII table.
Then loop over the string and, while a character is within the bounds of a-z (or A-Z), subtract (or add) the distance between the two cases.
It's up to you how you'd want to parse it, but you could use an array and do the usual, performing the operation during the parsing loop.
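
Something along these lines, as a minimal sketch that assumes an ASCII-compatible execution character set (ascii_upper is just a name picked for the example):

```c
#include <assert.h>
#include <stddef.h>

/* Upper-case the ASCII letters of a NUL-terminated string in place.
 * Anything outside a-z is left alone, including the bytes of
 * multi-byte UTF-8 sequences, so this is ASCII-only by design. */
void ascii_upper(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - ('a' - 'A'));  /* distance between cases */
    }
}
```

Given a buffer holding "Hello, World 42!" this leaves "HELLO, WORLD 42!" in it. A UTF-8 encoded ß passes through unchanged, which is exactly the limitation discussed in the earlier posts.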
