C Low level str to uppercase?

Spartrekus · Apr 21, 2018

How about...

C:

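#include <ctype.h>   /* toupper */
#include <stdlib.h>  /* free */
#include <string.h>  /* strdup */

char*
toucase_str_copy(const char *s)
{
        char *copy = strdup(s);
        if (copy) {
                /* walk the copy through a non-const pointer; s is const */
                for (char *p = copy; *p; ++p)
                        *p = toupper((unsigned char)*p);
        }
        return copy; /* remember to free() the copy! */
}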

Thank you so much all!

I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not a good choice and that strncpy is recommended instead. What do you think?

"/* remember to free() the copy! */" Why? Won't it be freed automatically after returning to main()? Is there no need for free(copy)?

p3rj · Apr 21, 2018

strncpy works even if your string isn't NUL-terminated, provided you supply a valid length, and your target buffer is large enough. The bad thing about that is that the result of strncpy may also not be NUL-terminated, if no NUL occurs within the given length. A reasonable template for this situation could be

C:

char *s = malloc(len + 1);
if (s != 0) {
    strncpy(s, source, len);
    s[len] = 0;
    return s;
}
...

strdup conceptually is roughly the combination of malloc and strcpy, probably similar to

C:

char *s = malloc(strlen(source) + 1);
if (s != 0)
    strcpy(s, source);
return s;

If your source is terminated properly, this should work well (and avoids the risk of forgetting to allocate the extra space for the terminating NUL).
In any case, strdup allocates memory from the heap like malloc, so you must free the result later, or you leak that memory. Only memory allocated on the stack by declaring local variables is released on return from a function (concepts like alloca notwithstanding).
Perhaps the best alternative would be strndup, which is somewhat of a combination of the functions, but is required to always produce a properly terminated result (unless it returns NULL). The result should also be passed to free later.
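
For illustration, here is a minimal sketch of roughly what strndup does, written with only standard C calls. The name bounded_dup is hypothetical and not a library function; the point is that the source may lack a terminator within maxlen bytes, and the result is terminated anyway.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, roughly equivalent to strndup(3): copy at most
 * maxlen bytes from src and always NUL-terminate the result. */
char *bounded_dup(const char *src, size_t maxlen)
{
    /* Look for a terminator within the first maxlen bytes only. */
    const char *end = memchr(src, '\0', maxlen);
    size_t len = end ? (size_t)(end - src) : maxlen;

    char *copy = malloc(len + 1);   /* +1 for the terminating NUL */
    if (copy != NULL) {
        memcpy(copy, src, len);
        copy[len] = '\0';
    }
    return copy;                    /* caller must free() the result */
}
```

As with strdup, the returned pointer belongs to the caller and has to be passed to free later.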

Bobi B. · Apr 21, 2018

Spartrekus said:
I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not a good choice and that strncpy is recommended instead. What do you think?

I cannot speak for your friend; better ask him. However, the two have really different purposes: strdup(3) will allocate a heap buffer and return a copy of the given string (transferring ownership of the new pointer to you, hence the necessity of free(3)), whereas strcpy(3) (or strncpy(3)) will copy a string into a buffer you provide; that can be a stack buffer, a heap buffer, memory-mapped memory, etc.

Spartrekus said:
"/* remember to free() the copy! */" Why? Won't it be freed automatically after returning to main()? Is there no need for free(copy)?

Your statement is correct -- the OS will free the process's memory upon exit; omitting free(3) is perfectly fine for a small, quick-running program. But consider a library, whose use you cannot foresee, or a daemon, which runs for days, weeks or months. You'd better not get used to being lazy.

leebrown66 · Apr 21, 2018

Spartrekus said:
"/* remember to free() the copy! */" Why? Won't it be freed automatically after returning to main()? Is there no need for free(copy)?

It will not be freed after returning to main. During the lifetime of main it will not be freed (which is the point).

When main finishes and returns, then yes all memory the running program used is returned to the OS.

ralphbsz · Apr 22, 2018

Spartrekus said:
I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not a good choice and that strncpy is recommended instead. What do you think?

The problem with strdup() is that the caller might give you an arbitrarily long string, perhaps something that is not actually null-terminated. This is the fundamental problem of "buffer overrun attacks": A program expects a string of a sensible length, perhaps of a maximum length, but the caller gives it a much longer string, perhaps an unterminated one. It is obviously a source of bugs; but it can also be much worse, and be a source of hack attacks.

In the case of strdup(), buffer overruns are actually not all that dangerous: strdup() will measure the length of the string, and if it doesn't happen to be properly null-terminated, it will eventually find some zeros in memory. Then it will allocate sufficient memory (or fail if we're out of memory), and do the copy. If the attacker gives you a ridiculously long or unterminated string, it will just use a lot of memory temporarily.

But in general, a well written program that manages strings needs to think about what the longest possible input can be, and then enforce that consistently. This means using the strnXXX version of all functions, and dealing with the fact that many strings will not be null-terminated. It also requires coming up with a plan of what to do when the string is longer than acceptable: shorten it? give an error message?

Why? Won't it be freed automatically after returning to main()? Is there no need for free(copy)?

No. There is no automatic free(). Unless you want to leak memory, for every successful malloc(), there has to be a free(). Things that are allocated with malloc() are not automatic variables, which go out of scope at the end of a block (typically at the end of a function) and get destructed.

As an example, you had the following block of code above:

Code:

      char *r = malloc( ... something not important ...);
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );

This code is a memory leak. Why? Because you first malloc() something, then return, and because you have already returned, you never get to the free line. (In addition, this code also has a zillion other problems, like it was using the wrong size, and it frees the wrong variable, but those are minor). If you code something like this, either the caller has to supply a buffer, or the caller has to understand that the memory was allocated for them and needs to be free'd (like strip).
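
A minimal sketch of the second option (the function allocates, the caller frees). The name mem_dup and its parameters are made up for illustration; this is not the code from earlier in the thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of size bytes starting at src,
 * or NULL if the allocation fails. Ownership of the copy passes to the
 * caller, who must free() it; the function itself never frees anything. */
void *mem_dup(const void *src, size_t size)
{
    void *copy = malloc(size);
    if (copy == NULL)
        return NULL;
    return memcpy(copy, src, size);  /* memcpy returns its destination */
}
```

A caller would do something like char *c = mem_dup(text, strlen(text) + 1); use c, and then call free(c) exactly once when done.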

The requirement that everything that is malloc'ed also has to be free'd is the main difficulty in programming in C or traditional C++ (which has exactly the same problem with new and delete). And it causes particular difficulty in string handling, because there one frequently has to create new strings of different lengths. This is the single most important reason that modern programming languages have taken the management of memory allocation away from the programmer, because it is just pointless extra work and leads to bugs.

Bobi B. · Apr 22, 2018

Well, if we go this way, then what about unlink("/lib/libc.so.7");? Let's avoid unlink(2), because it is dangerous? An API/standard library function has a purpose. The developer is the one who has to use it the right way. strdup(3) will allocate memory for and make a copy of the given string. Can you pass it a non-null-terminated string? Yes. What will happen? Either a segmentation fault, or it will "succeed".

Can you delete a system-critical file with rm(1)? Yes (I have, on several occasions). Well, it's just that rm does what you asked and what it is designed for, not what you wanted or meant. The same goes for APIs.

Spartrekus · Apr 22, 2018

Hi, first of all, thank you very much for opening the discussion, and for your amazing posts.

The memory leak was really bad; I am sorry that I came up with such a solution.

Well, to tell the truth, I have an issue with toupper() because of special chars.
Checking for 'a' to 'z' and mapping to 'A' to 'Z' is good.
I do believe that we should not use toupper(), but do it by hand, for the sake of learning and, really, full control.
Ideally, many portable programs would need only stdlib, stdio, and string on any machine.

But well... here it is:
É and é, È and è, Ä and its lowercase form... üöä, ... ÖÄÜ, ... those chars need more "space".
I am not so sure then that strdup is good, because it gives identical space, but sometimes we
need more space for chars.

This is why allocating double the "space" might make room for UTF-8 special chars...

Bobi B. · Apr 22, 2018

Then let's dig deeper. There is this thing called a code page that defines mappings between characters and bytes (like 0x30 <=> '0' (zero), 0x41 <=> 'A' (capital Latin A), 0xc0 <=> 'А' (capital Cyrillic A in code page 1251), 0xc0 <=> 'Ŕ' in code page 1250). Then you might be able to encode each character in a single byte, or you might need multiple bytes to encode a single character (there are lots of characters in Chinese or Japanese, for example). UTF-8 provides a universal way to encode characters that doesn't need a mapping table.

Then there is your program and its internal representation of text strings. You can use the char type, which is usually an 8-bit type (often signed). There is wchar_t, which is 16-bit on some platforms (like Windows) or 32-bit on other platforms (like most Unixes); read multibyte(3). Using chars with UTF-8 is fine in most cases, but you should be aware that a byte is not necessarily the same as a character, and vice versa (the code above ignores that). In other words, when looping over a buffer with UTF-8 encoded data you should be parsing bytes to extract characters. Or you simply use an array of wchar_t instead, and manipulate strings with the wide-character versions of the respective functions.

C:

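#include <stdlib.h>  /* free */
#include <wchar.h>   /* wcsdup */
#include <wctype.h>  /* towupper */

/* call with toucase_wstr_copy(L"Wide-character string constant"); see wprintf(3) */
wchar_t*
toucase_wstr_copy(const wchar_t *s)
{
        wchar_t *copy = wcsdup(s);
        if (copy) {
                /* iterate with a non-const pointer; s itself is const */
                for (wchar_t *p = copy; *p; ++p)
                        *p = towupper(*p);
        }
        return copy; /* remember to free() the copy! */
}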

PS: BTW there is even more. There are locales; in some languages, like German, 'ß' (lowercase) translates to 'SS' in uppercase. Letters in the alphabet are not always ordered the same way. i18n is hard.
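
To make the "a byte is not necessarily a character" point concrete, here is a small sketch that counts code points in a UTF-8 string. It assumes the input is valid UTF-8, and the function name utf8_count_codepoints is invented for this example.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string that is assumed to be
 * valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx; every
 * other byte starts a new code point. */
size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string like "Été" encoded as UTF-8, strlen() reports 5 bytes while this function reports 3 code points, which is why a byte-wise loop over such a string does not visit characters.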

ralphbsz · Apr 22, 2018

Bobi B. said:
strdup(3) will allocate memory for and make a copy of the given string. Can you pass it a non-null-terminated string? Yes. What will happen? Either a segmentation fault, or it will "succeed".

And if "you" (meaning the routine that uses strdup) don't bother to check that the string is sensible, then strdup() will do something insensible. Which is what I said above: If you want to write software that pleases the user even when used in edge cases, and is not vulnerable to doing random crazy things when presented with nonsensical inputs, then using functions such as strdup() is dangerous (except in those cases where it is safe). An easy way to deal with string inputs is to clearly define maximum lengths, and then use functions such as strndup(). Another way is to handle arbitrary-length strings correctly, which means handling out-of-memory conditions in all cases. Either way, string handling in C takes a lot of work to get correct.

ralphbsz · Apr 22, 2018

Spartrekus said:
I do believe that we should not use toupper(), but do it by hand, for the sake of learning and, really, full control.
Ideally, many portable programs would need only stdlib, stdio, and string on any machine.

For learning programming, I understand your argument. As a student of software engineering, you need to know (using this silly example of uppercasing a string) how to do it in the 7-bit ASCII character set, and you need to be able to get that perfect, without any bugs.

But as a professional software engineer, it is insane if you have to rewrite simple functions (like toupper()) every time you need them. Or it is just as insane that every engineer ends up having his own library of such functions. There are two reasons for that. First, getting them done right is actually very difficult. We have discussed some of the problems of internationalization and character sets here, and writing a toupper() routine that can handle Unicode in its full glory, including locale-dependency, is an enormous amount of work. And for a person who doesn't understand Unicode completely, it is likely to lead to very buggy software. For example, if I were to try to write it, I might get it to work correctly for English and German (the two languages I write regularly), and forget that there are some exceptions already in Swiss German (I think it uses a slightly different character set, at least Swiss keyboards have different keys on them). There is no way I could get the accents in French right, since I didn't learn French in school. As an extreme example: there is a textbook on how to handle CJKV (Chinese Japanese Korean Vietnamese) character sets, written by a guy I know a little bit, and the book is about 2 inches thick. I would never attempt to write toupper() for CJKV without first reading the whole book cover to cover, and then sitting down with the author for a few hours so he can critique my plan. If I tried to write toupper() for CJKV given my current knowledge, I would end up with a completely broken disaster.

The second argument against writing these kinds of things yourself is simply efficiency. Doing them is hard and takes a lot of work. By using canned libraries of high-quality software, you can do your real work faster and more productively. As a software engineer, you want to solve interestingly complex problems, and you get paid to solve those problems quickly and correctly. You don't get paid to implement toupper() again, if a very high quality version is already available.

Old joke, from 30 years ago when using computers to run a business was still difficult and required lots of programming: A medium-size business has decided to go away from doing all accounting by hand on paper. So they buy a computer, and hire a programmer, and ask the programmer to write them billing, accounts payable, and accounts receivable. He says that it will take about 3 years, one year for each (in 1980 that was actually a very reasonable estimate). So after a year and a half, at the halfway point of the development, the business owner decides to check in with the programmer, and asks him how the progress is going. The answer is: He's already done writing an editor, and is now working on the Great American Compiler.

What is the moral of this story? We can't build all our own tools. In order to be productive, and have fun with computers, we need to use tools and components that are engineered by others. Now that doesn't mean that we can ignore learning how these tools work. Yes, starting programmers have to learn how to write a simple version of toupper() that works for ASCII only. And computer science students will spend one semester in a compiler class, and they will probably write a tiny compiler for a silly little language as a project. But if they need to develop "accounts receivable", they don't start with just stdlib, stdio and string, but they use pre-made things like web servers, scripting languages, databases, report generators, and so on.

Spartrekus · Apr 22, 2018

> There are two possibilities: open source or closed source:
1) If you use Unix or BSD, you need to get your hands into C / C++ and programming.
You are free to work, study and develop skills anywhere and anytime, on any machine available in the world.
In 10 years, your code and work will still work.
The most important are algorithms that will keep working over the time! This is gold.

2) If not, you use MS Windows like a common folk.
Excel, Office,... are made for everyone.
You may lock yourself into closed source software forever.
Good then, go use C#, Windows and Qt.

In 10 years, your code and work will likely no longer work.
It is highly efficient, simple to use, reliable, and excellent software.
The Best choice, really !
(=Considering long term, this is pure waste.=)
Let's take example of visual basic for instance, .net, ... or take for instance java ?
doc, docx, ppt, pptx, and their lack of compatibility. It looks different depending on the installation.
Better to use pdf, ok, but ...
This software is of course excellent, but it is not open source.
But no one forces you to learn how to "smoke" (closed source and dependency).

> There may be two points about programming (or more) to be considered:
1) It is noteworthy that using libraries is good; it saves time, but it can be an issue for portability and later use.
2) Learning the basics gives you full freedom, and enables success.
As an example, many innovative companies rely on building new production lines, machines and simulators themselves. It takes investments of millions, but the payoff is enormous ($).

> Major focus: Education versus Money?
- It is important to focus on knowledge and the education of our society, step by step, without considering money, just considering education. Having full knowledge of how things work and the ability to modify them is good. Having basic source code allows you to steadily create more complex code.
- Publication of source code in various fields (chemistry, maths, physics, ... computer science) is good.
- Using libraries may help but does not always allow deep understanding, depending on interests and situation. The gaming and software industries live from fast implementation. Libraries are there, allowing good software to be created in a limited time. Highly efficient.
- It depends on each person's interests and demands. Open source, closed source, libraries or not, ... it is up to you. You are free.

Spartrekus · Nov 3, 2018

This would give a possible way using C.

Code:

//////////////////////////////////////////////////////////////////
char *strtouppercase( char *str )
{ 
      char ptr[strlen(str)+1];
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      char *base=ptr; return (base);
}

shkhln · Nov 3, 2018

The array there is (roughly) lexically scoped, so the returned pointer is invalid. On a related note, I was surprised to learn that C99 variable-length arrays are actually wildly unsafe. You'd think if something is defined in the language standard, it's not completely YOLO.
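
One way around the dangling pointer, offered as a sketch rather than as what the poster intended, is to uppercase the caller's own buffer in place, so nothing is returned that could outlive its storage. The name ascii_upper_inplace is hypothetical, and like the code above it only handles ASCII letters.

```c
/* Hypothetical in-place variant: the caller owns the buffer, so there is
 * no local array whose address could escape. ASCII letters only. */
void ascii_upper_inplace(char *s)
{
    for (; *s != '\0'; ++s) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - 'a' + 'A');
    }
}
```

The caller must pass a writable buffer (for example char buf[] = "text";), not a string literal, since modifying a literal is undefined behaviour.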

Spartrekus · Nov 3, 2018

shkhln said:
The array there is (roughly) lexically scoped, so the returned pointer is invalid. On a related note, I was surprised to learn that C99 variable-length arrays are actually wildly unsafe. You'd think if something is defined in the language standard, it's not completely YOLO.

I noticed it just now... using the returned pointer gives inaccurate content from time to time.

ralphbsz · Nov 3, 2018

First the obnoxious comment: You must urgently learn about C's memory model. Where do variables come from? What memory do they use? What is their lifetime? What is the difference between stack, heap, and static? If you don't know this stuff by heart, you are completely unsafe as a programmer.

Here is my attempt. You can argue that my error handling isn't what the callers really wanted; that would be a good argument to have. You can also argue that trying to make this code rudimentarily thread-safe is a waste of time (that would not be a good argument to have). And feel free to find mistakes.

Code:

// You have to comment your code.  Comments are actually more important than
// code, because more people read comments.

// This function takes an input string, which must be null-terminated, and in
// normal operation return a newly malloc'ed copy of the string, where all the
// lower case characters have been changed into upper-case characters.

// Restrictions and error handling: The input string has to be 7-bit ASCII,
// meaning all characters in the input must be between 0x00 and 0x7F.  Why this
// restriction?  Because if we find different characters, we do not know the
// character set and encoding of the string (it could be UTF-8, it could be
// 8859-1 ...), and we don't know how to uppercase those in general.  Also, the
// input string must be at most 1024 characters long (including the terminating
// zero); otherwise, we have to assume that someone is performing a buffer
// overrun attack.  If either restriction is violated, the function will return
// NULL.  It will also return NULL if there isn't enough memory to make a copy
// of the string.

// TODO: The error handling could be done many other ways.  We could return an
// errno, and pass the copied and converted string back in a second argument.
// We could raise an exception.  The best form of error handling depends on the
// needs of the functions that call this.

#include <stdlib.h>  // malloc, free
#include <string.h>  // strnlen

#define MAX_STRING_TO_UPPERCASE (1024)

char* strtouppercase(char* input)
{
  // Measure the length of the input string, only looking at the maximum length:
  size_t l = strnlen(input, MAX_STRING_TO_UPPERCASE);
  // If the input is too long: done, return with error.
  if (l >= MAX_STRING_TO_UPPERCASE) {
    return NULL;
  }

  // Try to get memory for the output.  If there isn't any: return with error.
  char* output = (char*) malloc(l+1);
  if (!output) {
    return NULL;
  }

  // Iterate over both input and output strings, using pointers.  End the
  // iteration when the input string ends with a nul character.  By
  // construction, that has to happen within the maximum length.  We assert on
  // that anyway, in case someone is modifying the input string while we are
  // running (race condition in a multi-threaded process).
  char* p_input = input;
  char* p_output = output;
  // Start of the loop: nothing, the two variables are already initialized.
  // End the loop when there are no more characters to copy.
  // Loop by incrementing both pointers.
  for (; *p_input; ++p_input, ++p_output) {
    // If we run off the end of the input, someone moved the goalpost.
    if ((size_t)(p_input - input) >= l) {
      free(output);
      return NULL;
    }
    // If the input has characters with the high bit set, it wasn't 7-bit:
    if (*p_input & 0x80) {
      free(output);
      return NULL;
    }
    if (*p_input >= 'a' && *p_input <= 'z') {
      *p_output = *p_input + 'A' - 'a';
    }
    else {
      *p_output = *p_input;
    }
  }

  // Success. We need to quickly terminate the output string, then we're done.
  *p_output = '\0';
  return output;
}
// Reminder for the caller: You have to free the results!

yuripv · Nov 3, 2018

Well now, how about converting the input multibyte string to wide characters, and simply using towupper()/towlower() on that? You'd need to convert it back to MB then, of course. But at least that way you don't have to care about encodings other than setting correct LC_CTYPE using setlocale().

yuripv · Nov 3, 2018

And here's a quick and dirty example:

Code:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

char *
strtoupper(const char *mbs)
{
        wchar_t *wcs;
        char *ret;
        size_t inlen;
        size_t outlen;
        int i;

        inlen = strlen(mbs);

        wcs = calloc(inlen + 1, sizeof(wchar_t));
        if (wcs == NULL)
                return (NULL);
        if (mbstowcs(wcs, mbs, inlen) == (size_t)-1)
                return (NULL);
        for (i = 0; i < wcslen(wcs); i++)
                wcs[i] = towupper(wcs[i]);
        outlen = wcstombs(NULL, wcs, 0);
        if (outlen == -1)
                return (NULL);
        ret = malloc(outlen + 1);
        if (ret != NULL)
                /* outlen + 1 so that the terminating NUL is written too */
                (void) wcstombs(ret, wcs, outlen + 1);

        return (ret);
}

int
main(void)
{
        const char *input = "foo тест bar";
        char *output = NULL;

        setlocale(LC_CTYPE, "");

        if ((output = strtoupper(input)) != NULL)
                printf("output=%s\n", output);
        free(output);

        return (0);
}

EDIT: Yes, I see that I forgot to free wcs, but it's not that important for an example

ralphbsz · Nov 4, 2018

That's reasonable (let's not quibble about corner cases, like buffer overruns). But I have a question, which comes from not being a Unicode / character set expert: does the conversion to/from wchar_t work for all unicode characters? Even those with code points > 65536? These days there are 32-bit Unicode characters. It might work ... sizeof(wchar_t) is 4, which is in theory enough space.

yuripv · Nov 4, 2018

ralphbsz said:
That's reasonable (let's not quibble about corner cases, like buffer overruns). But I have a question, which comes from not being a Unicode / character set expert: does the conversion to/from wchar_t work for all unicode characters? Even those with code points > 65536? These days there are 32-bit Unicode characters. It might work ... sizeof(wchar_t) is 4, which is in theory enough space.

Yes, it should work for all unicode characters we have defined in tolower and toupper maps in the LC_CTYPE category for current locale. https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt is a full set of characters defined by latest Unicode version (11.0.0), so the highest defined character at the moment is 0x10FFFD.

zirias@ · Nov 6, 2018

Without reading all the comments in depth: Why do you insist on returning a new string object you need to allocate memory for? The idiomatic C way is to let the caller provide the memory, like this:

C:

size_t strtoupper(char *result, size_t n, const char *str)
{
    // write a maximum of n characters to result, including a NUL terminator
    // return the number of characters written
}

This way, your function doesn't need to care about memory allocation -- if the caller wants dynamically allocated memory, it will just pass a pointer to that.
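
A possible body for such a function, as a sketch only: the name to_upper_buf is made up (chosen to avoid clashing with the strtoupper examples above), and the conversion uses toupper(3) byte by byte, so it inherits that function's single-byte limitation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical caller-provided-buffer variant: writes at most n bytes to
 * result, including the NUL terminator, and returns the number of bytes
 * written. Output is silently truncated if result is too small. */
size_t to_upper_buf(char *result, size_t n, const char *str)
{
    size_t i = 0;

    if (n == 0)
        return 0;               /* no room even for the terminator */
    for (; i + 1 < n && str[i] != '\0'; ++i)
        result[i] = (char)toupper((unsigned char)str[i]);
    result[i] = '\0';
    return i + 1;               /* bytes written, terminator included */
}
```

With char out[4] and the input "hello", this writes "HEL" plus the terminator and returns 4; whether truncation should instead be reported as an error is a design choice left open here.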

As for how to do the translation: toupper(3) is the *only* correct thing to do, as the behavior depends on the current locale. Sure, if you want to limit yourself to the English language and ASCII (7-bit) encoding, you can roll your own ...

ralphbsz · Nov 6, 2018

Because the caller can't know how much memory will be required. Because some strings may change size when being uppercased in certain Unicode situations.

yuripv · Nov 6, 2018

toupper() isn't exactly correct as it's limited to single byte locales accepting "unsigned char" as its argument; towupper/towlower are the way to go as I've shown above.

As for allocating memory, it was not that important for an example of doing the conversion.

EDIT: to clarify, toupper() prototype says argument is an int, yes, but description also says "The argument must be representable as an unsigned char or the value of EOF.".

olli@ · Nov 7, 2018

ralphbsz said:
Because the caller can't know how much memory will be required. Because some strings may change size when being uppercased in certain Unicode situations.

Yes and no ...
Here's a specific example: In the Turkish language, there are two lower case letters “i” and “ı” (with and without a dot). Both of them have their own uppercase letter: “İ” and “I”, respectively (again, with and without a dot). So, when using UTF-8, a case conversion changes the size of the letter in both cases because “i” and “I” take just one byte, while “ı” and “İ” take two bytes.
However – Normally you don't work with UTF-8 strings directly. Instead, you decode them to wide strings on input, then work with them, and encode back to UTF-8 on output. And in wide strings every character has the same size. So a case conversion will never change the size of a string, including when using Unicode (wide strings).
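
The byte counts in the Turkish example follow directly from the UTF-8 encoding ranges. A small sketch (the function name utf8_encoded_len is invented here, and surrogate code points are not rejected):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs to encode one code point, or 0 if the
 * value lies beyond the Unicode range. */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)     return 1;   /* ASCII, e.g. 'i' and 'I' */
    if (cp < 0x800)    return 2;   /* e.g. U+0130 and U+0131 */
    if (cp < 0x10000)  return 3;
    if (cp < 0x110000) return 4;
    return 0;
}
```

Here "i" (U+0069) and "I" (U+0049) need one byte each, while "ı" (U+0131) and "İ" (U+0130) need two, so converting case between them changes the encoded size.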

ralphbsz · Nov 7, 2018

It's more complicated.

Even when using wide characters, case conversion can change the length of the string. The canonical example is the German es-zet "ß" (the character looks similar to a lowercase Greek beta, and is pronounced like a sharp voiceless s): Until recently it only existed as a lowercase character; when uppercasing it, it gets converted to SS, and takes up two characters. Now, the German orthography standards body has decided to create a new uppercase character for it, but neither Austria nor Switzerland has agreed (Switzerland doesn't even use that character). Unicode added the uppercase character "ẞ" a while ago, but it is not yet in widespread use, since the official uppercasing rule is very recent. So for now the correct uppercase conversion depends on whether you are in Austria or Switzerland or Germany, and how up-to-date your OS is. And most of the time it will turn a single lowercase character into two uppercase characters.

(Side remark: It is fascinating that the rules for typography change with time. That makes "string comparison" across documents of different age difficult, meaning impossible for computer programs. For example, the es-zet used to be considered a ligature of "s" and "z", and about 100 years ago, it was uppercased to "SZ". In my high school we had a very old map of the world's oceans, and it showed the Pacific Ocean as "GROSZE ODER STILLE OZEAN" (the ocean was traditionally called "large ocean" in German). The question whether the meaning of the word "GROSZE OZEAN" is the same as today's "GROẞE OZEAN" or "PAZIFIK" is something that strcmp() cannot answer, and has to be left to (human) linguists.)

(Another side remark: This is one of the examples that shows that changing the case of a character loses information: Take the es-zet, uppercase it into "SS", lowercase it again, and you have "ss", which is not the original starting point. The same happens in Greek, where they have a different "sigma" that is used at the end of a word, but in uppercase there is only one "sigma" letter, so after lowercasing it back, the difference between the two sigma glyphs is lost.)

I think that there are also cases where lowercase characters are expressed as ligatures or precomposed accented characters, and the corresponding uppercase characters either are not a ligature, or no precomposed version exists. An example might be the lowercase character "ffl" (which as a ligature is a single character = code point = wide character), which *might* turn into the three characters "FFL" because uppercase characters are not glued together visually.

And then there are "digraphs". I know they are sort of like ligatures, but also sort of like single characters. If I remember right (not sure at all), the Croatian digraph "lj" is expressed in Unicode as a single (wide) character or code point, and is *not* equal to the string "lj" if written as two separate characters. In contrast the ligature "ffl" (also expressed as a single wide character) would compare equal to the 3-character string "ffl". But I can't remember whether the digraphs exist in uppercase versions too, or whether they turn into two characters "LJ" when uppercasing them.

And that's all in addition to the change of length of the encoded string in UTF-8, which you already discussed.

In summary: Unicode is complicated.

olli@ · Nov 7, 2018

ralphbsz said:
It's more complicated.

While I agree with you in principle, the cases that you mentioned cannot happen with the wide character API as defined by ISO C. The towupper(3) and towlower(3) functions take exactly one character and return exactly one character. So, as long as you use the standard API, there is no way that the length of a wide string can change by upper/lower case conversion.

By the way, some ligatures (but not all) do have upper case variants in Unicode, even though the characters are not connected, for example “ij” (U+0133) and “IJ” (U+0132).
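
The one-in, one-out property of towupper(3) described above can be sketched as follows. The helper name wide_upper_inplace is hypothetical; which characters actually change depends on the current LC_CTYPE locale.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string in place. Each element is
 * replaced by exactly one element, so the length can never change. */
void wide_upper_inplace(wchar_t *s)
{
    for (; *s != L'\0'; ++s)
        *s = (wchar_t)towupper((wint_t)*s);
}
```

Because the loop only overwrites existing elements, wcslen() gives the same result before and after the call, which is exactly why mappings such as "ß" to "SS" cannot be expressed through this API.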