C Low level str to uppercase?

Hello,

I would like to find a way to convert a string (char *mystring) to uppercase in C: cross-platform, low-level, from scratch. Something like:
C++:
printf( "%s\n", toupper_alternative_lowlevel( "hello world" ));

I think I will probably do it by hand: read the string and convert 'a' to 'A' (chr 65), and so on up to 'z' to 'Z'. It is the long way, but at least it is very low-level and portable.

Would you happen to know of any existing method?
 
That can get quite tricky depending on what exactly you are trying to do:
  • Not all character sets have a-z consecutively (EBCDIC comes to mind, for example i is character 137, j is 145).
  • Internationalization -- some languages have no upper/lower case at all.
  • Unicode -- not all characters occupy a single byte.
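To illustrate the first point: a conversion that never assumes the letters are contiguous can look each character up in a pair of alphabet strings instead of doing arithmetic on character codes. A minimal sketch, where the names upper_portable and upper_portable_char are made up for illustration and only the 26 basic Latin letters are handled:

```c
#include <string.h>

/* Hypothetical helper: maps one basic Latin letter to uppercase by its
   position in the alphabet, so the numeric character codes never matter. */
static char upper_portable_char(char ch)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *hit;

    if (ch == '\0')              /* strchr would otherwise match the terminator */
        return ch;
    hit = strchr(lower, ch);
    return hit ? upper[hit - lower] : ch;
}

/* Converts a writable, '\0'-terminated string in place. */
void upper_portable(char *s)
{
    for (; *s != '\0'; ++s)
        *s = upper_portable_char(*s);
}
```

This is slower than a range check, but it should behave the same on ASCII and EBCDIC systems. It does nothing about the internationalization and Unicode points.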
 
well, self explaining ...

C++:
char *touppercase(char *instr ) {
   int c = 0;
   while (instr[c] != '\0')
   {
      if  ( (instr[c] >= 'a') && (instr[c] <= 'z'))
         instr[c] = instr[c] - 32;   /* ASCII only: 'a' - 'A' == 32 */
      c++;
   }

   return instr;   /* same buffer, converted in place */
}



or a simple pipeline may look like:
C++:
// low level

    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        int begin = 1;
        int c ; int d ;
        c = getchar();
        while( c != EOF )
        {
           if (( c >= 'a') && ( c <= 'z'))
              d = c - 32;
           else
              d = c;
           putchar( d );
           begin = 0;
           if ( c == '\n' ) begin = 1;
           c = getchar();
        }
    return 0;
    }
 
I would replace the hard coded -32 with ('a'-'A'). Other than that, I would say that is portable, maybe not efficient (I would use pointers), but more efficient than using the standard libc toupper function (which is locale aware and can handle split range sets, check it out here)
 
If you assume only US-ASCII, your solution isn't half bad. I would not use array notation for strings; it is correct C, but not idiomatic for C-style string processing. Also, since you modify the string in place, why do you bother returning the pointer? The caller by definition has it already. The only reason to return the pointer is so that users can chain function calls, but I personally don't like that style, since you are relying on a side effect of the function, and there is no clear distinction whether the string is an input, an output, or an input/output argument, or a return value. You also don't need the extra parentheses around the expressions in the if test; anyone who programs in C needs to know simple operator precedence well enough to read that statement.

I also removed your extra variable, which is not needed: You can modify the pointer that is passed in; the caller's copy will not be modified. Once that's done, I think that a for loop is clearer than a while loop, but that's a question of taste.

Another deep philosophical question is: Should you modify the string in place? I can understand why one would want to do that in C, because there copying strings involves memory management, and a huge amount of work tracking allocations and frees (which is usually handled wrong, which is why amateur-written C code tends to leak memory like the Titanic). But in general, it is cleaner if a function is a "pure" function, which gets constant arguments in and returns a new value. More about that below.
C++:
#include <assert.h>

void touppercase(char *instr ) {
    for(; *instr!='\0'; instr++)
    {
        assert(*instr <= '~'); // Twiddle is the last ASCII character before internationalization kicks in
        if (*instr>='a' && *instr<='z')   // lowercase ASCII letter?
            *instr -= 'a'-'A';            // shift down to the uppercase range
    }
}

The moment you get internationalization or Unicode into the game, this becomes exponentially difficult. I'm not even sure whether you can define what "capitalize" means in a locale-independent fashion; it is possible that the definition of what the uppercase character is might depend on language and country. Furthermore, there exist lowercase characters for which there is no uppercase equivalent. Until 6 weeks ago, the German "sharp s" a.k.a. "eszett" was one such example: it only existed in lowercase form, and when capitalizing a word, it was replaced by a double "s": "Straße" -> "STRASSE" (that word means "street"). See how the string length changes? Starting in 2018, the German organization that standardizes orthography defined a new character which is an uppercase eszett, but I doubt that many Unicode implementations are ready for it. And I don't know whether the standards bodies for Austria, Switzerland, and other German-speaking places have followed that example yet. There are other examples of characters that behave strangely when capitalizing. But what this leads to: In Unicode, the length of a string may change when it is capitalized, so you are forced to allocate a new string and copy it (see above).
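To make the "length changes" point concrete, here is a minimal sketch. It assumes ISO 8859-1 (Latin-1) input, where the sharp s is the single byte 0xDF, so the result grows by one byte per sharp s. The function name is made up for illustration, and all other non-ASCII letters are deliberately left untouched:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, ISO 8859-1 assumed: uppercases a-z and expands
   the sharp s (byte 0xDF) to "SS". Returns a malloc()'d string or NULL. */
char *upper_latin1_eszett(const char *src)
{
    size_t len = strlen(src);
    size_t extra = 0;
    const char *p;
    char *out, *w;

    for (p = src; *p != '\0'; ++p)      /* first pass: count the growth */
        if ((unsigned char)*p == 0xDF)
            ++extra;

    out = malloc(len + extra + 1);
    if (out == NULL)
        return NULL;

    w = out;
    for (p = src; *p != '\0'; ++p) {    /* second pass: convert and copy */
        unsigned char ch = (unsigned char)*p;
        if (ch == 0xDF) {
            *w++ = 'S';
            *w++ = 'S';
        } else if (ch >= 'a' && ch <= 'z') {
            *w++ = (char)(ch - ('a' - 'A'));
        } else {
            *w++ = (char)ch;
        }
    }
    *w = '\0';
    return out;                         /* caller must free() */
}
```

The six-byte Latin-1 input "Straße" comes back as the seven-byte "STRASSE", which is why an in-place conversion cannot work in general.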

I think the correct answer to the question is: Anyone who tries to implement this today is crazy and stupid; one should instead find a good string library with internationalization, and call the correct function. Now, if you give this answer in a homework problem or during a job interview, and get in big trouble, please don't blame me.
 
Hello,

I have tried the following method to convert a string to uppercase (in C).

I recently noticed that it completely eats up my memory (as reported by "top" or "ps aux").

Some help would be greatly appreciated.

The problem is here:
Code:
     size_t siz = sizeof ptr ;
      char *r = malloc( sizeof ptr );
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );
Why?

Many thanks in advance !

Code:
////////////////////////////////////////////////////////////////////
char *strtouppercase( char *str )
{  
      char ptr[strlen(str)+1];
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      size_t siz = sizeof ptr ;
      char *r = malloc( sizeof ptr );
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );
}
 
Delete the line free( ptr );, since it will never be reached because it comes after the return statement, and besides this, it is wrong anyway, since ptr is allocated on the stack and not on the heap, so ptr would be freed implicitly.

In addition it is not necessary to first convert the string on the stack and later copy it to a dynamic memory allocation, you could convert it directly into the memory allocation instead.
C:
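#include <stdlib.h>
#include <string.h>

char *strtouppercase( char *str )
{
      char *ptr = malloc(strlen(str)+1);   /* +1 for the terminating '\0' */
      int i;
      if (ptr == NULL)
            return NULL;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[i] = str[i] - 32;
           else
              ptr[i]=str[i];
      }
      ptr[i]='\0';
      return ptr;   /* the caller has to free() this */
}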

PS: Perhaps you might want to impose a maximum length to circumvent cases where the passed string is not '\0' terminated.
C:
      size_t len = strlen(str);
      char *ptr = malloc(len <= 255 ? len + 1 : 256);
      ...
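One thing to keep in mind with a length limit: strlen() itself already walks all the way to the terminator, so the bound has to apply to the scan as well, not only to the malloc() size. A minimal sketch under that assumption (the name strtouppercase_bounded and the max_len parameter are made up for illustration; ASCII only):

```c
#include <stdlib.h>

/* Hypothetical variant: reads at most max_len characters from str and
   returns a malloc()'d, '\0'-terminated uppercase copy, or NULL. */
char *strtouppercase_bounded(const char *str, size_t max_len)
{
    size_t len = 0;
    size_t i;
    char *out;

    while (len < max_len && str[len] != '\0')   /* bounded strlen */
        ++len;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        char ch = str[i];
        out[i] = (ch >= 'a' && ch <= 'z') ? (char)(ch - ('a' - 'A')) : ch;
    }
    out[len] = '\0';
    return out;                                 /* caller must free() */
}
```

With max_len set to 255 this matches the 256-byte limit suggested above, and longer input is simply truncated.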
 
hello

the problem is that Tcc and Gcc tell me

warning function returns address of local variable
return ptr

I would be glad to have this without warning.... if possible and clean for the memory usage.
 
Of course the compilers complain about returning ptr: it is a stack variable. Simply read again what I have written, and try the code that I sent. If you don't understand it, reread it until you do.
 
My assumption is that you want to allocate memory for the resulting string each time the function is called. The caller is expected to free the memory when it's finished, i.e.:
Code:
char *x = strtouppercase("some string");
printf("Here it is: %s\n", x);
free(x);
(typed without testing):
Code:
char *strtouppercase( char *str )
{  
      char *ptr = malloc(strlen(str)+1);
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      return ptr;
}
Your existing code fills the temporary 'ptr' array, which is stack data, not heap data; returning a pointer to it is undefined, because the array is gone once the function returns.
 
It depends on what do you expect that function should really do. If you stick to the plain ASCII you could do:

C:
char* tou(char* str) {
        if (!str) return NULL;

        char* cp = str;
        while (*cp) {
                if (*cp >= 'a' && *cp <= 'z')
                        *cp++ &= 0xdf;
                else
                        cp++;
        }
        return str;
}

But this function expects you can write to whatever cp is pointing to, i.e. you can't do tou("test").

So you could be returning new string:
C:
#include <string.h>
#include <stdlib.h>

char* tou2(char* cp) {
        if (!cp) return NULL;
        char* nbuf = (char*)malloc(strlen(cp)+1);

        if(!nbuf) return NULL;

        char* wp = nbuf;
        while(*cp) {
                *wp++ = (*cp >= 'a' && *cp <= 'z') ? (*cp++ & 0xdf) : *cp++;
        }
        *wp = 0;
        return nbuf;
}
But now you need to pay attention to the returned string, as you need to clean up after this function (i.e. you need to call free). If you use tou2("test") you're not keeping a reference to the returned string; you should be using char* s = tou2("test"); free(s);

I'm not that big of a fan of ?: operator. More often than not it makes the code less readable. But as you used it in the examples I did the same.

EDIT: fixing classic off-by-one BOF
 
Code:
      size_t siz = sizeof ptr ;

The size of a pointer is not the same as the size of the addressed object. Consider:

Code:
char y[127];
char *x = y;
printf("%zu vs. %zu\n", sizeof x, sizeof y);   /* sizeof yields size_t, so use %zu */
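The same trap shows up as soon as an array is passed to a function: the parameter is a pointer, so sizeof no longer reports the array size, and the number of bytes a copy needs has to come from strlen(). A small sketch (both function names are made up for illustration):

```c
#include <stddef.h>
#include <string.h>

/* Inside the callee, s is a pointer: this is the pointer size,
   no matter how large the caller's array is. */
size_t size_seen_by_callee(char *s)
{
    return sizeof s;
}

/* What a copy of the string actually needs: the characters plus '\0'. */
size_t bytes_needed_for_copy(const char *s)
{
    return strlen(s) + 1;
}
```

For char y[127] = "abc"; the first function returns sizeof(char *), the second returns 4, and sizeof y in the caller is still 127.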
 
What's wrong with toupper(3)? It is ISO C and part of the standard C library:
C:
#include <ctype.h>
#include <stdio.h>

char*
toucase_str(char *s)
{
    char *initial = s;
    for (; *s; ++s)
        *s = toupper(*s);
    return initial;
}

int
main(void)
{
    char s[] = { "Hello, World!" };
    printf("%s\n", toucase_str(s));
    return 0;
}

Edit, after taking into account unitrunker's remark: replace *s = toupper(*s); with *s = toupper((unsigned char)*s);.
 
Perhaps no more, but that code was once problematic. Some isxxxx/toxxxx CRT functions used the int parameter as an array index to a lookup table. Any unsigned character above 127 became negative as a signed character.

You have to typecast the signed char to unsigned to prevent a negative index on a lookup table.
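A minimal sketch of that cast in an in-place loop (the name upper_in_place_ctype is made up for illustration; the result depends on the current locale, which is the "C" locale unless the program calls setlocale()):

```c
#include <ctype.h>

/* Hypothetical helper: uppercases a writable string in place.
   The cast keeps bytes above 127 from reaching toupper() as
   negative values when plain char is signed. */
void upper_in_place_ctype(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)toupper((unsigned char)*s);
}
```

Passing a negative value other than EOF to toupper() is undefined behaviour, so the cast is worth keeping even on platforms where the uncast version appears to work.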
 
The point is that I would like to avoid the segmentation fault and buffer overrun.

It should compile with Gcc and also with Tcc, without warnings.

return ptr gave a segmentation fault anyhow.
Code:
char*
toucase_str(char *s)
{
    char *initial = s;
    for (; *s; ++s)
        *s = toupper(*s);
    return initial;
}
This gives warnings. I am not sure that everyone likes compiler warnings and errors.
 
Please provide a complete example (full source code), as well as compiler warnings.
 
About avoiding the segmentation fault and buffer overrun: if s points to read-only memory, it will always segfault, because you are trying to modify read-only memory. My example code does what you are asking in tou2(), now without the buffer overflow issue, which was a small facepalm moment on my side.
 
Indeed, your code with return initial and char *initial = s; works, and I am using it.

However, using a new char ptr[strlen(str)... causes a segmentation fault, even though the C compiler is happy.

This results in a segmentation fault:
Code:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

char *toucase_str_adapted(char *str)
{
    char *initial = str;
    char ptr[strlen(str)+1];
    int i, j;
    for(i=0; str[i]!='\0'; i++)
    {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
    }
    return ptr;
}



int main(void)
{
    char s[] = { "Hello, World!" };
    printf("%s\n", toucase_str_adapted(s));
    return 0;
}
 
How about...
C:
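#include <ctype.h>
#include <stdlib.h>
#include <string.h>

char*
toucase_str_copy(const char *s)
{
        char *copy = strdup(s); /* strdup is POSIX; ISO C only since C23 */
        if (copy) {
                char *p; /* s is const, so walk the copy with a writable pointer */
                for (p = copy; *p; ++p)
                        *p = toupper((unsigned char)*p);
        }
        return copy; /* remember to free() the copy! */
}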
 
In toucase_str_adapted, ptr is allocated on the stack (automatic storage, not the heap). That memory is only valid during the lifetime of this function call. You could imagine it as automatically malloc()'d on entry and free()'d on exit.

Once the function executes return, that memory is totally invalid and the 'ptr' you return now points to unallocated memory.

This is why _martin and my solutions allocate memory for the destination string; and why we say it has to be free()'d afterwards. Please go back and re-read what we have suggested. Hopefully you can spot the difference.
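If you would rather avoid malloc() altogether, another common C pattern is to let the caller supply the destination buffer and its size. This is not one of the solutions posted above, just a sketch under that assumption. Nothing has to be freed and nothing local is returned. upper_into is a made-up name; ASCII only:

```c
#include <stddef.h>

/* Hypothetical helper: writes an uppercase copy of src into dst.
   Returns 0 on success, or -1 if dst was too small (dst then holds
   a truncated but still '\0'-terminated result). */
int upper_into(char *dst, size_t dst_size, const char *src)
{
    size_t i;

    if (dst_size == 0)
        return -1;
    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= dst_size) {        /* keep one byte for the terminator */
            dst[i] = '\0';
            return -1;
        }
        dst[i] = (src[i] >= 'a' && src[i] <= 'z')
                     ? (char)(src[i] - ('a' - 'A'))
                     : src[i];
    }
    dst[i] = '\0';
    return 0;
}
```

A caller would declare something like char out[64]; call upper_into(out, sizeof out, s); and check the return value before using out. Here sizeof is correct because out is a real array in the caller.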
 
As for the segmentation fault: it depends on how much you'd like to understand the issue. You can see in the code above how various people do this. Your version may segfault, but not necessarily. That segfault may very well be due to the uninitialized j variable, which makes you index ptr way out of bounds.

To understand what's going on, you'll need to understand the layout of a running process in memory. Start with something easy, just an overview without too much detail. I have quoted this guy a few times, and I think his explanation is easy to grasp. He talks about Linux, but FreeBSD is not that different in this respect: anatomy of a program in memory.

I suggest you get familiar with gdb too. If you have a core dump of the segfaulted program, you can see where it failed and get a better understanding of why.
 