
C/C++ Low level str to uppercase?

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#1
Hello,

I would like to find a way to convert a string to uppercase in C (cross-platform, low-level, from scratch, operating on a char *mystring).
C++:
printf( "%s\n", toupper_alternative_lowlevel( "hello world" ));
I think I will probably do it by hand: read the string and convert 'a' to 'A' (chr 65), and so on up to 'z' to 'Z'. It is the long way, but at least it is very low-level and portable.

Would you happen to know of a better one?
 

leebrown66

Well-Known Member

Thanks: 102
Messages: 298

#2
That can get quite tricky depending on what exactly you are trying to do:
  • Not all character sets have a-z consecutively (EBCDIC comes to mind, for example i is character 137, j is 145).
  • Internationalization -- some languages have no upper/lower case at all.
  • Unicode -- not all characters occupy a single byte.
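To illustrate the first point, here is a minimal sketch of a conversion that does not assume the letters are consecutive: it looks each character up in a table of the 26 basic Latin letters. The names upper_basic_latin and str_upper_basic_latin are made up for this example, and the sketch deliberately ignores locales and multi-byte encodings.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: uppercase one of the 26 basic Latin letters by
 * table lookup, so it does not depend on the letters being contiguous
 * in the execution character set. Other characters pass through. */
char upper_basic_latin(char ch)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *hit;

    if (ch == '\0')              /* strchr() would match the terminator */
        return ch;
    hit = strchr(lower, ch);
    return hit ? upper[hit - lower] : ch;
}

/* Convert a writable, NUL-terminated string in place. */
void str_upper_basic_latin(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        *s = upper_basic_latin(*s);
}
```

The lookup costs a short linear search per character, so this trades speed for not depending on the character set layout.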
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#3
well, self explaining ...

C++:
char *touppercase(char *instr ) {
   int c = 0;
   while (instr[c] != '\0')
   {
      if  ( (instr[c] >= 'a') && (instr[c] <= 'z'))
         instr[c] = instr[c] - 32;
      c++;
   }
 
   return instr;   /* same pointer that was passed in; the string is modified in place */

}


or a simple pipeline may look like:
C++:
// low level

    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        int begin = 1;
        int c ; int d ;
        c = getchar();
        while( c != EOF )
        {
           if (( c >= 'a') && ( c <= 'z'))
              d = c - 32;
           else
              d = c;
           putchar( d );
           begin = 0;
           if ( c == '\n' ) begin = 1;
           c = getchar();
        }
    return 0;
    }
 

leebrown66

Well-Known Member

Thanks: 102
Messages: 298

#4
I would replace the hard-coded -32 with ('a' - 'A'). Other than that, I would say it is portable. It may not be the most efficient (I would use pointers), but it is more efficient than the standard libc toupper function, which is locale-aware and can handle split range sets.
 

ralphbsz

Daemon

Thanks: 609
Best answers: 3
Messages: 1,062

#5
If you assume only US-ASCII, your solution isn't half bad. I would not use array notation for strings; it is correct C, but not idiomatic for C-style string processing. Also, since you modify the string in place, why do you bother returning the pointer? The caller by definition has it already. The only reason to return the pointer is so that users can chain function calls, but I personally don't like that style, since you are relying on a side effect of the function, and there is no clear distinction whether the string is an input / output / I-O argument or a return value. You also don't need extra parentheses around the expressions in the if test; anyone who programs in C needs to know simple operator precedence well enough to read that statement.

I also removed your extra variable, which is not needed: You can modify the pointer that is passed in; the caller's copy will not be modified. Once that's done, I think that a for loop is clearer than a while loop, but that's a question of taste.

Another deep philosophical question is: Should you modify the string in place? I can understand why one would want to do that in C, because there copying strings involves memory management, and a huge amount of work for tracking allocation and freeing (which is usually handled wrong, which is why amateur-written C code tends to leak memory like the Titanic). But in general, it is cleaner if a function is a "pure" function, which gets constant arguments in and returns a new value. More about that below.
C++:
#include <assert.h>

void touppercase(char *instr) {
    for (; *instr != '\0'; instr++)
    {
        /* Twiddle (tilde) is the last ASCII character before internationalization kicks in */
        assert((unsigned char)*instr <= '~');
        if (*instr >= 'a' && *instr <= 'z')
            *instr -= 'a' - 'A';   /* lowercase to uppercase: subtract the offset */
    }
}
The moment you get internationalization or Unicode into the game, this becomes exponentially more difficult. I'm not even sure whether you can define what "capitalize" means in a locale-independent fashion; it is possible that the definition of what the uppercase character is might depend on language and country. Furthermore, there exist lowercase characters for which there is no uppercase equivalent. Until 6 weeks ago, the German "sharp ess" a.k.a. "es-zet" was one such example: it only existed in lowercase form, and when capitalizing a word, it was replaced by a double "s": "Straße" -> "STRASSE" (that word means "street"). See how the string length changes? Starting in 2018, the German organization that standardizes orthography defined a new character which is an uppercase es-zet, but I doubt that many Unicode implementations are ready for it. And I don't know whether the standards bodies for Austria, Switzerland, and other German-speaking places have followed that example yet. There are other examples of characters that behave strangely when capitalizing. But what this leads to: In Unicode, the length of a string may change when it is capitalized, so you are forced to allocate a new string and copy it (see above).
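The length change can be sketched in a few lines. This is only an illustration under loud assumptions: the input is taken to be ISO 8859-1 (Latin-1), where the sharp s is the single byte 0xDF, and only ASCII a-z plus that one byte are converted. The function name upper_latin1_sketch is made up; it is not a substitute for a real internationalization library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: returns a newly allocated uppercase copy of an
 * ISO 8859-1 string. Only ASCII a-z and the sharp s (byte 0xDF, which
 * becomes "SS") are converted; every other byte is copied unchanged.
 * The caller must free() the result. Returns NULL if malloc() fails. */
char *upper_latin1_sketch(const char *src)
{
    size_t extra = 0;
    const char *r;
    char *out, *w;

    assert(src != NULL);

    /* First pass: each sharp s needs one additional output byte. */
    for (r = src; *r != '\0'; ++r)
        if ((unsigned char)*r == 0xDF)
            ++extra;

    out = malloc(strlen(src) + extra + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy and convert. */
    for (r = src, w = out; *r != '\0'; ++r) {
        unsigned char ch = (unsigned char)*r;
        if (ch == 0xDF) {                    /* sharp s -> "SS" */
            *w++ = 'S';
            *w++ = 'S';
        } else if (ch >= 'a' && ch <= 'z') {
            *w++ = (char)(ch - ('a' - 'A'));
        } else {
            *w++ = (char)ch;
        }
    }
    *w = '\0';
    return out;
}
```

Because the output can be longer than the input, an in-place conversion is not possible here, which is why the sketch counts first and allocates afterwards.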

I think the correct answer to the question is: Anyone who tries to implement this today is crazy and stupid; one should instead find a good string library with internationalization, and call the correct function. Now, if you give this answer in a homework problem or during a job interview, and get in big trouble, please don't blame me.
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#6
Hello,

I have tried the following method to produce an uppercase copy of a string (in C).

I recently noticed that it eats up all my memory (as reported by "top" or "ps aux").

Some help would be greatly appreciated.

The problem is here:
Code:
     size_t siz = sizeof ptr ;
      char *r = malloc( sizeof ptr );
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );
Why?

Many thanks in advance !

Code:
////////////////////////////////////////////////////////////////////
char *strtouppercase( char *str )
{  
      char ptr[strlen(str)+1];
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      size_t siz = sizeof ptr ;
      char *r = malloc( sizeof ptr );
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );
}
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#9
Delete the line free( ptr );, since it will never be reached because it comes after the return statement, and besides this, it is wrong anyway, since ptr is allocated on the stack and not on the heap, so ptr would be freed implicitly.

In addition it is not necessary to convert the string on stack and later copy it to a dynamic memory allocation, you could convert it directly into the memory allocation.
C:
char *strtouppercase( char *str )
{
      char *ptr = malloc(strlen(str)+1);
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      return ptr;
}
PS: Perhaps you might want to impose a maximum length to circumvent cases where the passed string is not '\0' terminated.
C:
      size_t len = strlen(str);
      char *ptr = malloc(len <= 255 ? len + 1 : 256);
      ...
hello

the problem is that Tcc and Gcc tell me

warning function returns address of local variable
return ptr

I would be glad to have this without warning.... if possible and clean for the memory usage.
 

leebrown66

Well-Known Member

Thanks: 102
Messages: 298

#10
My assumption is that you want to allocate memory for the resulting string each time the function is called. The caller is expected to free the memory when it's finished, i.e.:
Code:
char *x = strtouppercase("some string");
printf("Here it is: %s\n", x);
free(x);
(typed without testing):
Code:
char *strtouppercase( char *str )
{  
      char *ptr = malloc(strlen(str)+1);
      int i,j=0;
      for(i=0; str[i]!='\0'; i++)
      {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
      }
      ptr[j]='\0';
      return ptr;
}
Your existing code is allocating memory for a pointer, pointing that to the temporary 'ptr' stack data, then returning that, which becomes undefined.
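If keeping track of malloc()/free() pairs is the concern, one alternative (not proposed earlier in the thread) is to let the caller supply the destination buffer, so the function allocates nothing. The sketch below assumes plain ASCII; the name upper_into and its return convention are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical alternative: write the uppercase form of src into the
 * caller-supplied buffer dst, which holds dstsize bytes.
 * Returns 0 on success, or -1 if dst is too small; in that case dst
 * holds a truncated but still NUL-terminated result (when dstsize > 0).
 * ASCII only. */
int upper_into(char *dst, size_t dstsize, const char *src)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    if (dstsize == 0)
        return -1;

    for (i = 0; src[i] != '\0'; ++i) {
        if (i + 1 >= dstsize) {          /* keep room for the terminator */
            dst[i] = '\0';
            return -1;
        }
        dst[i] = (src[i] >= 'a' && src[i] <= 'z')
                     ? (char)(src[i] - ('a' - 'A'))
                     : src[i];
    }
    dst[i] = '\0';
    return 0;
}
```

A caller would declare something like char buf[64]; and call upper_into(buf, sizeof buf, "some string"); nothing needs to be freed afterwards, at the cost of having to choose a buffer size up front.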
 

_martin

Aspiring Daemon

Thanks: 127
Messages: 686

#11
It depends on what you expect the function to do. If you stick to plain ASCII you could do:

C:
char* tou(char* str) {
        if (!str) return NULL;

        char* cp = str;
        while (*cp) {
                if (*cp >= 'a' && *cp <= 'z')
                        *cp++ &= 0xdf;
                else
                        cp++;
        }
        return str;
}
But this function expects you can write to whatever cp is pointing to, i.e. you can't do tou("test").

So you could return a new string instead:
C:
#include <string.h>
#include <stdlib.h>

char* tou2(char* cp) {
        if (!cp) return NULL;
        char* nbuf = (char*)malloc(strlen(cp)+1);

        if(!nbuf) return NULL;

        char* wp = nbuf;
        while(*cp) {
                *wp++ = (*cp >= 'a' && *cp <= 'z') ? (*cp++ & 0xdf) : *cp++;
        }
        *wp = 0;
        return nbuf;
}
But now you need to pay attention to the returned string, as you need to clean up after this function (i.e. you need to call free). If you use tou2("test") you're not keeping a reference to the returned string; you should be using char* s = tou2("test"); free(s)

I'm not that big of a fan of ?: operator. More often than not it makes the code less readable. But as you used it in the examples I did the same.

EDIT: fixing classic off-by-one BOF
 

unitrunker

Member

Thanks: 27
Messages: 53

#13
Code:
      size_t siz = sizeof ptr ;
The size of a pointer is not the same as the size of the addressed object. Consider:

Code:
char y[127];
char *x = y;
printf("%zu vs. %zu\n", sizeof x, sizeof y);   /* sizeof yields size_t, so use %zu */
 

Bobi B.

Active Member

Thanks: 51
Best answers: 2
Messages: 108

#14
What's wrong with toupper(3)? It is ISO C and part of the standard C library:
C:
#include <ctype.h>
#include <stdio.h>

char*
toucase_str(char *s)
{
    char *initial = s;
    for (; *s; ++s)
        *s = toupper(*s);
    return initial;
}

int
main(void)
{
    char s[] = { "Hello, World!" };
    printf("%s\n", toucase_str(s));
    return 0;
}
Edit, after taking into account unitrunker's remark: replace *s = toupper(*s); with *s = toupper((unsigned char)*s);.
 

unitrunker

Member

Thanks: 27
Messages: 53

#15
Perhaps no more, but that code was once problematic. Some isxxxx/toxxxx CRT functions used the int parameter as an array index to a lookup table. Any unsigned character above 127 became negative as a signed character.

You have to typecast the signed char to unsigned to prevent a negative index on a lookup table.
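A small sketch of that cast, wrapped in a helper so it cannot be forgotten. The function names upper_char and str_toupper_inplace are made up for this example; the only library call is the standard toupper() from <ctype.h>, whose argument must be representable as unsigned char (or be EOF).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical wrapper: convert through unsigned char so that a byte
 * above 127 never reaches toupper() as a negative int, even on
 * platforms where plain char is signed. */
char upper_char(char ch)
{
    return (char)toupper((unsigned char)ch);
}

/* Uppercase a writable, NUL-terminated string in place. */
void str_toupper_inplace(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        *s = upper_char(*s);
}
```

Which bytes above 127 actually change depends on the current locale; the cast only guarantees that the call itself is well defined.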
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#16
The point is that I would like to avoid segmentation faults and buffer overruns.

It should compile with GCC and also with TCC, without warnings.

return ptr gave a segmentation fault anyway.
Code:
char*
toucase_str(char *s)
{
    char *initial = s;
    for (; *s; ++s)
        *s = toupper(*s);
    return initial;
}
This gives warnings. I am not sure that everyone likes compiler warnings and errors.
 

Bobi B.

Active Member

Thanks: 51
Best answers: 2
Messages: 108

#17
Please provide a complete example (full source code), as well as compiler warnings.
 

_martin

Aspiring Daemon

Thanks: 127
Messages: 686

#18
The point is that I would like to avoid segmentation faults and buffer overruns.
But if s points to read-only memory it will always segfault, as you are trying to modify read-only memory. My example code does what you are asking in tou2(), now without the buffer overflow issue, which was a small facepalm moment on my side.
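To make the read-only point concrete, here is a minimal sketch of calling an in-place converter safely. It uses the same idea as tou() but with hypothetical names (upper_in_place, demo_writable); the key line is the array declaration, which gives the function a private, writable copy of the literal.

```c
#include <assert.h>
#include <stddef.h>

/* In-place ASCII uppercasing; s must point to writable memory. */
void upper_in_place(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - ('a' - 'A'));
}

/* Hypothetical demo: returns 1 if the conversion worked. */
int demo_writable(void)
{
    char writable[] = "test";   /* array: a modifiable copy of the literal */

    /* By contrast, `char *literal = "test";` points at the literal
     * itself; passing that to upper_in_place() is undefined behaviour
     * and typically crashes. */
    upper_in_place(writable);
    return writable[0] == 'T' && writable[3] == 'T';
}
```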
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#19
Indeed, your code with return initial and char *initial = s; works, and I am using it.

However, using a new char ptr[strlen(str)... causes a segmentation fault,
even though the C compiler is happy.

This results in a segmentation fault.
Code:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

char *toucase_str_adapted(char *str)
{
    char *initial = str;
    char ptr[strlen(str)+1];
    int i, j;
    for(i=0; str[i]!='\0'; i++)
    {
           if  ( (str[i] >= 'a') && (str[i] <= 'z'))
                ptr[j++] = str[i] - 32;
           else
              ptr[j++]=str[i];
    }
    return ptr;
}



int main(void)
{
    char s[] = { "Hello, World!" };
    printf("%s\n", toucase_str_adapted(s));
    return 0;
}
 

Bobi B.

Active Member

Thanks: 51
Best answers: 2
Messages: 108

#20
How about...
C:
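#include <ctype.h>
#include <string.h>

char*
toucase_str_copy(const char *s)
{
        char *copy = strdup(s);
        char *p;
        if (copy) {
                /* walk the writable copy; s is const and must not be written through */
                for (p = copy; *p; ++p)
                        *p = toupper((unsigned char)*p);
        }
        return copy; /* remember to free() the copy! */
}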
 

leebrown66

Well-Known Member

Thanks: 102
Messages: 298

#21
Indeed, your code with return initial and char *initial = s; works, and I am using it.

However, using a new char ptr[strlen(str)... causes a segmentation fault,
even though the C compiler is happy.

This results in a segmentation fault.
Code:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

char *toucase_str_adapted(char *str)
{
    char *initial = str;
    char ptr[strlen(str)+1];
ptr is allocated space on the stack (not the heap). That memory is only valid during the lifetime of this function. You could imagine it as automatically malloc()'d on entry and free()'d on exit.

Once the function executes return, that memory is totally invalid and the 'ptr' you return now points to unallocated memory.

This is why _martin and my solutions allocate memory for the destination string; and why we say it has to be free()'d afterwards. Please go back and re-read what we have suggested. Hopefully you can spot the difference.
 

_martin

Aspiring Daemon

Thanks: 127
Messages: 686

#23
This results in segmentation fault.
Now it depends how much you'd like to understand the issue. You see the code above from various people how to do this. It may segfault, but not necessarily. That segfault may very well be due to uninitialized j variable and you trying to index ptr way out of bounds.

To understand what's going on you'll need to understand the layout of the running process in the memory. Something easy, just an overview without too much detail. I quoted this guy few times and I think his explanation is easy to grasp. He talks about the Linux, but FreeBSD is not that different in this aspect: anatomy of a program in memory.

I suggest you get familiar with gdb too. If you have a core dump of the segfaulted program you can see where it failed and get a better understanding of why.
 

Spartrekus

Well-Known Member

Thanks: 41
Messages: 274

#24
How about...
C:
char*
toucase_str_copy(const char *s)
{
        char *copy = strdup(s);
        char *p;
        if (copy) {
                /* walk the writable copy; s is const and must not be written through */
                for (p = copy; *p; ++p)
                        *p = toupper((unsigned char)*p);
        }
        return copy; /* remember to free() the copy! */
}
Thank you so much all!

I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not a good choice and that strncpy is recommended instead. What do you think?

"/* remember to free() the copy! */" Why? Won't it be freed automatically after returning to main()? Is there no need for free(copy)?
 

p3rj

Member

Thanks: 24
Messages: 44

#25
strncpy works even if your string isn't NUL-terminated, provided you supply a valid length, and your target buffer is large enough. The bad thing about that is that the result of strncpy may also not be NUL-terminated, if no NUL occurs within the given length. A reasonable template for this situation could be
C:
char *s = malloc(len + 1);
if (s != 0) {
    strncpy(s, source, len);
    s[len] = 0;
    return s;
}
...
strdup conceptually is roughly the combination of malloc and strcpy, probably similar to
C:
char *s = malloc(strlen(source) + 1);
if (s != 0)
    strcpy(s, source);
return s;
If your source is terminated properly, this should work well (and avoids the risk of forgetting to allocate the extra space for the terminating NUL).
In any case, strdup allocates memory from the heap like malloc, so you must free the result later, or you leak that memory. Only memory allocated on the stack by declaring local variables is released on return from a function (concepts like alloca notwithstanding).
Perhaps the best alternative would be strndup, which is somewhat of a combination of the functions, but is required to always produce a properly terminated result (unless it returns NULL). The result should also be passed to free later.
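A minimal sketch of the strndup variant, assuming a POSIX.1-2008 system (strndup is not part of older ISO C standards). The helper name upper_copy_n and the idea of passing a maximum length are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* request strndup() on POSIX systems */
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated uppercase copy of at
 * most maxlen bytes of src. The result is always NUL-terminated.
 * The caller must free() it. Returns NULL if allocation fails. */
char *upper_copy_n(const char *src, size_t maxlen)
{
    char *copy, *p;

    assert(src != NULL);
    copy = strndup(src, maxlen);
    if (copy == NULL)
        return NULL;
    for (p = copy; *p != '\0'; ++p)
        *p = (char)toupper((unsigned char)*p);
    return copy;
}
```

Typical use would be char *u = upper_copy_n(input, 255); followed by a NULL check and, once done, free(u).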
 