C/C++ Low level str to uppercase?

Bobi B.

Active Member

Thanks: 74
Messages: 165

#26
I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not that good to use, and that strncpy is better and recommended instead. What do you think?
I cannot speak for your friend; better ask him. However, the two have really different purposes: strdup(3) will allocate a heap buffer and return a copy of the given string (transferring ownership of the new pointer to you, hence the necessity of free(3)), whereas strcpy(3) (or strncpy(3)) will copy the string into a buffer you provide; that can be a stack buffer, a heap buffer, memory-mapped memory, etc.
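
A minimal sketch of that difference (just an illustration, not code from the example above):
C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
        const char *s = "hello";

        /* strdup(3): allocates a new heap buffer; you own it and must free(3) it */
        char *heap_copy = strdup(s);
        if (heap_copy != NULL) {
                printf("%s\n", heap_copy);
                free(heap_copy);
        }

        /* strncpy(3): copies into a buffer you already own (here, on the stack) */
        char stack_copy[16];
        strncpy(stack_copy, s, sizeof(stack_copy) - 1);
        stack_copy[sizeof(stack_copy) - 1] = '\0'; /* strncpy(3) does not always NUL-terminate */
        printf("%s\n", stack_copy);
        return 0;
}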

"/* remember to free() the copy! */" why? It will be free after returning to main() automatically, no? no need of free(copy) ?
Your statement is correct -- the OS will free the process's memory upon exit; omitting free(3) is perfectly fine for a small, quick-running program. But consider a library, whose use you cannot foresee, or a daemon, which runs for days, weeks or months. Better not get used to being lazy :)
 

leebrown66

Well-Known Member

Thanks: 102
Messages: 300

#27
"/* remember to free() the copy! */" why? It will be free after returning to main() automatically, no? no need of free(copy) ?
It will not be freed after returning from main(). During the lifetime of main() it will not be freed (which is the point).

When main() finishes and returns, then yes, all the memory the running program used is returned to the OS.
 

ralphbsz

Daemon

Thanks: 687
Messages: 1,158

#28
I have a question concerning "char *copy = strdup(s);" in the above example.
A friend told me that strdup is not that good to use, and that strncpy is better and recommended instead. What do you think?
The problem with strdup() is that the caller might give you an arbitrarily long string, perhaps something that is not actually null-terminated. This is the fundamental problem of "buffer overrun attacks": A program expects a string of a sensible length, perhaps of a maximum length, but the caller gives it a much longer string, perhaps an unterminated one. It is obviously a source of bugs; but it can also be much worse, and be a source of hack attacks.

In the case of strdup(), buffer overruns are actually not all that dangerous: strdup() will measure the length of the string, and if it doesn't happen to be properly null-terminated, it will eventually find some zeros in memory. Then it will allocate sufficient memory (or fail if we're out of memory), and do the copy. If the attacker gives you a ridiculously long or unterminated string, it will just use a lot of memory temporarily.

But in general, a well-written program that manages strings needs to think about what the longest possible input can be, and then enforce that consistently. This means using the strnXXX versions of all functions, and dealing with the fact that many strings will not be null-terminated. It also requires coming up with a plan for what to do when the string is longer than acceptable: shorten it? Give an error message?
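
As a sketch of what that can look like (the cap and the refuse-instead-of-shorten policy are choices made up for this example):
Code:
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NAME_MAX_LEN 64   /* arbitrary maximum picked for this example */

/* Hypothetical helper: refuse over-long input instead of silently shortening it. */
char *
copy_name_checked(const char *s)
{
        /* strnlen(3) reads at most NAME_MAX_LEN + 1 bytes, even if s is unterminated */
        size_t len = strnlen(s, NAME_MAX_LEN + 1);
        if (len > NAME_MAX_LEN) {
                errno = EINVAL;           /* the "give an error message" policy */
                return NULL;
        }
        return strndup(s, len);           /* caller owns the copy and must free() it */
}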

Why? It will be freed automatically after returning from main(), no? No need for free(copy)?

No. There is no automatic free(). Unless you want to leak memory, for every successful malloc() there has to be a free(). Things that are allocated with malloc() are not automatic variables, which go out of scope at the end of a block (typically at the end of a function) and get destroyed.

As an example, you had the following block of code above:
Code:
      char *r = malloc( ... something not important ...);
      return r ? memcpy(r, ptr, siz ) : NULL;
      free( ptr );
This code is a memory leak. Why? Because you first malloc() something, then return, and because you have already returned, you never get to the free() line. (In addition, this code has a zillion other problems, like using the wrong size and freeing the wrong variable, but those are minor.) If you write something like this, either the caller has to supply a buffer, or the caller has to understand that the memory was allocated for them and needs to be free'd (like with strdup()).
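
For comparison, here is a sketch of the same idea written with a clear ownership contract (the function name is made up; this is not the code quoted above):
Code:
#include <stdlib.h>
#include <string.h>

/* Contract: returns a malloc'ed copy of the first siz bytes of ptr;
 * the caller must free() it, just like with strdup(). */
void *
copy_bytes(const void *ptr, size_t siz)
{
        void *r = malloc(siz);
        if (r == NULL)
                return NULL;              /* out of memory: nothing was allocated, nothing to free */
        return memcpy(r, ptr, siz);       /* no free() here; ownership passes to the caller */
}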

The requirement that everything that is malloc'ed also has to be free'd is the main difficulty in programming in C or traditional C++ (which has exactly the same problem with new and delete). And it causes particular difficulty in string handling, because there one frequently has to create new strings of different lengths. This is the single most important reason that modern programming languages have taken the management of memory allocation away from the programmer: it is just pointless extra work and leads to bugs.
 

Bobi B.

Active Member

Thanks: 74
Messages: 165

#29
Well, if we go this way, then what about unlink("/lib/libc.so.7");? Let's avoid unlink(2), because it is dangerous? An API/standard-library function has a purpose. The developer is the One to use it the Right way. strdup(3) will allocate memory for, and make a copy of, the given string. Can you pass it a non-null-terminated string? Yes. What will happen? Either a segmentation fault, or it will "succeed".

Can you delete a system-critical file with rm(1)? Yes (I have, on several occasions). Well, it's just that rm does what you asked for and what it is designed for, not what you want/meant. The same goes for APIs.
 
OP

Spartrekus

Well-Known Member

Thanks: 45
Messages: 292

#30
Hi, first of all, thanks very much for opening the discussion and also for your amazing posts.

The memory leak was really bad; I am sorry that I came up with such a solution. :(

Well, to tell the truth, I have an issue with toupper() because of special chars.
Checking 'a' to 'z' and turning it into 'A' to 'Z' works fine.
I do believe that we should not use toupper(), but do it by hand -- for learning and, really, for full control.
Ideally, portable programs would need only stdlib, stdio, and string on any machine.

But well... here it is:
É and é, È and è, Ä and its lowercase form, üöä, ... ÖÄÜ, ... those chars need more "space".
I am not so sure then that strdup is good, because it gives an identical amount of space, but sometimes we
need more space for the chars.

That is why maybe allocating a double-size "space" might leave room for the UTF-8 special chars...
 

Bobi B.

Active Member

Thanks: 74
Messages: 165

#31
Then let's dig deeper. There is this thing called a code page that defines mappings between characters and bytes (like 0x30 <=> '0' (zero), 0x41 <=> 'A' (capital Latin A), 0xc0 <=> 'А' (capital Cyrillic A) in code page 1251, 0xc0 <=> 'Ŕ' in code page 1250). Then you might be able to encode each character in a single byte, or you might need multiple bytes to encode a single character (there are lots of characters in Chinese or Japanese, for example). UTF-8 provides a universal way to encode characters that doesn't need a mapping table.

Then there is your program and its internal representation of text strings. You can use the char type, which is usually a signed 8-bit type. There is wchar_t, which is 16-bit unsigned on some platforms (like Windows) or 32-bit unsigned on others (like Unixes); read multibyte(3). Using chars with UTF-8 is fine in most cases, but you should be aware that a byte is not necessarily the same as a character, and vice versa (the code above ignores that). In other words, when looping over a buffer with UTF-8-encoded data you should be parsing bytes to extract characters. Or you simply use an array of wchar_t instead, and manipulate strings with the wide-character versions of the respective functions.
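
As a small illustration of the byte-vs-character point, before the wide-character version below (a sketch; the string and buffer size are made up, and it assumes both the source file and the locale use UTF-8):
C:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
        setlocale(LC_ALL, "");      /* use the locale from the environment */

        const char *utf8 = "Äpfel"; /* 5 characters, but 'Ä' takes 2 bytes in UTF-8 */
        printf("bytes: %zu\n", strlen(utf8));

        /* mbstowcs(3) parses the multibyte bytes and extracts wide characters */
        wchar_t wide[16];
        size_t nchars = mbstowcs(wide, utf8, sizeof(wide) / sizeof(wide[0]));
        if (nchars != (size_t)-1)
                printf("characters: %zu\n", nchars);
        return 0;
}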

C:
/* call with toucase_wstr_copy(L"Wide-character string constant"); see wprintf(3) */
/* needs <wchar.h> for wcsdup(3) and <wctype.h> for towupper(3) */
wchar_t *
toucase_wstr_copy(const wchar_t *s)
{
        wchar_t *copy = wcsdup(s);
        if (copy) {
                for (wchar_t *p = copy; *p; ++p) /* write through a non-const pointer */
                        *p = towupper(*p);
        }
        return copy; /* remember to free() the copy! */
}
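
A possible way to call it could look like this (a sketch: it assumes the function above is in scope, and setlocale(3) is needed so wprintf(3) can output non-ASCII characters):
C:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* assumes toucase_wstr_copy() from above is defined in this file */

int
main(void)
{
        setlocale(LC_ALL, "");  /* pick up the environment's locale, e.g. a UTF-8 one */

        wchar_t *up = toucase_wstr_copy(L"déjà vu");
        if (up != NULL) {
                wprintf(L"%ls\n", up); /* prints DÉJÀ VU in a suitable locale */
                free(up);              /* we own the copy made by wcsdup(3) */
        }
        return 0;
}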
PS: BTW, there is even more. There are locales: in some languages, like German, 'ß' (lowercase) translates to 'SS' in uppercase. Letters in the alphabet are not always ordered the same way. i18n is hard :)
 

ralphbsz

Daemon

Thanks: 687
Messages: 1,158

#32
strdup(3) will allocate memory for, and make a copy of, the given string. Can you pass it a non-null-terminated string? Yes. What will happen? Either a segmentation fault, or it will "succeed".
And if "you" (meaning the routine that uses strdup) doesn't bother to check that the string is sensible, then strdup() will do something insensible. Which is what I said above: If you want to write software that pleases the user even when used in edge cases, and is not vulnerable to doing random crazy things when presented with unsensible inputs, then using functions such as strdup() is dangerous (except in those cases where it is safe). An easy way to deal with string inputs is to clearly define maximum lengths, and then use functions such as strndup(). Another way is to handle arbitrary length strings correctly, which means handling out of memory conditions in all cases. Either way, coding string handling in C is a lot of work to get correct.
 

ralphbsz

Daemon

Thanks: 687
Messages: 1,158

#33
I do believe that we should not use toupper(), but do it by hand -- for learning and, really, for full control.
Ideally, portable programs would need only stdlib, stdio, and string on any machine.
For learning programming, I understand your argument. As a student of software engineering, you need to know (using this silly example of uppercasing a string) how to do it in the 7-bit ASCII character set, and you need to be able to get that perfect, without any bugs.

But as a professional software engineer, it is insane if you have to rewrite simple functions (like toupper()) every time you need them. Or it is just as insane that every engineer ends up having his own library of such functions. There are two reasons for that. First, getting them done right is actually very difficult. We have discussed some of the problems of internationalization and character sets here, and writing a toupper() routine that can handle Unicode in its full glory, including locale-dependency, is an enormous amount of work. And for a person who doesn't understand Unicode completely, it is likely to lead to very buggy software. For example, if I were to try to write it, I might get it to work correctly for English and German (the two languages I write regularly), and forget that there are some exceptions already in Swiss German (I think it uses a slightly different character set, at least Swiss keyboards have different keys on them). There is no way I could get the accents in French right, since I didn't learn French in school. As an extreme example: there is a textbook on how to handle CJKV (Chinese Japanese Korean Vietnamese) character sets, written by a guy I know a little bit, and the book is about 2 inches thick. I would never attempt to write toupper() for CJKV without first reading the whole book cover to cover, and then sitting down with the author for a few hours so he can critique my plan. If I tried to write toupper() for CJKV given my current knowledge, I would end up with a completely broken disaster.

The second argument against writing these kinds of things yourself is simply efficiency. Doing them is hard and takes a lot of work. By using canned libraries of high-quality software, you can do your real work faster and more productively. As a software engineer, you want to solve interesting, complex problems, and you get paid to solve those problems quickly and correctly. You don't get paid to implement toupper() again when a very high-quality version is already available.

Old joke, from 30 years ago when using computers to run a business was still difficult and required lots of programming: A medium-size business has decided to go away from doing all accounting by hand on paper. So they buy a computer, and hire a programmer, and ask the programmer to write them billing, accounts payable, and accounts receivable. He says that it will take about 3 years, one year for each (in 1980 that was actually a very reasonable estimate). So after a year and a half, at the halfway point of the development, the business owner decides to check in with the programmer, and asks him how the progress is going. The answer is: He's already done writing an editor, and is now working on the Great American Compiler.

What is the moral of this story? We can't build all our own tools. In order to be productive, and have fun with computers, we need to use tools and components that are engineered by others. Now that doesn't mean that we can ignore learning how these tools work. Yes, starting programmers have to learn how to write a simple version of toupper() that works for ASCII only. And computer science students will spend one semester in a compiler class, and they will probably write a tiny compiler for a silly little language as a project. But if they need to develop "accounts receivable", they don't start with just stdlib, stdio and string; they use pre-made things like web servers, scripting languages, databases, report generators, and so on.
 
OP

Spartrekus

Well-Known Member

Thanks: 45
Messages: 292

#34
> There are two possibilities: open source or closed source:
1) If you use Unix or BSD, you need to get your hands into C / C++ and programming.
You are free to work, study and grow your skills anywhere and anytime, on any machine available in the world.
In 10 years, your code and work will still work.
The most important thing is the algorithms, which will keep working over time! This is gold.

2) If not, you use MS Windows like everyone else.
Excel, Office, ... are made for everyone.
You may lock yourself into closed-source software forever.
Fine then, go use C#, Windows and Qt. ;)
In 10 years, your code and work will likely no longer work.
It is highly efficient, simple to use, reliable, and excellent software.
The best choice, really!
(Considering the long term, this is pure waste.)
Take Visual Basic for instance, or .NET, ... or take Java?
doc, docx, ppt, pptx, and their lack of compatibility: documents look different depending on the installation.
Better to use PDF, OK, but ...
These programs are of course excellent, but they are not open source.
But no one forces you to learn how to "smoke" (closed source and dependency).


> There may be two points (or more) about programming to be considered:
1) Using libraries is good and saves time, but it can be an issue for portability and later use.
2) Learning the basics gives you full freedom, and that is what brings success.
For example, many innovative companies rely on building their own new production lines, machines and simulators. It takes millions in investment, but the payoff is enormous. ($)

> Major focus: Education versus Money?
- It is important to focus on knowledge and the education of our society, step by step, without considering money, just education. Having a full understanding of how things work, and the ability to modify them, is good. Having basic source code lets you build steadily more complex programs.
- Publication of source code in various fields (chemistry, maths, physics, ... computer science) is good.
- Using libraries may help, but it does not always give deep understanding; it depends on your interests and the situation. The gaming/software industries live off fast implementation. Libraries are there to let you create good software in a limited time. Highly efficient.
- It depends on each person's interests and needs. Open source, closed source, libraries or not, ... it is up to you. You are free.
 