Why use localization? Why not just use UTF-8 all the time?

ucsdboy · May 3, 2023

Hi guys,

I'm curious - I understand that some users would want to allow input for different languages and keyboard, and I understand why some users would want to have menu settings and feedback from FreeBSD in languages other than English. However I don't understand why all locales aren't UTF-8. In today's world, the ability to display directory contents and have files written in multiple languages display correctly seems like a no-brainer.

Is FreeBSD continuing to support legacy codepages just for supporting legacy's sake? Or is the savings between western codepages and UTF-8 really that substantial? I don't mean this as an attack or anything. I feel like I understand the localization issues only half-way and it's just been a question rolling around the back of my head for a while.

BEst,

C

msplsh · May 3, 2023

Comparison of Unicode encodings - Wikipedia

en.wikipedia.org

yuripv79 · May 3, 2023

ucsdboy said:
just for supporting legacy's sake

This. There were attempts at removing some of this legacy, and then we had to revert it:

Code:

commit 50502545ce6c0e748cfa965924b49611d0da14ae
Author: Baptiste Daroussin <bapt@FreeBSD.org>
Date:   Thu Apr 20 18:21:50 2017 +0000

    Readd Big5: some large databases setup are still requiring it.

    Reported by:    "張君天(Chun-Tien Chang)" <tcs@kitty.2y.idv.tw>

commit ad0b0cc237fff55859b32cf79d208aca10ff4343
Author: Baptiste Daroussin <bapt@FreeBSD.org>
Date:   Sun Mar 19 17:48:41 2017 +0000

    Prepare the removal of the zh_TW.Big5 encoding

It was a pretty big step with switching the default locale (default as in defined in login.conf) to C.UTF-8, and I'm pretty happy with it.

ralphbsz · May 3, 2023

ucsdboy said:
However I don't understand why all locales aren't UTF-8. In today's world, the ability to display directory contents and have files written in multiple languages display correctly seems like a no-brainer.

That is a laudable goal. For many purposes it would be a good idea. And it should go further: All uses of displayable strings on computers (from programming languages through APIs to file formats) should be Unicode, preferably UTF-8.

But getting there is a lot of work, as there is lots and lots of legacy software. Imagine how much work it would be to change all the prototypes on for example the C standard to distinguish between "arrays of chars" and "displayable strings that are encoded in Unicode". Some languages, such as Python, have gone that step (with the separation between the bytes and string types), but doing that for existing standards such as C and C++ is much harder.

And even that doesn't eliminate localization, as for example sort order depends on locale.

ucsdboy · May 3, 2023

Thanks guys! This is very enlightening!

Jose · May 3, 2023

Windows notably still uses UTF-16:

Working with Strings - Win32 apps

Working with Strings

learn.microsoft.com

But I agree with you, the character encoding problems we had in the '90s have been solved. The answer is use Unicode encoded with UTF-8.

UnixRocks · May 4, 2023

ucsdboy said:
However I don't understand why all locales aren't UTF-8.

Wouldn't that be great and so simple?

I use UTF-8 on all my own systems.

ralphbsz · May 4, 2023

Sadly, there is a variety of reasons to not use the UTF-8 encoding, or not to use Unicode at all. The biggest one is: A giant installed base of code that would have to be updated. This is likely the reason why Windows still uses 16-bit characters. But it also applies to Unix and Posix systems; for example, the few places where character strings transition between a Unix kernel and userspace (mostly uname and in particular everything related to file names in places like readdir() and open()), the encoding is simply unspecified, and changing that would take heroic effort.

Another reason is that UTF-8 can be inefficient in certain environments. For example, in Western Europe, 8859-1 saves a small amount of bytes compared to UTF-8, because the "common" special characters (French accented characters, German umlaut and sharp s) take only 1 byte as opposed to two. For some of the ideograph-based CJKV languages, a fixed 16-bit encoding is more efficient, again because two bytes can handle 2^16 or all or nearly all characters, while UTF-8 can only encode 2^11 (2048) characters in two bytes, the remainder needs 3 bytes.

I think it will take at least a few decades for UTF-8 to be universally accepted as the only encoding for text strings, and I hope that in the process, we abandon the C/C++ programming language and the Posix interface, as they are both holding back progress.

msplsh · May 4, 2023

more of these sorts of reasons in the wikipedia page I linked.

yuripv79 · May 4, 2023

ralphbsz said:
in particular everything related to file names in places like readdir() and open()

There is no reason to specify encoding here, it's file system that should do it, e.g. ZFS has an option to enforce UTF-8.

ralphbsz said:
This is likely the reason why Windows still uses 16-bit characters.

Windows uses UTF-16 (previously used UCS-2), so it's not valid example; and there's experimental support for UTF-8, so it's moving in this direction.

ralphbsz said:
we abandon the C/C++ programming language and the Posix interface, as they are both holding back progress

Could you please elaborate?

ralphbsz · May 4, 2023

yuripv79 said:
There is no reason to specify encoding here, it's file system that should do it, e.g. ZFS has an option to enforce UTF-8.

Correct, and that's a step forward. When all processes that use a file system have been moved to use UTF-8, then you can turn that option on, and it gives you some (albeit not perfect) protection against processes creating files or directories with names that are not valid UTF-8. But you have to remember that the Posix interface has very few restrictions on what file names can be: They have a maximum length (which is set per file system, not universally), and they can not contain the Nul character nor the slash character. All other binary bytes (form 0x01 to 0xFF excluding 0x2F = "/") are valid. Enforcing that all file names are valid UTF-8 immediately breaks Posix, and on an existing system, it may make certain files inaccessible.

But what's worse is that by design the file system has to be encoding agnostic: When you create a file or directory, an opaque string of bytes (not displayable characters) is passed into the file system. For example, the file system may see a file whose name is the two bytes 0xC0 0xB0. It does not know whether the file name is to be interpreted as the 8859-1 characters "A accent degree", or as a single UTF-8 character that I can't be bothered to look up, or whether the user intended the two bytes to be binary (non-displayable data). The reason for this is that the file system code (in the kernel) can not know what encoding the process (in user space) is using. Now imagine that this is a computer that is using a mix of user processes, some using 8859-1 and some UTF-8: Someone creates a file that has one name, and someone else sees a file with a different name. Once you start using non-normalized sequences, you can even have cases where two files have the same name in certain locales, which violates the invariant that no two files in the same directory have the same name. If you've ever tried this (and I have, since I used to implement file systems): it causes lots of utilities (like ls or tar) to behave very badly.

Windows uses UTF-16 (previously used UCS-2), so it's not valid example; and there's experimental support for UTF-8, so it's moving in this direction.

Cool, perhaps Windows can become completely UTF-8 eventually. That would be nice.

Could you please elaborate?

There are many reasons why I don't like C/C++ and Posix, and lack of support for distinguishing between "array of bytes, which are to be interpreted as binary" and "string of characters, which are encoded in a certain locale" is one of them. And that is exactly the case described above: The open system call has as an argument as "char * path". We have no idea whether those are binary bytes (which would be valid), or a character string which may have to be transcoded and which should have its normalization checked. Certainly, the C/C++ language would be capable of creating a new data type for "displayable string that is subject to encoding and locale rules", but to make that universally used, we'd have to change both the language standard, and the Posix standard that uses the old-style language bindings.

There are many other things I don't like. A few examples include hard links, sparse files, temporary files (link-count zero files), the ability to modify files in place, the dichotomy between read/write and mmap access, the various 2^64 limits, and so on.

yuripv79 · May 5, 2023

ralphbsz said:
But you have to remember that the Posix interface has very few restrictions on what file names can be: They have a maximum length (which is set per file system, not universally), and they can not contain the Nul character nor the slash character. All other binary bytes (form 0x01 to 0xFF excluding 0x2F = "/") are valid. Enforcing that all file names are valid UTF-8 immediately breaks Posix, and on an existing system, it may make certain files inaccessible.

Let's see what POSIX really says:

3.170 Filename

A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to name a file. The bytes composing the name shall not contain the <NUL> or <slash> characters. In the context of a pathname, each filename shall be followed by a <slash> or a <NUL> character; elsewhere, a filename followed by a <NUL> character forms a string (but not necessarily a character string). The filenames dot and dot-dot have special meaning.

So how UTF-8 is going to "break" POSIX? Filename is still a sequence of bytes, and will not contain NUL or "/". (Note that you need to set the utf8 option when *creating* dataset in ZFS.)

ralphbsz said:
When all processes that use a file system have been moved to use UTF-8

"Processes" do not need to be moved, they just need to respect the locale set for them (and that includes the encoding).

ralphbsz said:
For example, the file system may see a file whose name is the two bytes 0xC0 0xB0. It does not know whether the file name is to be interpreted as the 8859-1 characters "A accent degree", or as a single UTF-8 character that I can't be bothered to look up, or whether the user intended the two bytes to be binary (non-displayable data). The reason for this is that the file system code (in the kernel) can not know what encoding the process (in user space) is using.

That's why there's utf8 option in ZFS, other FS should follow.

ralphbsz said:
Now imagine that this is a computer that is using a mix of user processes, some using 8859-1 and some UTF-8: Someone creates a file that has one name, and someone else sees a file with a different name.

As above, make "processes" respect the locale set for them.

ralphbsz said:
Once you start using non-normalized sequences, you can even have cases where two files have the same name in certain locales, which violates the invariant that no two files in the same directory have the same name. If you've ever tried this (and I have, since I used to implement file systems): it causes lots of utilities (like ls or tar) to behave very badly.

Thanks for sharing your experience, good thing that ZFS is properly implemented to use one (you can specify which) of the normalization forms when utf8 is on. And yes, I have not only "ever tried this", but also fixed various case-sensitivity/normalization/interoperability issues between ZFS and SMB, the latter being case-insensitive but case-preserving.

ralphbsz · May 5, 2023

yuripv79 said:
So how UTF-8 is going to "break" POSIX?

Using UTF-8 file names does not break Posix. But when the file system enforces that all filenames have to be valid UTF-8, that breaks Posix conformance. For example, today I can use nearly arbitrary binary data in a path (as long as I stay away from Nul and slash). Posix explicitly allows that. The moment I turn on "path must be valid UTF-8" in ZFS, there are certain binary combinations that should work, but ZFS will give me an error.

The other surprising thing is turning normalization on: I create a file with its name being non-normalized UTF-8, for example the combination "Overprinting acute accent" and "Capital letter A" (the names are from memory). Think of those as a sequence of a few binary bytes. In traditional systems, if I do a readdir() a moment after doing the creat(), I expect to find a file that has exactly the same binary bytes as the file name. If I turn the "Unicode normalization option" on, then I will find a file with a different name, which is "Capital letter A with acute accent". While both strings render the same, and are indeed the same character under Unicode normalization rules, the binary interpretation of the data is different. Note that Posix does not have a strict rule that explicitly prohibits the file system silently renaming files under the guise of normalization, but it does violate expectations.

The contradiction here is the following: Today's Posix is simply silent on whether the path name is binary data or printable character strings. If the user chooses to treat it as binary data, but the file system (inside the kernel) treats it as normalizable and checkable Unicode strings, then we get a mismatch.

To be clear: In my opinion, all strings should be interpreted as Unicode and encoded in UTF-8. And the fact that file names can be arbitrary binary gibberish has always been a bad idea; I would love to have that restricted to be only unambiguous and displayable characters (so no control-H in file names). But given the tradition that Posix stands on, that was not achievable in the standard, since there have always been people who used binary non-displayable data in file names. So both normalization and Unicode-enforcement in ZFS are good ideas, but they are also somewhat incompatible and can be troubling.

yuripv79 · May 5, 2023

ralphbsz said:
Using UTF-8 file names does not break Posix. But when the file system enforces that all filenames have to be valid UTF-8, that breaks Posix conformance. For example, today I can use nearly arbitrary binary data in a path (as long as I stay away from Nul and slash). Posix explicitly allows that. The moment I turn on "path must be valid UTF-8" in ZFS, there are certain binary combinations that should work, but ZFS will give me an error.

The opengroup text I cited does not say anything about allowed characters, only the ones that can not be used; if conformance tests *require* 0-255 range (except for NUL and slash) to be accepted in filename, it contradicts the definition.

ralphbsz said:
it does violate expectations

It shouldn't as you are required to understand the implications of turning the UTF-8 support on.

And yes, everything else is our (mis)understanding of what POSIX tries to define (and what it doesn't). If someone wants/needs to use binary data in filenames, they can use whatever FS provides it for them; there are always cases for not using UTF-8, but it's not an issue with UTF-8 itself, which was the original question, and which this discussion has moved really far away from, and not an issue with codebase/libs/syscalls/...

Why use localization? Why not just use UTF-8 all the time?

3.170 Filename​

3.170 Filename