UFS: Characters allowed in UFS filenames

Then the graphical software lacks UTF-8. I thought that came with all fonts already.
If you compile from ports, a truckload of UTF-8 support gets pulled in as a dependency. Heck, even # pkg install will pull in something.

However, there's a difference between keyboard layouts and character encodings. Here's an example that actually gave me trouble when I was in college - I still remember the days of 7-bit KOIx-R encoding. Reading Cyrillic characters in an email was an unreliable proposition at best. Unicode did not support Cyrillic that well, ASCII was too limited, and so was UTF-7. I had just one keyboard layout (US) to work from, but about 30 different character encodings to play with, and it was a guessing game which one would actually display the characters properly on the other end. The worst part was that even if I had had a Russian keyboard at my disposal at the time, with the proper keyboard layout loaded to match the keys, I would still have had to play the guessing game with character encodings.

Too tired for research and linking tonight, but I hope I made some sense.
 
Implicit in the path is the root directory, which doesn't really have a name (it is zero-length) and which can always (and only!) be found at the path "/".
In their original paper, Ritchie and Thompson said "As another limiting case, the null file name refers to the current directory."
That included the root.
I'm not sure how many Unix variants have preserved the original intent.
I do recall Dennis Ritchie posting to one of the Usenet groups, white-hot angry that the AT&T Unix Support Group (USG) had made code changes in pathname parsing that returned ENOENT for a null file name (instead of the inode number of the current working directory).
 
If you compile from ports, a truckload of UTF-8 support gets pulled in as a dependency. Heck, even # pkg install will pull in something.

However, there's a difference between keyboard layouts and character encodings. Here's an example that actually gave me trouble when I was in college - I still remember the days of 7-bit KOIx-R encoding. Reading Cyrillic characters in an email was an unreliable proposition at best. Unicode did not support Cyrillic that well, ASCII was too limited, and so was UTF-7. I had just one keyboard layout (US) to work from, but about 30 different character encodings to play with, and it was a guessing game which one would actually display the characters properly on the other end. The worst part was that even if I had had a Russian keyboard at my disposal at the time, with the proper keyboard layout loaded to match the keys, I would still have had to play the guessing game with character encodings.

Too tired for research and linking tonight, but I hope I made some sense.
It's a monster of a problem all around. Here's another wrinkle: in the HTML/PHP world, you need to use UTF-8, but in ECMAScript/JavaScript land, only UTF-16 encodings are usable. Two completely different encodings for Unicode, and as ralphbsz pointed out, new Unicode characters are being added continually, on a rapid-fire basis.
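To make that concrete: the same character really does come out as completely different bytes in the two encodings. A quick sketch with iconv(1) and xxd(1) (octal escapes so any POSIX printf will do):
Code:
# U+1F600 (grinning face) encoded as UTF-8: four bytes
$ printf '\360\237\230\200' | xxd -p
f09f9880
# the very same code point re-encoded as UTF-16BE: a surrogate pair
$ printf '\360\237\230\200' | iconv -f UTF-8 -t UTF-16BE | xxd -p
d83dde00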
 
Anecdote: As I hinted at above, I once (with a colleague) for fun created a file system where files DID contain slashes in their names, and where the creat() system call was capable of creating a file named "a/b". It caused lots of very funny crashes and misunderstandings. The other fun thing to do is to implement file systems that do Unicode translation. Say for example you have a single directory containing three files, called "à", "á" and "ä" (lowercase a with grave accent, acute accent, and diaeresis), but you're rendering it in an environment where users can't display anything other than US-ASCII. When you do an "ls", the user will see three files, all just called "a". That will make the user's head hurt. When they try to open the file called "a", should they get one of the three, or should we refuse because it doesn't exist? Either way, the user will be angry. And if they try to create file "a", should we refuse to do that (it's unsafe because ls showed three files called "a" already), or should we let them do it? Much fun, but fortunately, this problem doesn't exist in most real-world situations.
I can create files with these letters no problem and display them. The problem is if I don't have them on my keyboard directly. Maybe I could try to view hex of the name to see what's going on:
Code:
$ touch à á ä
$ ls
à    á    ä
$ find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "$1" | xxd -pu )"' None {} \;               
./à       2e2fc3a0
./á       2e2fc3a1
./ä       2e2fc3a4
 
It's a monster of a problem all around. Here's another wrinkle: in the HTML/PHP world, you need to use UTF-8, but in ECMAScript/JavaScript land, only UTF-16 encodings are usable. Two completely different encodings for Unicode, and as ralphbsz pointed out, new Unicode characters are being added continually, on a rapid-fire basis.
Yeah, of course, emojis are added in countless numbers each year. If they keep that pace they will fill all of UTF-8 with emojis and we will need a wider character encoding. Kind of like what happened with IPv4, and IPv6 was born.
The newest ones aren't even displayed in my Google Chrome Canary; I only see them because of the PNG representation: 🫡 New Emojis in 2021-2022
 
In their original paper, Ritchie and Thompson said "As another limiting case, the null file name refers to the current directory."
That included the root.
I'm not sure how many Unix variants have preserved the original intent.
I do recall Dennis Ritchie posting to one of the Usenet groups, white-hot angry that the AT&T Unix Support Group (USG) had made code changes in pathname parsing that returned ENOENT for a null file name (instead of the inode number of the current working directory).
Fascinating, didn't know that. Sounds like a funny but nasty hack.
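Out of curiosity I tried the degenerate case just now; on the systems I have handy the historical behaviour is indeed gone, and an empty pathname simply fails with ENOENT (the exact error text varies by utility and OS):
Code:
$ cat ""
cat: : No such file or directory
$ ls ""
ls: : No such file or directory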
 
I can create files with these letters no problem and display them. The problem is if I don't have them on my keyboard directly. Maybe I could try to view hex of the name to see what's going on:
Code:
$ touch à á ä
$ ls
à    á    ä
$ find . -maxdepth 1 -type f -exec sh -c 'printf "%-10s %s\n" "$1" "$(printf "$1" | xxd -pu )"' None {} \;             
./à       2e2fc3a0
./á       2e2fc3a1
./ä       2e2fc3a4
Isn't that called the 'Compose' key?

But in all honesty, we do seem to have a tendency to go off-topic in these forums... the original question was about limitations of UFS, and now we've all taken off on a tangent about character encodings, simply because it is a step in displaying the filename on the screen. Sure, you want that done right, in compliance with international standards, and consistently.

Support for Asian fonts is a piece of work, though - there's a truckload of extra stuff to download and compile. But if an OS is to claim support for any given language - somebody does need to put in the effort to line it all up.
 
Isn't that called the 'Compose' key?

But in all honesty, we do seem to have a tendency to go off-topic in these forums... the original question was about limitations of UFS, and now we've all taken off on a tangent about character encodings, simply because it is a step in displaying the filename on the screen. Sure, you want that done right, in compliance with international standards, and consistently.

Support for Asian fonts is a piece of work, though - there's a truckload of extra stuff to download and compile. But if an OS is to claim support for any given language - somebody does need to put in the effort to line it all up.
Yes, it's the right Alt or AltGr key, depending on layout, but it doesn't come configured by default in most cases. If you don't have the keys on your keyboard, you can print the symbol from the terminal:
Code:
$ printf '\303\275'
ý
$ printf '\u00fb'
û
$ printf '\x4a'
J
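And if you want a proper Compose key under X rather than escape sequences, something along these lines should do it (compose:ralt is the stock xkeyboard-config option; pick a different key if you prefer):
Code:
# map the right Alt key to Compose for the current X session
$ setxkbmap -option compose:ralt
# then, for example, Compose ' e produces é and Compose " a produces ä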
 
Yeah, of course, emojis are added in countless numbers each year. If they keep that pace they will fill all of UTF-8 with emojis and we will need a wider character encoding. Kind of like what happened with IPv4, and IPv6 was born.
The newest ones aren't even displayed in my Google Chrome Canary; I only see them because of the PNG representation: 🫡 New Emojis in 2021-2022
I have a hard time buying the idea that emojis are the culprit in UTF-8 getting exhausted... emojis are something that can easily be expressed in ASCII or just about any of the older character encodings. An emoji being translated into a graphic is an example of interpreter translation: a ;) will look different depending on which web page you see it on.

Asian characters (Kanji, hiragana, katakana, etc.) are numerous. Most native speakers are only familiar with a subset of them. I vaguely recall reading in Firefox's Pocket articles that an average Japanese speaker is familiar with a few hundred characters, and that it would take a PhD equivalent to know more than a thousand. That doesn't fit on any keyboard on the market. You kind of have to agree on a usable subset for a keyboard, and then create keymaps and encodings from that.
 
Yeah, of course, emojis are added in countless numbers each year. If they keep that pace they will fill all of UTF-8 with emojis and we will need a wider character encoding.
Unicode is currently still limited to code point U+10FFFF, which requires 4 bytes in UTF-8 (F4 8F BF BF). As you can see in this table, the UTF-8 encoding algorithm actually supports up to 6 encoded bytes with the final possible code point being U+7FFFFFFF (FD BF BF BF BF BF). That means UTF-8 can encode up to 2147483648 code points.* UTF-8 has plenty of room to breathe, especially since some code points aren't even assigned.

If anything is going to break, it will be the UTF-16 encodings because UTF-16 "surrogate pairs" only support up to U+10FFFF (DBFF DFFF in UTF-16BE). Considering Windows, NTFS, HFS+, and so much more use UTF-16 internally, this would be a serious problem for big names like Microsoft and Apple, so it's likely that a solution would be found before this ever actually occurred. Maybe Windows 2038 will finally switch to UTF-8 internally?

* I'll be thoroughly disgusted with humanity if we reach 2 billion/milliard code points in my lifetime 🤢
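Those limits are easy to poke at from a shell, assuming a printf that understands \U escapes (bash's builtin does), a UTF-8 locale, and iconv(1)/xxd(1) at hand:
Code:
# the current Unicode ceiling, U+10FFFF, encoded as UTF-8
$ printf '\U0010FFFF' | xxd -p
f48fbfbf
# the same code point as UTF-16BE: the last possible surrogate pair
$ printf '\U0010FFFF' | iconv -f UTF-8 -t UTF-16BE | xxd -p
dbffdfff
# code points reachable by the original 6-byte UTF-8 scheme: 2^31
$ echo $((1 << 31))
2147483648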
 
I can create files with these letters no problem and display them.
That's because you are using a file system that is encoding-agnostic: UFS and ZFS don't care whether the user process creating a file/directory or reading a directory is running in iso-8859-1, UTF-8, or any other encoding. To some extent that's good: No funny problems like I described. To some extent it is bad: Set one window to iso-8859-1, create a file. Set another window to UTF-8, and do an ls. You get what looks like the wrong file name ... it is just a display issue. For that reason, some industrial-strength file systems that are designed for multi-computer / multi-OS / multi-domain use are capable of transcoding file names. That sounds good, until you get into the problems of file names that can't be transcoded, or that become ambiguous. Oops.
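You can see that display issue without even switching locales back and forth. A quick sketch from a UTF-8 terminal (ls output may differ slightly between systems):
Code:
# 0xE4 is "ä" in iso-8859-1, but an invalid byte sequence in UTF-8
$ touch "$(printf '\344')"
# a UTF-8 terminal can't render that byte, so ls falls back to a placeholder
$ ls
?
# the raw byte is still stored untouched in the directory entry
$ ls | xxd -p
e40a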
 
That's because you are using a file system that is encoding-agnostic: UFS and ZFS don't care whether the user process creating a file/directory or reading a directory is running in iso-8859-1, UTF-8, or any other encoding. To some extent that's good: No funny problems like I described. To some extent it is bad: Set one window to iso-8859-1, create a file. Set another window to UTF-8, and do an ls. You get what looks like the wrong file name ... it is just a display issue. For that reason, some industrial-strength file systems that are designed for multi-computer / multi-OS / multi-domain use are capable of transcoding file names. That sounds good, until you get into the problems of file names that can't be transcoded, or that become ambiguous. Oops.
A good case for international standardization on UTF-8, which I've used in every aspect of web deployment possible for over 12 years and counting. That's not to say I like to create potential problems for myself by using non-ASCII characters or spaces in my own filenames, but I cannot always dictate what kinds of filenames end users and other parties choose when the file system will support their choices.
 
UFS and ZFS don't care whether the user process creating a file/directory or reading a directory is running in iso-8859-1, UTF-8, or any other encoding. To some extent that's good: No funny problems like I described.
ZFS will care if you set utf8only=on (I almost always do this for every file system, unless I explicitly want a bag-of-bytes-filenames fs; I do so along with normalization=formD so that things like two different representations of é don't coexist). By default, ZFS won't care; it'll act just like a traditional Unix file system (any byte in file names except NUL and /).

Highly recommend that people think about making file systems with such properties enabled, unless you have a really good reason to use a legacy non-UTF-8 character set. See zfsprops(8) for all the details.
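For anyone following along, a minimal sketch (the dataset name is made up, and both properties can only be set at creation time):
Code:
$ zfs create -o utf8only=on -o normalization=formD tank/data
$ zfs get utf8only,normalization tank/data
NAME       PROPERTY       VALUE   SOURCE
tank/data  utf8only       on      -
tank/data  normalization  formD   -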
 
Those are both excellent ideas. One of the most terrible things about Unicode is that you can have two strings that have different binary encodings but look the same, if normalization is not enforced. In file systems, this leads to having two files whose names look indistinguishable, which is terribly confusing.
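For example, é can be stored either precomposed (U+00E9) or as e followed by a combining acute accent (U+0065 U+0301); they render identically but the bytes differ, which is exactly what formD normalization irons out. A quick check, assuming a printf that understands \u escapes like the one used earlier in the thread:
Code:
$ printf '\u00e9' | xxd -p
c3a9
$ printf 'e\u0301' | xxd -p
65cc81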
 
Normal users expect not to have to dive into options and line them up so that filenames display properly and are easily found. Even if it is an interesting topic, not everybody has the expertise or time to hunt down exactly why Chinese characters are not displaying correctly in Konsole, let alone figure out what to install/configure to fix the problem. This is why we have ISO standards trying to come up with flexible but usable and sensible defaults that FOSS can implement. Beyond that, it's a dog-eat-dog world, and I personally see ZFS as the top dog in that fight.
 