Unicode characters in file names

tankist02 · Feb 17, 2014

I followed this guide: https://cooltrainer.org/2012/01/02/a-freebsd-9-desktop-how-to/ to configure Unicode on my system. The guide is for FreeBSD 9, but I am running 10-RELEASE; I don't know whether that matters.

Now smplayer can't play files with Russian letters in the name. The same goes for Audacious. vlc doesn't have this problem. Is it related to the recent iconv changes in 10? Do I need to rebuild gtk with special options?

Probably unrelated: the console (mate-terminal) doesn't display Russian letters; it shows question marks instead. rsync shows escape codes instead of the letters, e.g. \#320\#235\#320\#265\#320\#220\#320\#275\#320\#263\#320\#265\#320\#273\#321\#213 FT. A-DESSA

ralphbsz · Feb 17, 2014

No idea about application-specific problems with smplayer, audacious, and such.

But the fundamental problem is much bigger. Putting "ambiguous" characters into file names is just a really bad idea. By "ambiguous", I mean: characters whose interpretation (for example, how they are to be rendered into pixels on a human-visible screen) depends on the environment.

Here's why. Imagine you have a single file system. As we all know, file names are defined as arrays of 8-bit bytes, and only the bytes that correspond to the characters nul and "/" are invalid (nul because it ends the string, and "/" because it tells lots of library and kernel functions where the boundaries between directory names are). As long as you only put displayable 7-bit ASCII characters (from 0x20 = space to 0x7E = ~) into file names, we know that nearly every shell, nearly every terminal emulator, and nearly every application will be able to display them, and the result will always be the same.
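To make the two rules above concrete, here is a minimal Python sketch. The function names are made up for illustration, and the check deliberately ignores the reserved names "." and ".." as well as any length limit a particular file system may impose.

```python
def is_valid_name(name: bytes) -> bool:
    """Any non-empty byte string without nul or '/' is a legal path component."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def is_plain_ascii_name(name: bytes) -> bool:
    """Legal, and made only of displayable 7-bit ASCII (0x20 through 0x7E)."""
    return is_valid_name(name) and all(0x20 <= byte <= 0x7E for byte in name)


if __name__ == "__main__":
    samples = [
        b"report.txt",                                  # legal and plain ASCII
        b" \t",                                         # legal, but contains a tab
        b"\x03",                                        # legal, a single Control-C
        "\u0444\u0430\u0439\u043b".encode("utf-8"),     # legal, Cyrillic in UTF-8
        b"a/b",                                         # not a single component
    ]
    for sample in samples:
        print(sample, is_valid_name(sample), is_plain_ascii_name(sample))
```

Only the first sample passes both checks; the next three are legal names that not every environment will display the same way.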

Obviously, it is a very bad idea to create file names that are nearly impossible to display. For example, it is technically legal to create a file whose name is space and tab (two characters long), but you will probably have a really hard time seeing that file in the output of ls (since its name is transparent, whitespace, or invisible), and operating on that file (be it from a shell, be it from a graphical tool like an X Window file manager) will require trickery. Even more fun can be had with a file whose name is a single Control-C (character 0x03), which is awfully hard to enter.

Now let's switch to larger character sets. Say you configure your "computer" (for example your xterm and your shell) to use the ISO-8859-1 character set, and create a file whose name is two characters long, namely "a with accent" followed by "a". No problem: you can do ls and see the file name, you can use it on the command line, and you can edit the content of the file with emacs or vi, because you are able to specify the name on a command line and recognize it. Now you reconfigure your computer to use a different encoding (for example UTF-8), and the name of the file is completely different! As a matter of fact, you can have two windows open, with xterm configured for different encodings and the locale environment variables (LANG, LC_ALL, LC_CTYPE) set differently, and one window can create a file, and in the other window the file has a completely different name. Even worse: in a different environment, the file name may become "undisplayable" (for example, it could end up having whitespace or control characters in it).
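The effect is easy to reproduce without touching a file system. The sketch below (Python, with a hypothetical helper named show) takes the "a with accent, then a" name from the example and reads its bytes back under the other encoding. Results are printed with ascii() so that the output does not itself depend on the terminal's encoding.

```python
def show(name: bytes, encoding: str) -> str:
    """Interpret raw file name bytes the way a tool using `encoding` would."""
    return name.decode(encoding, errors="replace")


text = "\u00e1a"                        # a with acute accent, followed by a
as_latin1 = text.encode("iso-8859-1")   # b'\xe1a'      (2 bytes)
as_utf8 = text.encode("utf-8")          # b'\xc3\xa1a'  (3 bytes)

if __name__ == "__main__":
    # Name created in an ISO-8859-1 window, viewed in a UTF-8 window:
    # 0xE1 is not valid UTF-8 here, so it turns into U+FFFD.
    print(ascii(show(as_latin1, "utf-8")))

    # Name created in a UTF-8 window, viewed in an ISO-8859-1 window:
    # the two bytes of the accented letter show up as two separate characters.
    print(ascii(show(as_utf8, "iso-8859-1")))
```

In both directions the bytes on disk never change; only the interpretation does.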

If you really want to put Unicode characters in file names, I only know of one sane solution: make 110% sure that ALL your windows, applications, shells, locales and such use UTF-8. You have to be religious about consistency, otherwise all hell will break loose.
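One way to check whether existing names are already UTF-8 is to look at their raw bytes. rsync prints bytes it considers unprintable as \#ooo octal escapes, so the line quoted in the first post can be turned back into bytes and decoded. This is a small Python sketch; the function name is invented for illustration.

```python
import re

# One escaped byte: backslash, '#', then three octal digits (000 to 377).
ESCAPE = re.compile(r"\\#([0-3][0-7]{2})")


def decode_rsync_escapes(text: str) -> bytes:
    """Turn rsync-style \\#ooo octal escapes back into the raw bytes."""
    out = bytearray()
    pos = 0
    for match in ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("ascii")  # literal part
        out.append(int(match.group(1), 8))              # escaped byte
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


escaped = (r"\#320\#235\#320\#265\#320\#220\#320\#275"
           r"\#320\#263\#320\#265\#320\#273\#321\#213 FT. A-DESSA")
raw = decode_rsync_escapes(escaped)

if __name__ == "__main__":
    print(raw)                  # the bytes as stored in the directory entry
    print(raw.decode("utf-8"))  # readable only if the terminal is UTF-8 too
```

The bytes decode cleanly as UTF-8 into eight Cyrillic letters followed by " FT. A-DESSA", which suggests the names themselves are fine and it is the environment reading them that is not using UTF-8.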

Now imagine what happens if you take that file system with UTF-8 encoded Unicode file names and share it with zillions of other users or computers (be it over NFS, CIFS, or a cluster file system, or through your HTTP server using local file names as URLs), and you can see why maintaining such consistency is very hard.

The real underlying issue is this. The file system (in the kernel) is given an array of bytes, and it is not given any information about how to interpret them as a displayable string (other than the nul and "/" characters). All it can do (and, by order of POSIX, all it is allowed to do) is to faithfully transport the binary values of the bytes in the string, without interpreting or modifying them. This means that the task of interpreting them (rendering them to humans, dealing with how humans enter or select them) falls to applications. Some applications are commonly used (for example the shells and the ls command), and developed with great care. Others less so. If applications treat file names as normal "strings" that follow the rules of other translatable, locale-dependent strings, they are making a big mistake, because the raw binary version will come back out of the file system in a different locale, to haunt you.
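As an illustration of how an application can avoid that mistake, Python's "surrogateescape" error handler lets a program carry a name around as text while still being able to hand the exact original bytes back to the kernel. The sketch below uses two hypothetical helper names and assumes the program wants to treat names as UTF-8 where possible; on POSIX systems os.fsdecode and os.fsencode apply the same idea using the configured file system encoding.

```python
def name_to_text(raw: bytes) -> str:
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_name(text: str) -> bytes:
    """Exact inverse of name_to_text: restores the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    latin1_name = b"\xe1a"          # not valid UTF-8
    as_text = name_to_text(latin1_name)
    print(ascii(as_text))           # the stray byte is kept, not replaced
    print(text_to_name(as_text) == latin1_name)
```

The point of the design is that the program never has to guess an encoding correctly in order to open, rename, or delete the file; guessing only matters when the name is shown to a human.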

tankist02 · Feb 17, 2014

Thanks for the detailed explanation. The problem is that it is not me who named the files. I'll create a script to automatically "translate" Russian letters into English equivalents and rename the files.
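A minimal sketch of what such a script could look like in Python follows. The transliteration table is an ad hoc assumption, not any official standard, and the function names are made up. It only handles Russian letters, skips a rename when the target already exists, and defaults to a dry run that changes nothing.

```python
import os

# Ad hoc Russian-to-Latin table (an assumption; adjust to taste).
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(name: str) -> str:
    """Replace Russian letters with Latin ones; leave everything else alone."""
    out = []
    for ch in name:
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)                  # not a Russian letter
        elif ch.isupper():
            out.append(latin.capitalize())  # keep the original casing
        else:
            out.append(latin)
    return "".join(out)


def rename_tree(root: str, dry_run: bool = True) -> list:
    """Rename entries under root; returns the (old, new) pairs it handled."""
    done = []
    # Bottom-up, so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new_name = transliterate(name)
            if new_name == name:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, new_name)
            if os.path.exists(dst):
                continue                    # never overwrite an existing entry
            done.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return done


if __name__ == "__main__":
    for old, new in rename_tree(".", dry_run=True):
        print(ascii(old), "->", new)
```

Running it with dry_run=True first and reading the planned renames is a cheap way to catch collisions or surprising transliterations before anything is changed on disk.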
