In theory, the only rules the operating system enforces are these: the nul character (the character whose code is zero) may not appear in a file name, and "/" is the directory separator and therefore can't be part of a name; all other 254 character codes may be used. There is also a maximum length, which on today's systems is so long that it rarely matters. In practice, one needs sensible rules to prevent going insane.
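To see just how permissive the kernel actually is, here is a small sketch (in a throwaway temporary directory; the file name is made up for the demo) that puts a newline into a file name:

```shell
# Sketch: the kernel accepts any byte except NUL and "/" in a name,
# even a newline. Done in a fresh temporary directory to be safe.
cd "$(mktemp -d)"
touch "$(printf 'line1\nline2')"   # one file, with a newline in its name

# Tools that assume "one name per line" now miscount:
ls | wc -l                         # counts 2 lines for a single file
```

This is exactly why the rules below are needed: the kernel will happily store names that break naive tooling.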
Here is a set of rules I consider reasonable. No spaces in file names, as already said; they make using the command line harder and wreak havoc on badly written scripts. Well-written scripts can handle anything, but writing scripts correctly is surprisingly hard.
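A minimal sketch of the word-splitting problem (the file name and variable are invented for the demo):

```shell
# Demo: unquoted variable expansion splits a name containing a space.
cd "$(mktemp -d)"
touch "my report.txt"

f="my report.txt"
# Unquoted: the shell splits $f into the two words "my" and "report.txt",
# so ls looks for two files that don't exist.
ls $f 2>/dev/null || echo "unquoted: looked for 'my' and 'report.txt'"
# Quoted: the name stays one word and the file is found.
ls "$f"
```

The fix is always the same, and always forgotten somewhere: quote every expansion.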
Much more important: never use non-printing characters (like newline), meaning anything with a code below 32, plus 127 (DEL): everything below space and just above tilde; we'll get to 128 and above later. And be super careful with special characters like "-*&%?#!<>". To begin with, never begin a file name with "-", since it will be mistaken for an option. Famous example: have two files named "-rf" and "*" in your directory, and "rm *" will delete everything, unless you are super careful. Also avoid file names that look like Windows devices (don't call a file "prn:"). If you want to have fun, create a file whose name is "~user" for a valid user name, and see what shell autocompletion does. (More fun in a footnote below.)
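The "-rf" trap can be re-enacted safely by substituting echo for rm, so we only print what rm would have been handed (file names invented for the demo):

```shell
# Safe re-enactment: what rm would actually see after glob expansion.
cd "$(mktemp -d)"
touch -- '-rf' file1 file2

echo rm *        # the glob expands to include -rf, which rm would parse as options
echo rm -- *     # "--" ends option parsing, so -rf is treated as a file name
echo rm ./*      # the "./" prefix keeps any expanded name from starting with "-"
```

Both defenses are worth memorizing: "--" works for most well-behaved tools, and "./*" works everywhere.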
Personally, I like using extensions that clearly indicate the file type: straight text files should be called ".txt", PDF documents ".pdf", and source code ".C". While that is not necessary, it makes life easier, since one knows right away how to use a particular file. And one isn't restricted to a few well-known extensions; one can also make up new ones. For example, I have several files called ".todo", which are my to-do lists, even though they are actually straight text files.
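On Unix the extension really is pure convention; content-sniffing tools like file(1) ignore the name entirely. A quick sketch with the ".todo" idea from above (file name and contents invented for the demo):

```shell
# Extensions are convention: file(1) classifies by content, not name.
cd "$(mktemp -d)"
printf 'buy milk\nmow the lawn\n' > chores.todo

file chores.todo   # reports plain text, despite the made-up extension
```

The extension helps the human (and some desktop software), not the kernel.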
And now a painful topic: which character set to use. If you stick to 7-bit ASCII, life will be easy. Anything else is problematic and will cause trouble unless you follow strict rules.

The problem is that the file system (in the kernel) doesn't store strings with a known encoding (character set and locale); it stores an array of bytes. The underlying problem is that the kernel does not know what locale and encoding a user process is using, and therefore cannot convert the strings to the correct encoding when returning them. If one user creates a file (puts the file name as a string into the kernel) in utf-8 encoding, and another user looks at the directory content (gets the file name as a string from the kernel) but is running in iso8859-1, then the second user will see nonsense.

So here are my recommended rules. Either disallow any file names that contain non-ASCII characters (no European or CJKV = Asian characters). Or make sure absolutely everyone who uses that file system (including people who use it via NFS and CIFS = Samba) uses *only* utf-8 encoding. I understand that this rule is not friendly to people outside English-speaking countries, but it really does prevent chaos and confusion. A bad alternative: make sure all processes use the same locale rendering (for example iso8859-1 in western Europe), but that doesn't work well when some processes happen to be set to utf-8, for example when logging in from a terminal emulator on a Windows or Mac machine that uses the local rendering.

Examples of the chaos: one process creates a file (say, named with an "a" with an acute accent), and another user sees gibberish, perhaps multiple characters, perhaps something undisplayable. Even better: a process creates two files whose names are distinguishable to him (for example two "a"s with different accents).
Another process later sees two files whose names, in his locale's rendering, look exactly the same, and he has no idea how to tell them apart. He may not even be able to enter their names from the keyboard, so he ends up with a file he can't even delete without resorting to the dangerous "rm *".
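A sketch of why the kernel can't help here: the name is stored as raw bytes. "é" is the two bytes 0xC3 0xA9 in utf-8 but the single byte 0xE9 in iso8859-1, so an iso8859-1 terminal renders a utf-8 name as two junk characters (the file name is invented for the demo):

```shell
# The kernel stores names as byte arrays; encoding is a userland fiction.
cd "$(mktemp -d)"
touch "$(printf 'caf\303\251')"   # "café" with a utf-8 encoded é

# Dump the stored name byte by byte: 303 251 (octal) = 0xC3 0xA9.
# An iso8859-1 terminal would render those two bytes as "Ã©".
ls | od -An -c
```

The same two bytes come back to every reader; what they look like depends entirely on the reader's locale.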
Footnote about really brutal fun: if you are a file system implementor, you can actually allow file and directory names that contain the "/" character in the kernel. I did that once by mistake: while implementing the Windows-to-Unix character set conversion, we accidentally created "/" in file names. It's surprising how much stuff actually works correctly: in the output of ls, you see a single entry whose name is "a/b". If that entry is a directory, you can create "a/b/c", which is file "c" in directory "a/b". It is also surprising how spectacularly things break. Obviously, the shells are toast when it comes to globbing and autocompletion. What surprised me is how badly they blow up; core dumps from the shell mean that some programmer was sloppy. What is less obvious: the standard C library also blows up; it turns out functions like "open" like to parse file names, and the library is also written very sloppily.
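You can't reach that broken state from ordinary userland, which is why so much code assumes it is impossible. A quick sketch showing that "/" always reaches the kernel as a separator, no matter how you quote it (names invented for the demo):

```shell
# No amount of shell quoting puts "/" into a name: the quoted string is
# handed intact to the open()/creat() system call, and the kernel's
# path walker treats every "/" in it as a separator.
cd "$(mktemp -d)"
touch 'a/b' 2>/dev/null || echo "failed: there is no directory named a"
mkdir a
touch 'a/b'       # now succeeds, creating file "b" inside directory "a"
ls a              # shows: b
```

Quoting only protects the string from the shell; the kernel interprets "/" after the shell is done.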