UFS Characters allowed in UFS filenames

bugzeo

Member

Reaction score: 18
Messages: 95

Which characters are allowed in file names on a UFS partition, and which are not? Can a colon ":" be used in filenames?
 

ralphbsz

Son of Beastie

Reaction score: 2,299
Messages: 3,207

Finding it in the source code is really hard, because it is not where you expect it.

In practice, only two characters are illegal in file names: NUL (because the C-based code thinks that the string ends there), and slash (because that's a directory separator). So a file name "a/b" will be interpreted as file b in subdirectory a.
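If you want to check this on your own machine, a small Python sketch along these lines should do. It assumes a POSIX system and a file system that stores names as opaque bytes (as UFS does); `try_create` is a made-up helper, not a library function. Note that Python itself rejects the NUL byte before the kernel ever sees it.

```python
import os
import tempfile

def try_create(directory, name):
    """Try to create an empty file called `name` inside `directory`.

    Returns "ok" on success, otherwise the name of the exception raised.
    (Hypothetical helper, for illustration only.)
    """
    try:
        with open(os.path.join(directory, name), "w"):
            pass
        return "ok"
    except (OSError, ValueError) as exc:
        return type(exc).__name__

with tempfile.TemporaryDirectory() as d:
    candidates = ["a:b", "a b", "a<|>;b", "a\nb", "a\0b", "a/b"]
    results = {name: try_create(d, name) for name in candidates}
    created = sorted(os.listdir(d))

for name, outcome in results.items():
    print(repr(name), "->", outcome)
```

The colon, space, shell metacharacters and even the newline all go through. The NUL name is refused, and "a/b" fails only because there is no directory "a" to put "b" into.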

There is one minor complication, but I think in practice it makes no difference: The kernel doesn't know what string encoding is in use when it passes a file or directory name to and from user space. So if a byte has the 8th bit set, the displayed result will depend on whether the userspace is running in ISO-8859-1 or UTF-8, but the file system itself does not care. So if a file name is the byte sequence "C4 B4", in 8859-1 it is two characters (A with two dots followed by an acute accent), while in UTF-8 it is a single, even stranger accented character (J with a hat). But as far as the file system is concerned, those are legal.
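The "C4 B4" example can be reproduced without touching a file system at all. This is only a sketch of the decoding step that a terminal or file manager performs; the variable names are arbitrary.

```python
import unicodedata

raw = bytes([0xC4, 0xB4])            # the two-byte file name from the example

as_latin1 = raw.decode("iso-8859-1") # one character per byte
as_utf8 = raw.decode("utf-8")        # one two-byte sequence, one character

latin1_names = [unicodedata.name(c) for c in as_latin1]
utf8_names = [unicodedata.name(c) for c in as_utf8]

print(len(as_latin1), latin1_names)
print(len(as_utf8), utf8_names)
```

Same bytes on disk, two different things on screen, which is exactly why the file system stays out of the business of interpreting them.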

But: Not everything that is legal is a good idea. If you want to have a peaceful, quiet and simple life, you should not use special characters such as space, "<|>;" and a few others that have special meaning to the shell, nor control characters (the file name "a\nb" is evil, it makes the output from ls look funny). And if you don't want trouble from character set rendering, also avoid all characters with the 8th bit set (a rule that's hard to enforce in locales such as French or German or Chinese that really need local character sets). But those are not illegal, they're just inconvenient if you don't follow the rules carefully.
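For anyone who wants to enforce the "quiet life" rule in a script, here is a minimal sketch. It uses the POSIX portable filename character set (letters, digits, period, underscore, hyphen) as the definition of "boring", which is stricter than anything said above, and it additionally rejects a leading hyphen. The function name `is_conservative_name` is hypothetical.

```python
import re

# Letters, digits, '.', '_' and '-' only; the first character may not be '-'
# because a leading hyphen is easily mistaken for a command-line option.
_PORTABLE = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")

def is_conservative_name(name):
    """Return True if `name` is a single, shell-friendly file name."""
    if name in (".", ".."):          # reserved directory entries
        return False
    return _PORTABLE.fullmatch(name) is not None

for candidate in ["dentist.cal", "Artist_-_Title.mp3", "a b", "a:b", "a\nb", "-rf"]:
    print(repr(candidate), is_conservative_name(candidate))
```

This is a policy check, not a statement about what UFS accepts: every name it rejects here, apart from the empty string and anything containing "/" or NUL, is still legal on disk.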

Now, you asked: where in the source code is "/" illegal? It turns out that I tried this once: I once for fun created a few files that had slashes in file names (I was a file system developer, so I could do stuff like that). It is a really bad idea. The reason is not in the kernel (where most file systems actually handle it pretty well), but in the C run time library: It has the habit of splitting path strings apart at slashes, into directory names and file names. And if you try to call functions to work on file "a/b", it will try to open file b in directory a, and fall flat on its face. So there are a few places in the kernel (don't know whether in the VFS or in the individual file systems in FreeBSD) where "/" is handled specially, and a few places in the user-space run time library. And it's not that the slash is illegal, it is simply used for a different function.
 

Trihexagonal

Son of Beastie

Reaction score: 2,308
Messages: 2,883

Cyrillic characters in a file name are sure to be something you remember not to do again. They don't appear in video files downloaded from YouTube, and changing the names before backup is good planning.

Those files are not recognized when downloading a directory. Having 20+ files that are invisible to the eye and that you can't work with is how you find that out.

I use _-_ in .mp3 filenames. Not using a space when naming files was the first thing I learned about UNIX.

No keyboard character is denied a fair part in creating my passwords.
 

astyle

Aspiring Daemon

Reaction score: 363
Messages: 839

The forward slash is found in fully formed URLs, it's UNIX convention to use it to represent directories, and some UNIX utilities like sed(1) use it as an escape character.
 

ralphbsz

astyle said: "The forward slash is found in fully formed URLs,"
Which are fundamentally nothing but filenames, with a protocol and host identifier prepended. In the URL, slashes are de facto directory separators.

astyle said: "it's UNIX convention to use it to represent directories,"
That's much more than a convention, it's fundamental to the way Unix file systems work. One of the radical (and revolutionary) ideas that first went into mass production in Unix was that every resource in the system looks like a file path name, and all can be found in the file system. No more "PRN:", it is /dev/lpt. No more SYS$DISK:[user.dir]file.ext, but /home/user/dir/file.ext. No more file paths that start with DUA0: versus DUA1:, the name of the disk is hidden and absorbed into the mount point. And in particular no more "SYSIN DD UNIT=...,VOL=...,DISP=(,,KEEP),RECFM=VBS,LRECL=133". The radical part was saying: everything is a file or a directory, and you can always find them by counting slashes, and they can be mounted anywhere.

astyle said: "and some UNIX utilities like sed(1) use it as an escape character."
I think you mean backslash, which is used for escaping things in many places.
 

bugzeo

Since FreeBSD 5.x we have UFS2 as default ufs version.

This Wikipedia comparison of file systems says "Any byte except NUL" is allowed.

So I wonder about "/". :) It should be possible to look it up in the source code.
I don't know, but it didn't work for me:
Code:
freebsd% touch '//'
touch: //: Permission denied

My root and home file systems are on the same partition; no idea whether it is UFS or UFS2:
Code:
mount
/dev/gpt/rootfs on / (ufs, local, soft-updates)
 

bugzeo

Trihexagonal said: "Cyrillic characters in a file name are sure to be something you remember not to do again. They don't appear in video files downloaded from YouTube, and changing the names before backup is good planning.

Those files are not recognized when downloading a directory. Having 20+ files that are invisible to the eye and that you can't work with is how you find that out.

I use _-_ in .mp3 filenames. Not using a space when naming files was the first thing I learned about UNIX.

No keyboard character is denied a fair part in creating my passwords."
Yeah, ideally you would only use non-control ASCII letters, exactly {{a..z},{A..Z}}. That would be ideal if everybody spoke English. But if you sell things in another country (say iPhones, Microsoft Windows, PlayStation, etc.) you must offer those products in local languages, and by law that is not optional. Most people won't accept something purchased new that is not translated. Hell, it even happens with movies, books and video games. So anyway, a modern operating system should already run everything in UTF-8: create any file in any language and have it saved properly on the file system. If FreeBSD can't do it by default, it's time to change it.
 

Vull

Aspiring Daemon

Reaction score: 364
Messages: 637

Code:
len@mate:~ $ mkdir work
len@mate:~ $ cd work
len@mate:~/work $ touch ¢ह€한𐍈
len@mate:~/work $ ls
¢ह€한𐍈
len@mate:~/work $ rm ¢ह€한𐍈
len@mate:~/work $ ls
len@mate:~/work $
len@mate:~/work $ freebsd-version
13.0-RELEASE-p3
 

bugzeo

Vull said:
Code:
len@mate:~ $ mkdir work
len@mate:~ $ cd work
len@mate:~/work $ touch ¢ह€한𐍈
len@mate:~/work $ ls
¢ह€한𐍈
len@mate:~/work $ rm ¢ह€한𐍈
len@mate:~/work $ ls
len@mate:~/work $
len@mate:~/work $ freebsd-version
13.0-RELEASE-p3
It doesn't work for me. I'm using the prebuilt VirtualBox machine from FreeBSD-13.0-RELEASE-amd64.vhd.xz

[Attached screenshot: X36URAc.png]
 

Vull

bugzeo said: "I don't see the characters properly, there are squares with bits inside. Didn't you notice?"
It's because your fonts don't display all of the utf-8 character set. The squares with bits inside are how the system displays unsupported utf-8 characters. The filesystem still contains the correct utf-8 byte sequences.
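One way to convince yourself that the bytes on disk are fine regardless of fonts is to skip the display step entirely and look at the raw name. The sketch below assumes a file system that stores names as opaque bytes (as UFS does) and uses Python's bytes-based path API so that the locale plays no role.

```python
import os
import tempfile

# The same five characters as in the session above (¢ ह € 한 𐍈),
# written as escapes so the source file's own encoding does not matter.
name = "\u00a2\u0939\u20ac\ud55c\U00010348"
raw_name = name.encode("utf-8")            # the byte sequence that gets stored

with tempfile.TemporaryDirectory() as d:
    d_bytes = os.fsencode(d)
    with open(os.path.join(d_bytes, raw_name), "wb"):
        pass                               # create an empty file
    stored = os.listdir(d_bytes)           # bytes in, bytes out

print(len(stored[0]), stored[0].hex())
```

Five characters become 15 bytes (2 + 3 + 3 + 3 + 4), and those bytes come back from the directory listing unchanged whether or not any installed font has glyphs for them.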
 

bugzeo

Vull said: "It's because your fonts don't display all of the utf-8 character set. The squares with bits inside are how the system displays unsupported utf-8 characters. The filesystem still contains the correct utf-8 byte sequences."
Then the graphical software lacks UTF-8. I thought that came with all fonts already.
 

ralphbsz

bugzeo said:
I don't know but didn't work for me:
Code:
freebsd% touch '//'
touch: //: Permission denied
Carefully read the error message. It didn't say "What you are asking for is impossible"; what it said is: "you are not allowed to do this". That's because the "/" directory already exists (the path "//" refers to the same root directory), is owned by root, and you are not allowed to modify it. If you do an "ls -ld" command on it, you'll see:
Code:
> ls -ld /
drwxr-xr-x  23 root  wheel  1024 Jul 12 11:04 //

But note that you can do this: "mkdir -p a/b/c", and "mkdir -p d//e///f////"
 

Vull

bugzeo said: "Then the graphical software lacks UTF-8. I thought that came with all fonts already."
No. I also see the little boxes for some of the characters. If you look at them with enough magnification you can see the utf-8 code sequences in the little boxes. I'm guessing that you're probably using a Windows-hosted or Linux-hosted browser with a more complete font set to look at the forum webpage, but your terminal window is running on a FreeBSD virtual host, with a different set of fonts.

I don't know offhand what fonts you need to install in your FreeBSD virtual machine to display these UTF-8 characters, but I'm sure they're available. I don't need all those UTF-8 characters for what I do with this machine, so I've never taken time to try to install them.

When you install FreeBSD, you intentionally get a very stripped-down, minimal operating system. That's considered a feature, not a shortcoming. You only need to install the features you need. When some of our forum members refer to "bloatware," they're making the point that most operating systems install a lot of features you don't necessarily need by default.

As an English-speaker who is not trying to provide internationalized software, I don't need to display Chinese characters, and so I haven't installed all the additional fonts or software needed to display them. Nevertheless, I do write PHP scripts and HTML pages which do have multi-byte character support, so I've had to learn to at least understand them and how they work.
 

bugzeo

ralphbsz said: "But note that you can do this: "mkdir -p a/b/c", and "mkdir -p d//e///f////""
Yeah, but neither of the two cases creates a file or folder called '/'; two or more consecutive slashes are collapsed into one:
Code:
$ mkdir -p d//e///f////
$ find .
.
./d
./d/e
./d/e/f
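
The same collapsing can be seen from a program. The sketch below assumes a POSIX system; it first normalizes the path purely as text, then asks the kernel whether the messy and the tidy spelling refer to the same directory.

```python
import os
import posixpath
import tempfile

# Purely textual view: normpath() drops the redundant separators.
tidy = posixpath.normpath("d//e///f////")
print(tidy)

with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "d", "e", "f"))
    messy = base + "/d//e///f////"
    # Kernel view: both spellings resolve to the same directory.
    same = os.path.samefile(messy, os.path.join(base, "d", "e", "f"))
    names = os.listdir(base + "/d//e")

print(same, names)
```

No entry named "/" or "" shows up anywhere in the tree; the extra slashes never reach the directory as names.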
 

ralphbsz

bugzeo said: "Then the graphical software lacks UTF-8. I thought that came with all fonts already."
Unicode is a moving target, as it adds new characters with some regularity. If you want quick updates to the most up-to-date fonts with glyphs, you're better off with systems that have automatic quick updates and a large staff to do updates, such as MacOS, Windows or ChromeOS.
 

Vull

Unicode has room for over a million code points according to this article: https://en.wikipedia.org/wiki/UTF-8

That's a lot of characters to try to support. It's a big job. This site helps you see whatever characters you might be missing: https://unicode-table.com/en/#basic-latin

It seems pretty complete but I don't know. Here's a list of articles on the subject (it's a pretty big subject):


If you can get this all figured out please let us know. It's a pretty tall order.
 

ralphbsz

bugzeo said: "Yeah but none of the two cases create a file or folder called '/', double or more / are collapsed into one:"
OK, time to slow down. In Unix, things have names. Things that are stored in file systems can be files, directories, and a few other things (soft-links, devices, FIFOs, and other crazy stuff). Everything that can be found in a file system has a string that gives the complete way to find it; that's called the path. Typical paths may be /home/ralphbsz/calendar/next_tuesday/dentist.cal, or /usr/local/bin/bash, or /dev/ttyU0. The fact that these things exist immediately implies that directories with paths /, /home, /home/ralphbsz and /home/ralphbsz/calendar and /home/ralphbsz/calendar/next_tuesday also exist. By construction, these paths have to be unique: At any given time, there is only one thing at each path, so the mapping from path to object is unique. The opposite is not true: The same object may be visible at two path names (due to soft- and hard-links).

Every object in the file system also has a name, which is the last component of the path name. That name is called the "filename" of the object. So for example, my dentist appointment is in a file that has filename dentist.cal, which in turn is in a directory with filename next_tuesday, and so on. Filenames are not at all unique system-wide (I bet my wife and son also have a file named dentist.cal somewhere in their directory hierarchy), but they are unique within a directory (there are some VERY bizarre exceptions with non-unique file names that involve Unicode character translation).

So, the complete path of an object is nothing but the concatenation of all the filenames of the directories the object is in, plus filename of the object itself, and the concatenated objects are separated by "/". Conversely, the filenames are nothing but the components of the path, if you split it at "/" characters. Implicit in the path is the root directory, which doesn't really have a name (it is zero length), and that can always (and only!) be found at the path "/". This would be a good place to talk about the difference between absolute and relative paths, but I'm too lazy.
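
The path/filename relationship described above can be sketched with Python's `pathlib`, which does exactly this splitting at "/" without touching the disk. It uses the example path from earlier in this post.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/home/ralphbsz/calendar/next_tuesday/dentist.cal")

parts = path.parts                                    # root, then one filename per level
filename = path.name                                  # last component only
ancestors = [str(p) for p in reversed(path.parents)]  # directories implied by the path

print(parts)
print(filename)
print(ancestors)

# Gluing the filenames back together with "/" gives the original path again.
rebuilt = "/" + "/".join(parts[1:])
print(rebuilt == str(path))
```

The first element of `parts` is the root directory, shown as "/" because it has no name of its own.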

There can never be the "/" character in any filename, by construction: that character is the separator between filenames. As I said above, if you try to create "a/b", that's not a filename "ah slash bee", but the filename "b" in the directory "a". By convention, it's legal to add too many slash characters: the two paths "a/b" and "a////b" mean exactly the same thing. This is to make it easier to construct path strings by concatenating substrings: You can for example take "/home/ralphbsz/" and "/calendar/" and "/next_tuesday/dentist.cal" and just shove them together, and it will work.
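
A small sketch of the "shove them together" trick, using the three substrings above. It contrasts plain string concatenation with Python's `posixpath.join`, which follows a different rule and is easy to trip over here.

```python
import posixpath

pieces = ["/home/ralphbsz/", "/calendar/", "/next_tuesday/dentist.cal"]

# Plain concatenation leaves doubled slashes, which path lookup ignores.
shoved = "".join(pieces)
tidy = posixpath.normpath(shoved)

# join() is NOT concatenation: an absolute component discards everything
# that came before it.
joined = posixpath.join(*pieces)

print(shoved)
print(tidy)
print(joined)
```

So the doubled slashes from naive concatenation are harmless, whereas a "smart" join helper can silently throw away the leading directories when a later piece starts with "/".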

Anecdote: As I hinted at above, I once (with a colleague) for fun created a file system where files DID contain slashes in their name, and where the creat() system call was capable of creating a file named "a/b". It caused lots of very funny crashes, and misunderstandings. The other fun thing to do is to implement file systems that do Unicode translation. Say for example you have a single directory containing three files, called "à", "á" and "ä" (lowercase a with backward accent, same with forward accent, same with two dots), but you're rendering it in an environment where users can't display anything other than US-ASCII. When you do an "ls", the user will see three files, all just called "a". That will make the user's head hurt. When they try to open the file called "a", should they get one of the three, or should we refuse because it doesn't exist? Either way, the user will be angry. And if they try to create file "a", should we refuse to do that (it's unsafe because ls showed three files called "a" already), or should we let them do it? Much fun, but fortunately, this problem doesn't exist in most real-world situations.
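
To make the collision concrete, here is a sketch of one possible lossy translation. This is an assumption for illustration, not how any particular file system does it: decompose each name with Unicode NFKD and drop everything that is not US-ASCII. The helper name `to_ascii` is made up.

```python
import unicodedata
from collections import defaultdict

def to_ascii(name):
    """Hypothetical lossy translation: strip accents, keep only ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

stored = ["\u00e0", "\u00e1", "\u00e4"]   # the three files: à, á, ä

# Group the real names by what an ASCII-only user would see in "ls".
shown = defaultdict(list)
for real_name in stored:
    shown[to_ascii(real_name)].append(real_name)

for visible, real_names in shown.items():
    print(repr(visible), "<-", len(real_names), "different files")
```

Three distinct directory entries end up behind one visible name, and nothing in the translated listing tells the user which one "a" will open.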
 