Why does cpdup skip files?

Why does sysutils/cpdup not copy files whose names contain the "№" symbol? Here is the error: "path_to_the_file_with_№_symbol copy: open failed: No such file or directory".
 
The problem is clearly some Unicode string conversion involving file names. The underlying cause is a deep design flaw in Unix. In the old days (30 and more years ago), all strings were intended to be displayed in a fixed and known character set, typically 7-bit US-ASCII. Unix implemented strings as arrays of 8-bit bytes, which worked well with the ASCII character set.

Later, the definition of a string had to be made more general, to allow more complex character sets, for i18n. Initially this was done with 8-bit character sets, and as long as all processes on a set of connected computers (a cluster) used the same character set (for example ISO 8859-1 in Western Europe), this worked fine, and the kernel and C-library routines didn't have to know what the sequence of 8-bit bytes actually meant. But this technique was quickly found to be insufficient in two areas: first, where multiple 8-bit character sets have to coexist (for example a computer used simultaneously by users working in French with ISO 8859-1 and in Ukrainian with KOI8-U); and second, for CJKV languages, where 8 bits were insufficient.

Slowly, over the last 30 years, this has led to most string data being converted to Unicode, usually stored in a UTF encoding (often UTF-8). Userspace applications gradually learned how to interpret stored data in various string formats, and where necessary how to convert from one encoding to another. To do that, they use locale settings; in Unix, the default locale for a process comes from the LANG and LC_* variables. In userspace this either works fine, or, where it doesn't, it's a bug in an application.
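As a byte-level sketch of why this matters (assuming an iconv(1) that knows the CP1251 and UTF-8 charsets): the "№" sign is a single byte, 0xB9, in CP1251, but a three-byte sequence, 0xE2 0x84 0x96, in UTF-8:

```shell
# Illustration only: the same character has different byte encodings.
# 0xB9 is "№" in CP1251; decoded as CP1251 and re-encoded as UTF-8
# it becomes the three bytes E2 84 96.
printf '\271' | iconv -f CP1251 -t UTF-8 | od -An -tx1
# Interpreting the lone byte 0xB9 *as* UTF-8, by contrast, fails outright:
printf '\271' | iconv -f UTF-8 -t UTF-8 >/dev/null || echo "not valid UTF-8"
```

So a program that assumes one encoding while the bytes were written in another will either show gibberish or, as with cpdup here, fail.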

The problem is that the kernel doesn't know the locale of a user process. There are very few places where text strings cross the userspace/kernel boundary, the main one being file names. If one process running with one encoding (for example ISO 8859-1) puts a text string into the kernel (by creating a file with that name), and another process running with a different encoding (for example Unicode encoded as UTF-8) retrieves that string (via readdir() or by opening the file), it will get back the original 8-bit bytes; but if it wrongly interprets them as a UTF-8 encoded string, the ISO 8859-1 characters will look like gibberish. This can cause bugs in software, as your example shows.
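A quick way to see this (a sketch; the scratch directory is just for illustration) is to create a file whose name contains a raw CP1251 byte and then list it from a UTF-8 locale:

```shell
# The kernel stores file name bytes verbatim; it is the terminal and
# the tools that (mis)interpret them.
dir=$(mktemp -d)
cd "$dir"
touch "$(printf 'report_\271.txt')"   # name contains the raw CP1251 byte 0xB9
ls                                    # a UTF-8 terminal shows mojibake here
ls | iconv -f CP1251 -t UTF-8         # decoded as CP1251: report_№.txt
```

The file is stored and retrievable either way; only the interpretation of its name bytes differs.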

This design flaw remains in all Unix-derived operating systems. To my knowledge, only one such system has solved it, and that's Mac OS when using Apple's HFS+ file system: it enforces that all file names be Unicode, encoded as some form of UTF.

The real fix will be to read the source code of cpdup, or find the author or maintainer of that software, and fix the bug: when dealing with file names, one cannot assume any specific encoding and must simply transport them as binary blobs. The only bytes with clearly specified semantics in file names are NUL (the zero byte that terminates a string) and '/', which separates path components.

Here is a suggestion for a hack which might temporarily get around the problem: for all processes involved, clear (meaning undefine or unset) all environment variables that begin with LC or LANG. In sh-derived shells (such as ksh or bash), that can be done with "unset L..."; I don't remember the syntax for csh-derived shells. Then set exactly one language variable, namely LANG=C. Do this not only for the local process that starts cpdup, but also on any remote machines involved via ssh, for example by putting it into the .profile or .cshrc startup file. It *might* solve the problem by preventing some string library from attempting a conversion.
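In an sh-derived shell the hack could look like this (a sketch; the exact set of LC_* variables present varies per system, so the list below is an assumption):

```shell
# Clear every locale-related variable, then force the plain C locale.
unset LANG LC_ALL LC_CTYPE LC_COLLATE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME
export LANG=C
locale   # every category should now report "C" (or "POSIX")
```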

Good luck!
 
Thank you very much for your detailed answer. I've tried your hack, but it still doesn't work.

Actually, I synchronize a shared folder from a Windows server to my FreeBSD box. To do that, I created a .nsmbrc file with the necessary settings; charsets=utf8:cp866 is one of them. I also created a script which mounts the remote folder on /mnt and then starts cpdup with some arguments. I chose cpdup because it gives me an easy way to do this. The local folder (which contains the backups) I shared using Samba, so computers on the network get read/write access to it. If I remove charsets=utf8:cp866 from .nsmbrc, then files with Cyrillic names are not visible to Windows clients.

Maybe there is another way to copy changed files and folders from the remote folder to a local drive using FreeBSD? It's worth noting that I need to copy files whose mtime attribute has changed, as well as newly created files (if I'm not mistaken, this is called one-way synchronization).
 
Depending on the origin of the files in the shared folder, UTF-8 file name incompatibilities might be caused by the so-called normalization of pre-composed characters. Unix file systems generally use NFC (the composed form), while Mac OS X uses NFD (the decomposed form). So if some of the files were created on a Mac, this may explain why some utilities fail to identify them properly.
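The difference is easy to demonstrate in the shell (a sketch using "é", whose NFC form is the single code point U+00E9 and whose NFD form is "e" followed by the combining acute accent U+0301):

```shell
# Both strings render as "é", but their bytes differ, so file name
# lookups treat them as two distinct names.
nfc=$(printf '\303\251')    # U+00E9, UTF-8 bytes C3 A9
nfd=$(printf 'e\314\201')   # U+0065 U+0301, UTF-8 bytes 65 CC 81
[ "$nfc" = "$nfd" ] && echo same || echo different   # prints "different"
```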

net/rsync has a --iconv option which can be used to change the file name normalization in the course of copying, e.g. --iconv=utf-8-mac,utf-8. However, I have never used this and cannot tell whether it is smart enough to leave NFC file names alone, so the danger is that you would only end up with a mirrored compatibility problem.

I solved a similar problem on my FreeBSD file server using converters/convmv. This utility may also rectify some older Samba file name encoding problems. I used the following two convmv(1) commands on the respective folder:

1. for starting a dry-run without changes:
convmv -r -f utf-8 --nfd -t utf8 --nfc /path/to/shared/folder

2. for eventually converting the NFD file names to NFC:
convmv -r -f utf-8 --nfd -t utf8 --nfc --notest /path/to/shared/folder

You could also try a cloning tool that is agnostic to file name normalization. I wrote sysutils/clone, and it has an -s option for synchronizing the source and the target, i.e. only changes are updated in the target directory; see clone(1).
 
If you're using ZFS, the normalization file system property may also be of help. But you can only set it when you create the file system.
 
I tried net/rsync and sysutils/clone as well. But, like cpdup(1), they do not copy files with the "№" symbol in their names. I also created a new file system with various ZFS normalization flags and tried to copy the files there, but this did not help either.
 
I've solved the problem. The point is that the built-in utility mount_smbfs must be supplied with a correct .nsmbrc file in the user's home directory. If it isn't, then the ls command cannot correctly display files with the "№" symbol in their names. The same happens with the cpdup and rsync commands: they cannot read file names that contain the "№" symbol.

Solution:
First of all, .nsmbrc should contain the directive "charsets=cp1251:cp866" (or "charsets=cp1251:cp1251"), even if your system locale is something other than cp1251. This alone is enough to copy files from the Windows server and store them with correct names on the FreeBSD box. Secondly, if you want to share these files (stored on the FreeBSD box) via the Samba server and browse them from a Windows machine, you should add the directive "unix charset = cp1251" to smb4.conf. The only problem this solution does not address is displaying these file names correctly in the console and in ssh clients such as PuTTY.
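To illustrate, the two directives could sit in their files like this (a sketch; the [default] section name and the smb4.conf path are assumptions, so check nsmb.conf(5) and smb.conf(5) for the exact syntax on your system):

```
# ~/.nsmbrc
[default]
charsets=cp1251:cp866

# /usr/local/etc/smb4.conf
[global]
unix charset = cp1251
```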
 