tar and utf-8

I have a WordPress installation that includes some files in wp-content/uploads with Greek characters in their names.

If I copy the uploads directory using:


cp -fr uploads uploads_temp
rm -fr uploads
mv uploads_temp uploads


then the link to the file works and opens the file in the browser.

But if I copy the files using:


tar -cf uploads.tar uploads
rm -fr uploads
tar xf uploads.tar


then the link to the file shows a "404 Not Found"

Also, if I use FTP with "wget -m" to copy the files, it works.

If I copy the file name from the console and paste it into the browser, it loads fine.

locale shows:

Code:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Any idea what is going on?
 
I also tried:


LC_ALL=en_US.UTF-8 tar -cf uploads.tar uploads
rm -fr uploads
LC_ALL=en_US.UTF-8 tar -xf uploads.tar


but still the same issue.
 
This works:


LC_ALL=C tar -cf uploads.tar uploads
rm -fr uploads
LC_ALL=C tar -xf uploads.tar


but I get messages like these:


Can't translate pathname 'uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-768x1067.jpg' to UTF-8:
Can't translate pathname 'uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg' to UTF-8:
 
What is strange is that I see the filename with the correct Greek characters in the console.

Check these:


find . -name "*covid-19-216x300.jpg"
./uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg
./bad/uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg


If I copy the filename text from the first line and paste it into the find command:


find . -name "αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg"
./uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg


If I copy the filename text from the second line and paste it into the find command:


find . -name "αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg"
./bad/uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg


If I copy the filename text from this post and paste it into the find command:


find . -name "αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg"
./bad/uploads/2020/03/αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-216x300.jpg


So the "bad" directory has the correct UTF-8 filename and the original directory filename is not good UTF-8? But how it's possible both filenames are the same in the eye?
 
tar(1) does mention the LANG environment variable, and it seems to work for me without giving the translation error messages (running FreeBSD 12-STABLE).

Using this command I don't get the translation error messages, but after untarring the files are not accessible from the web server:

LANG=en_US.UTF-8 tar -cf uploads.tar uploads

If I use this command I get the translation messages, but after untarring the files are accessible:

LANG=C tar -cf uploads.tar uploads

What I don't understand is how the filenames can look exactly the same to the eye after untarring with either command.
 
I did another test:

1) I downloaded a file named "1692-25-8-2020-ΟΣ.-Οδηγίες-Covid-19-δεύτερο-κύμα-Ευπαθείς-ομάδες-ασθενών-κορων.pdf" using Safari on macOS.

2) Then I uploaded the file using the WordPress media manager.

3) Then I created a new post and made a link to this file.

4) Then I did:

tar -cf uploads.tar uploads
rm -fr uploads
tar -xf uploads.tar


5) Then I visited the post URL in Safari, clicked on the link, and it works. But the links to the other files (uploaded in previous days from Windows) don't work.

Maybe the other files uploaded from Windows don't have correct UTF-8 encoding? And during the tar -cf uploads.tar uploads command tar changes the filename to UTF-8, and when I do the untar the filename is correct UTF-8, but the link expects it to be in a different encoding, so it doesn't work?
 
This is most probably a mismatch of de- and/or pre-composed UTF-8 characters in the file names. Some time ago I wrote a blog post about this and how to solve it:

If I run that command, 80 files are converted, and then none of the links work.

This makes me believe that what I wrote above is correct:

"Maybe the other files uploaded from Windows have no correct UTF-8 encoding? And during the "tar -cf uploads.tar uploads" command tar changes filename to UTF-8, and when I do the untar the file is correctly UTF-8 but the link expects it to be in different encoding so it doesn't work?"

I also did another test: I uploaded another PDF with a Greek filename from Windows 7 using the WordPress media manager, then did the tar and untar, and the link to the file works. So it's difficult to reproduce the issue by uploading new files, but tomorrow I will speak with the user who uploaded the original files and try to reproduce it.
 
For future reference:

Some files uploaded to WordPress with Greek filenames have "good" encoding and some have "bad" encoding.

This command shows the files that are already in "good" condition:

find . -type f | perl -C -MUnicode::Normalize -n -e'print if $_ eq NFC($_)'

This command shows the files that will be in "bad" condition after tar/untar:

find . -type f | perl -C -MUnicode::Normalize -n -e'print if $_ ne NFC($_)'
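
If the names have already been normalized (for example by a tar round trip with a UTF-8 locale) and the links expect the decomposed form, a rough and untested sketch along the same lines could rename them back. The directory components in this case are plain ASCII, so only the file names change; test it on a copy first:

Code:
find . -type f | perl -CS -MUnicode::Normalize -MEncode -nle '
    my $nfd = NFD($_);                      # decomposed form of the whole path
    next if $nfd eq $_;                     # already decomposed, nothing to do
    rename(encode("UTF-8", $_), encode("UTF-8", $nfd))
        or warn "rename failed for $_: $!\n";
'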
 
The first thing I’d try is to find out what encoding the filenames actually are, for example:
Code:
echo uploads/2020/03/*-covid-19-768x1067.jpg | hd
That will give you a hexdump of the matching file names. With a little bit of experience you can see what encoding that is. However, if your terminal and shell are set to UTF-8 and the file names display correctly, then they're most probably encoded in UTF-8, unless your terminal or shell try to perform some “magic” automatic conversion.

Actually I’m surprised that tar(1) tries to encode/decode anything when LC_ALL is set to “C”. In this case, tar should not interpret the encoding at all, but just treat it as a sequence of bytes. One thing that might be worth trying is to run it with LC_ALL=en_US.ISO8859-7. I’d be interested how that behaves.

By the way, there are quite a lot of ways to make a copy of a local directory tree. For example you can use the cpdup(1) utility (from sysutils/cpdup). I think this is very handy; it’s usually among the first ports that I install on every new machine. The syntax is simple: cpdup source_dir target_dir

Another way is to use cpio(1) which is in FreeBSD’s base system. To make a copy of a directory tree, use the -p option and combine it with find(1) like this:
Code:
cd source_dir
find -d . -print0 | cpio -pdum0 ../target_dir

Note that cp -r should be avoided because it does not create an exact copy. For example, it breaks hardlinks. Better use one of the other tools mentioned above.

One final note: The fact that two characters are visually identical on the screen does not imply that they’re the same character. For example the characters “E” and “Ε” look exactly the same, but the first is code 0045 (latin capital letter E), and the second is code 0395 (greek capital letter Epsilon).
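
For example, piping both characters through hd(1) makes the difference visible immediately (assuming a UTF-8 terminal):
Code:
printf 'E\n' | hd          # U+0045 latin capital letter E
printf '\xce\x95\n' | hd   # U+0395 greek capital letter Epsilon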
 
Actually I’m surprised that tar(1) tries to encode/decode anything when LC_ALL is set to “C”. In this case, tar should not interpret the encoding at all, but just treat it as a sequence of bytes. One thing that might be worth trying is to run it with LC_ALL=en_US.ISO8859-7. I’d be interested how that behaves.

When I set LC_ALL to "C", tar treats the name as a sequence of bytes (and shows the "Can't translate pathname" messages), and after the untar the file is accessible from the web server.

If I set it to "en_US.UTF-8" then tar changes the filename to NFC form.

The filename is:

Code:
αφισα-για-οδοντιαατρικη-περίθαλψη-covid-19-768x1067.jpg

Before tar/untar:

Code:
echo uploads/2020/03/*-covid-19-768x1067.jpg | hd

00000000  75 70 6c 6f 61 64 73 2f  32 30 32 30 2f 30 33 2f  |uploads/2020/03/|
00000010  ce b1 cf 86 ce b9 cf 83  ce b1 2d ce b3 ce b9 ce  |..........-.....|
00000020  b1 2d ce bf ce b4 ce bf  ce bd cf 84 ce b9 ce b1  |.-..............|
00000030  ce b1 cf 84 cf 81 ce b9  ce ba ce b7 2d cf 80 ce  |............-...|
00000040  b5 cf 81 ce b9 cc 81 ce  b8 ce b1 ce bb cf 88 ce  |................|
00000050  b7 2d 63 6f 76 69 64 2d  31 39 2d 37 36 38 78 31  |.-covid-19-768x1|
00000060  30 36 37 2e 6a 70 67 0a                           |067.jpg.|
00000068

After tar/untar:

Code:
echo uploads/2020/03/*-covid-19-768x1067.jpg | hd

00000000  75 70 6c 6f 61 64 73 2f  32 30 32 30 2f 30 33 2f  |uploads/2020/03/|
00000010  ce b1 cf 86 ce b9 cf 83  ce b1 2d ce b3 ce b9 ce  |..........-.....|
00000020  b1 2d ce bf ce b4 ce bf  ce bd cf 84 ce b9 ce b1  |.-..............|
00000030  ce b1 cf 84 cf 81 ce b9  ce ba ce b7 2d cf 80 ce  |............-...|
00000040  b5 cf 81 ce af ce b8 ce  b1 ce bb cf 88 ce b7 2d  |...............-|
00000050  63 6f 76 69 64 2d 31 39  2d 37 36 38 78 31 30 36  |covid-19-768x106|
00000060  37 2e 6a 70 67 0a                                 |7.jpg.|
00000066
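
To double-check where the conversion happens, the name as stored inside the archive itself can be hexdumped too (a sketch reusing the path from above):

Code:
tar -tf uploads.tar | grep 'covid-19-768x1067' | hd

If the archive created with en_US.UTF-8 already shows ce af at that position, the normalization happened at pack time rather than at extraction.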
 
When I set LC_ALL to "C", tar treats the name as a sequence of bytes (and shows the "Can't translate pathname" messages)
That is what I wonder about. With LC_ALL set to “C”, it should not try to translate anything. It should not matter if those bytes are UTF-8 or ISO8859-something or even a DOS codepage. It should take the bytes as-is without translation.

Thanks for the hex dump. There is indeed a small difference between “before” and “after” in the 5th line of each hex dump:
Before tar/untar:
00000040  b5 cf 81 ce b9 cc 81 ce  b8 ce b1 ce bb cf 88 ce |................|
After tar/untar:
00000040  b5 cf 81 ce af ce b8 ce  b1 ce bb cf 88 ce b7 2d  |...............-|

In the first case (“before”), there is a character followed by a so-called combining character that modifies the first character:
UTF8 ce b9 = Unicode 03B9 = “greek small letter iota”
UTF8 cc 81 = Unicode 0301 = “combining acute accent”
In other words, this is a greek small iota character, with an acute accent put on top.

In the second case (“after”), there is a single character that already has an accent:
UTF8 ce af = Unicode 03AF = “greek small letter iota with tonos”
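
A quick way to see the two spellings side by side is to hexdump them directly (assuming a UTF-8 terminal):
Code:
printf '\xce\xb9\xcc\x81\n' | hd   # iota + combining acute accent (decomposed)
printf '\xce\xaf\n' | hd           # iota with tonos (precomposed)
Both should render as the same glyph, but the dumps differ.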

So, obviously, some code inside tar has converted that character with its accent. I’m not sure if this is correct at all. I don’t know the Greek language, so I can’t tell if a “tonos” is even the same thing as an acute accent. It looks slightly different on my screen with the font that I’m using (I have to look very closely, though). Anyway, it seems to make a real difference for the web server.
 
Nice recommendation for future reference.

I use UFS2 so it's not possible.
UFS is encoding-agnostic, so it doesn’t matter. UFS only interprets two characters specially: the slash “/” for separation of path components, and the NUL byte (0x00) for string termination. Everything else is just bytes, and UFS doesn’t care if it’s US-ASCII, UTF-8, ISO8859, CP437, KOI8-R or whatever.

PS: There are in fact encodings that don’t work with UFS, for example UTF-16. But then again, software wouldn’t be able to handle such file names anyway, because the POSIX standard functions don’t support it.
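
A quick illustration with tools from the base system: even the UTF-16 encoding of a plain “A” already contains a NUL byte, which would terminate the path name at the POSIX level:
Code:
printf 'A' | iconv -f UTF-8 -t UTF-16BE | hd
The dump shows the bytes 00 41, and that 00 is exactly what makes such file names impossible.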
 
So, obviously, some code inside tar has converted that character with its accent. I’m not sure if this is correct at all. I don’t know the Greek language, so I can’t tell if a “tonos” is even the same thing as an acute accent. It looks slightly different on my screen with the font that I’m using (I have to look very closely, though). Anyway, it seems to make a real difference for the web server.

Yes "tonos" is the same as "acute accent".
 
zpool create -O utf8only=on ... should avoid this kind of issue?

Wouldn't help, as both names are perfectly correct UTF-8 strings. It turns out that /usr/bin/tar normalizes Unicode glyphs when packing, but only if the locale is set accordingly. Point to remember.

Code:
root@freebsd12:/tank# zfs get utf8only /tank
NAME  PROPERTY  VALUE     SOURCE
tank  utf8only  on        -
root@freebsd12:/tank# eval `printf '> \xcf\x81\xce\xb9\xcc\x81\xce\xb8\xce\xb1\xce\xbb\xcf\x88'`
root@freebsd12:/tank# eval `printf '> \xcf\x81\xce\xaf\xce\xb8\xce\xb1\xce\xbb\xcf\x88'`
root@freebsd12:/tank# ls -Al|cat
total 1
-rw-r--r--  1 root  wheel  0 Sep 10 08:08 ρίθαλψ
-rw-r--r--  1 root  wheel  0 Sep 10 08:06 ρίθαλψ

Please note that I had to use |cat to display the files correctly. Otherwise there is a bunch of question marks:

Code:
root@freebsd12:/tank# ls -al
total 6
drwxr-xr-x   2 root  wheel    4 Sep 10 08:08 .
drwxr-xr-x  21 root  wheel  512 Sep 10 08:04 ..
-rw-r--r--   1 root  wheel    0 Sep 10 08:08 ????????????
-rw-r--r--   1 root  wheel    0 Sep 10 08:06 ??????????????
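
An alternative to piping through cat should be ls's -w flag, which forces raw printing of non-printable characters (untested in this particular setup):

Code:
ls -Alw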

After some locale tweaking:

Code:
root@freebsd12:/tank# LANG=en_US.UTF-8 ls -al
total 6
-rw-r--r--   1 root  wheel    0 Sep 10 08:08 ρίθαλψ
-rw-r--r--   1 root  wheel    0 Sep 10 08:06 ρίθαλψ
drwxr-xr-x   2 root  wheel    4 Sep 10 08:08 .
drwxr-xr-x  21 root  wheel  512 Sep 10 08:04 ..

But now the dots follow the letters.

I mean... WHY???
 
After some locale tweaking:

Code:
root@freebsd12:/tank# LANG=en_US.UTF-8 ls -al
total 6
-rw-r--r--   1 root  wheel    0 Sep 10 08:08 ρίθαλψ
-rw-r--r--   1 root  wheel    0 Sep 10 08:06 ρίθαλψ
drwxr-xr-x   2 root  wheel    4 Sep 10 08:08 .
drwxr-xr-x  21 root  wheel  512 Sep 10 08:04 ..

But now the dots follow the letters.

I mean... WHY???
There's no rule that says the current and previous directory have to be displayed first. Consistency issues? Then you might have an argument, perhaps.
ls is agnostic to wide byte characters; it simply leaves it to the terminal emulator to display them correctly.
The order is, I believe, defaulting to modification time should you not specify a sort order. (That's just a hazy recollection).
 
ls is agnostic to wide byte characters
Not really (and it probably shouldn't be); I've just demonstrated that with the examples above.

Anyway, continuing this boring investigation: what's going to happen if we put those two files into the tar-slator? We get a tar archive with two files with the same name:

Code:
root@freebsd12:/tank# echo $LANG
C.UTF-8
root@freebsd12:/tank# tar -cf- * | tar -vtf- | LANG=C cat -v
-rw-r--r--  0 root   wheel       0 Sep 10 08:08 M-OM-^AM-NM-/M-NM-8M-NM-1M-NM-;M-OM-^H
-rw-r--r--  0 root   wheel       0 Sep 10 08:06 M-OM-^AM-NM-/M-NM-8M-NM-1M-NM-;M-OM-^H

(Note that I had to reset LANG back to C to make cat -v work; otherwise it just kept printing Greek letters.)

Code:
root@freebsd12:/tank# cat *
File 1
...and file 2.
root@freebsd12:/tank# mkdir bucket
root@freebsd12:/tank# tar -cf- ρ* | tar -xf- -C bucket
root@freebsd12:/tank# cat bucket/*
...and file 2.

And we have a data loss incident.
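
For comparison, and going by the behaviour reported earlier in the thread, a C-locale round trip should leave the name bytes untouched (apart from the translation warnings), so both files ought to survive. A sketch:

Code:
mkdir bucket2
LC_ALL=C tar -cf- ρ* | LC_ALL=C tar -xf- -C bucket2
cat bucket2/*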

If you're surprised that this is even an issue, you may want to do some additional reading on how Unicode is messed up (starting from "...There are a million broken assumptions"). For example this:
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
 
After some locale tweaking:
Code:
root@freebsd12:/tank# LANG=en_US.UTF-8 ls -al
total 6
-rw-r--r--   1 root  wheel    0 Sep 10 08:08 ρίθαλψ
-rw-r--r--   1 root  wheel    0 Sep 10 08:06 ρίθαλψ
drwxr-xr-x   2 root  wheel    4 Sep 10 08:08 .
drwxr-xr-x  21 root  wheel  512 Sep 10 08:04 ..
But now the dots follow the letters.
I mean... WHY???
Because you have set LANG to en_US.*, and the English language doesn’t know how to sort Greek letters.
Try setting LC_COLLATE to el_GR.UTF-8. Then programs should use Greek rules for ordering (provided that the programs support locale settings correctly, of course – I’m not sure if ls(1) does, though).
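
Something like this, for example (assuming the el_GR.UTF-8 locale is available on the system):
Code:
LC_COLLATE=el_GR.UTF-8 ls -al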
 
I have a much more basic question: Why is tar even changing the path names (file names and directory names)? At the POSIX interface level, path names are just binary strings of bytes, and they have no meaning or semantics. The only two bytes that are "special" and are to be interpreted when managing path names are (as olli already said) slash and nul: Slash to indicate where directories are in the path name, and nul to find the end of it. This is by the way the reason why you can't use 16-bit encodings of strings (such as UTF-16 or Windows 16-bit) as file names: one of the bytes might be a slash.

It is perfectly legal for an application to create path names that contain arbitrary bytes in the range 1...255 (except 47, the slash), and they don't have to be valid UTF-8, even less do they have to be valid Unicode, and even less do they have to be correctly normalized Unicode. Even if I run one particular process (login session) with a particular LANG setting, arbitrary bytes are still valid path names. If tar is interpreting path names as Unicode strings, and modifying them by normalizing them, I think that is a bug in tar. And as Matlib pointed out, it is a bug that can cause data loss (if tar modifies the names such that we get a name collision and one file overwrites another).
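
As a quick illustration of that point (the file name here is just made up for the example), the kernel happily accepts bytes that are not even valid UTF-8:

Code:
touch "$(printf '\xb1\xb2-not-utf8')"   # 0xB1 0xB2 are stray continuation bytes, not valid UTF-8
ls -B                                   # FreeBSD ls: show non-printable bytes as \xxx (octal)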

The underlying problem is that the Unix (POSIX) kernel interface for path names is too old, and was created before concepts such as Unicode existed. It doesn't clarify what the semantics of file and path names is. And when Unicode showed up, it was too late to change the definition of path names. A sensible definition would have been that all path names have to be valid and correctly normalized UTF-8 strings according to Unicode standard XYZ (some published version), but that wasn't done.
 