tar and utf-8

Not really (as it probably shouldn't), I've just demonstrated that with the examples above.

Anyway, continuing this boring investigation: what's going to happen if we put those two files into the tar-slator? We get a tar archive with two files with the same name:

Code:
root@freebsd12:/tank# echo $LANG
C.UTF-8
root@freebsd12:/tank# tar -cf- * | tar -vtf- | LANG=C cat -v
-rw-r--r--  0 root   wheel       0 Sep 10 08:08 M-OM-^AM-NM-/M-NM-8M-NM-1M-NM-;M-OM-^H
-rw-r--r--  0 root   wheel       0 Sep 10 08:06 M-OM-^AM-NM-/M-NM-8M-NM-1M-NM-;M-OM-^H

(Note that I had to reset LANG back to C to make cat -v work; otherwise it just kept printing Greek letters.)
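
For anyone wondering what that cat -v output actually encodes: the M- prefix marks a byte with the high bit set, and ^X marks a control character, so the listing can be turned back into raw bytes. Here is a minimal Python sketch of that decoding. The function name decode_cat_v is mine, not a real tool, and it assumes the input contains no literal "M-" or "^" text, which cat -v output cannot distinguish from its own escapes.

```python
def decode_cat_v(text):
    """Turn cat -v style escapes (M-x, ^X, M-^X) back into raw bytes.

    Assumption: the original data held no literal "M-" or "^" sequences,
    since those are ambiguous in cat -v output.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        high = 0
        # "M-" means the byte had its high bit (0x80) set
        if text.startswith("M-", i) and i + 2 < len(text):
            high = 0x80
            i += 2
        if text[i] == "^" and i + 1 < len(text):
            # "^?" is DEL (0x7f); "^X" is the control character X xor 0x40
            ch = text[i + 1]
            byte = 0x7F if ch == "?" else ord(ch) ^ 0x40
            i += 2
        else:
            byte = ord(text[i])
            i += 1
        out.append(high | byte)
    return bytes(out)


if __name__ == "__main__":
    listing_name = "M-OM-^AM-NM-/M-NM-8M-NM-1M-NM-;M-OM-^H"
    raw = decode_cat_v(listing_name)
    print(raw.hex(" "))
    print(raw.decode("utf-8"))
```

Both names in the listing are the same string, so both decode to the same twelve bytes: six two-byte UTF-8 sequences, with the accented letter stored as the single code point U+03AF.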

Code:
root@freebsd12:/tank# cat *
File 1
...and file 2.
root@freebsd12:/tank# mkdir bucket
root@freebsd12:/tank# tar -cf- ρ* | tar -xf- -C bucket
root@freebsd12:/tank# cat bucket/*
...and file 2.

And we have a data loss incident.
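
One way to spot this before extracting is to group the member names by their normalized form and flag any group with more than one entry. The sketch below is a rough illustration in Python, not a description of what tar does internally. It assumes the two names differ only in Unicode normalization (a precomposed accented letter versus a base letter plus a combining accent), which the transcript above does not prove on its own, and the names colliding_members and build_demo_archive are made up for this example.

```python
import io
import tarfile
import unicodedata
from collections import defaultdict


def colliding_members(tar_bytes, form="NFC"):
    """Return {normalized name: [member names]} for names that collide."""
    groups = defaultdict(list)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            # Names that normalize to the same string may overwrite each
            # other on extraction, depending on the tool and filesystem
            key = unicodedata.normalize(form, member.name)
            groups[key].append(member.name)
    # Keep only the normalized names claimed by more than one member
    return {key: names for key, names in groups.items() if len(names) > 1}


def build_demo_archive():
    """Build an in-memory archive with two names differing only in normalization."""
    entries = (
        ("\u03af", b"File 1\n"),                # precomposed iota with tonos
        ("\u03b9\u0301", b"...and file 2.\n"),  # iota + combining acute accent
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        for name, payload in entries:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    for key, names in colliding_members(build_demo_archive()).items():
        print(repr(key), "<-", [n.encode("utf-8").hex(" ") for n in names])
```

An empty result only means no two names collapse under the chosen normalization form; it says nothing about byte-identical duplicates being safe, since those land in the same group as well and would be reported too.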

If you're surprised that this is even an issue, there is some additional reading on how Unicode is messed up (starting from ...There are a million broken assumptions). For example this:
No, that just shows it displays them. ls has no concept of character display width. It does not care what encoding you have.
 