Beeblebrox said:
I'm nearing twenty years of telling people not to use anything other than English characters in their file names.
You and me both. I think our innocence was lost when we (as a Unix community) allowed spaces, pipe characters, and pound signs in file names. After that, the descent into 8-bit hell was inevitable.
I used tar to unzip. The file command output for the zipped archive:
Code:
shouse.zip: application/zip charset=binary, Zip archive data, v2.0 to extract
Good. It seems that tar did the job well: it took whatever binary bytes were in the file names inside the .zip and put them into the extracted file names verbatim (more on that below). You might try unzip instead of tar, but I bet it will make no difference.
The encoding on my system is the same as the language used to name the files.
But I worry that the OP may be literally running with his encoding set to "xy", which might have fascinating side effects
Srsly? I'm dumb but not that dumb.
Good. Sorry about accusing you of running in the (so far unknown) xy language, but you never know.
I manually fixed the names of the jpg files using the original charset, but when listing folder contents I would see this:
Code:
-rw-r--r-- 1 me wheel 2220 Feb 19 15:03 insta-xyYX.png (invalid encoding)
It turns out that was Nautilus' doing. When I manually changed the names, it took the liberty of tagging the "(invalid encoding)" bit of text onto the tail of the file name. Nice.
Actually, from its (demented) viewpoint that even makes some sense. You are running in a UTF-8 based encoding (your locale is xy-xy.utf8). Nautilus (like many other programs) thinks that file names are strings, and therefore must be handled with Unicode-based encoding routines. Sadly, they are not "strings" in the sense of normal string processing, since you have no idea what encoding they are in. So Nautilus' internal string conversion routines do the next best thing.
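To see what those conversion routines are up against, here is a small Python sketch. The byte string is hypothetical, standing in for any file name that contains a stray byte which is not valid UTF-8:

```python
# A hypothetical file name as raw bytes; 0x9f is not valid UTF-8.
raw = b"insta-\x9fname.png"

try:
    raw.decode("utf-8")  # strict decoding fails outright
except UnicodeDecodeError as exc:
    print(exc)

# The "next best thing": substitute U+FFFD (the replacement character)
# for whatever cannot be decoded. This is roughly how a Unicode-minded
# GUI ends up displaying a mangled, tagged-on name.
print(raw.decode("utf-8", errors="replace"))
```

The strict decode raises, and the lenient decode silently swaps the bad byte for a placeholder glyph, which is why the displayed name no longer matches the bytes on disk.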
Since the previous files have been modified, this was run on a fresh untar of the original zip file, and since the list is long, just a sample of the problem files should suffice:
find . -name "fsquare*" -type f -print0 | hexdump -C
Code:
00000000 2e 2f 66 73 71 75 61 72 65 2d 9f 69 9f 68 6f 75 |./fsquare-.i.hou|
...
0000008f
And here is the problem: I see the hyphen in fsquare- (hex 2d), and the next character after it is hex 9F. Unfortunately, that is neither a valid 8859-1 character nor a valid UTF-8 encoding. I just tried the Python UTF-8 decoder on "9f 69", and it blows up on the 9f, saying "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 0: invalid start byte". So we know that the string is encoded in something that is not UTF-8.
Here's my suggestion: The real problem is that Nautilus thinks that all strings must be xy-XY.utf8 encoded, and it gets confused by the hex 9F. So try just running with your locale set to "C". Nautilus might just work, and leave the 0x9f as a character (which will probably not display anything sensible; on my Mac iTerm it is a black diamond with a question mark inside). But Nautilus might work well enough to rename the file.
If that fails: Get a shell window. Make sure all locale-specific environment variables are unset (LANG, LC_COLLATE, and so on; I typically wipe out every shell variable in the output of set that begins with "L", except for LINES and LOGNAME). At that point, you should be able to rename the offending file by using a glob pattern:
mv fsquare*house-150x50.png foo1.png. The nice thing about the mv command and the shell is that they are quite stupid, and don't try any smart tricks with string management.
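If you would rather do that rename from a script, the byte-level equivalent in Python is a glob over bytes, so the 0x9f is matched literally and never run through a codec. This is a sketch; the helper name, pattern, and target are illustrative:

```python
import glob
import os

def rename_by_glob(pattern: bytes, target: bytes) -> bool:
    """Rename the single file matching a bytes glob pattern.

    Using bytes keeps the shell-style dumbness: no decoding, no
    string cleverness, just raw bytes matched and renamed.
    """
    matches = glob.glob(pattern)
    if len(matches) != 1:  # refuse to guess if the glob is ambiguous
        return False
    os.rename(matches[0], target)
    return True
```

For example, `rename_by_glob(b"fsquare*house-150x50.png", b"foo1.png")` would do the same job as the mv command above.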
Hope this helps. If it doesn't, we'll have to take drastic measures: take the output of
find . -type f -print0, pipe it into the scripting language of your choice, and inside the script carefully assemble the raw file names and feed them straight into the rename() system call.
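That last step might look something like this in Python, offered as a sketch: walk the tree with bytes paths so no encoding is ever assumed, and hand the raw names straight to os.rename(), which wraps the rename() system call. The replacement scheme (every non-printable-ASCII byte becomes "_") is just one possible choice, and this simple version does not guard against name collisions:

```python
import os

def fix_names(root: bytes = b".") -> None:
    """Rename every file whose raw name is not valid UTF-8."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                name.decode("utf-8")  # cleanly decodable: leave it alone
            except UnicodeDecodeError:
                # Replace every byte outside printable ASCII with '_'.
                safe = bytes(b if 0x20 <= b < 0x7F else 0x5F for b in name)
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, safe))
```

Because everything stays in bytes end to end, the stray 0x9f never touches a codec; it is simply read from the directory and rewritten.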