Fixing files with "invalid encoding"

I have a bunch of jpg files with names containing non-English letters. The files were zipped on a Linux server, but the encoding got messed up when I unzipped them on FreeBSD. I have corrected the file names by hand, but the file info still shows "invalid encoding". /etc/login.conf has this setting:
Code:
LC_COLLATE=xy_XY.UTF-8:\

  • How do I correct the encoding error for these files?
  • Is there a way to unzip the compressed file correctly without messing up the file names?
 
I don't understand. Who says "invalid encoding"?

Let's start with the basics. You have one or more files in a directory. They have names, which are sequences of bytes. Let's first find out what their real names are (in binary, not as rendered on a human-visible terminal). To do that, go to the directory containing the files and enter the following command: find . -type f -print0 | hexdump -C. Then we can look at the output and we'll know what the names of the files really are.

Next, you say they are zipped. Do you mean many files are packed together into a single zip archive? Or do you mean they have been gzipped or bzipped? To find out, we could use the file utility to determine what kind of file they really are. Just say file *, and let's see what the result is. Which program did you use to unpack them?

I suspect that the problem is that the files have names containing non-ASCII characters, which were valid in the encoding scheme used when "zipping" them on the Linux machine, and which your unzip program attempts to convert using the current locale. Unfortunately, some byte sequences are invalid as strings in certain encodings, for example UTF-8-encoded Unicode.
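To see concretely why some byte sequences can't be interpreted as UTF-8, here is a small Python illustration (the byte values here are made up for the example):

```python
# A byte string that is fine in an 8-bit encoding like Latin-1, but is not
# valid UTF-8: 0x9f cannot start a UTF-8 sequence.
raw = b"caf\x9f"

try:
    raw.decode("utf-8")
    print("valid UTF-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)

# Latin-1 (ISO 8859-1) maps every byte to some code point, so decoding never
# fails -- but 0x9f comes out as an invisible C1 control character.
print(repr(raw.decode("latin-1")))
```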

The next question is this. Your login session runs in a very strange locale, namely xy_XY. What language and country do xy and XY represent? A common locale might for example be en_US or pt_BR (for the dialect of English spoken in the US, or the dialect of Portuguese spoken in Brazil), but I've never heard of a language xy spoken in a country XY.

To help you debug this, we just need more information.

By the way, your problems are an example of why it is nearly always a really bad idea to use anything other than displayable 7-bit ASCII characters in file names. I posted a diatribe about that here in another thread.
 
Not going into the issue at hand (hardly anything more I can add) but:

ralphbsz said:
The next question is this. Your login session runs in a very strange locale, namely xy_XY. What language and country do xy and XY represent? A common locale might for example be en_US or pt_BR (for the dialect of English spoken in the US, or the dialect of Portuguese spoken in Brazil), but I've never heard of a language xy spoken in a country XY.
Not strange at all, I can immediately name one: nl_NL.

And if you check /usr/share/locale you'll find a lot more: ro_RO, no_NO, lv_LV, is_IS. It's a lot more common than you think.
 
I actually grew up about 30 km from Venlo, so I'm quite familiar with the "nl" language (known as Dutch here). And many decades ago, I was in charge of internationalization for a large programming project, and as a test example I used both regular German and the Hessian dialect (I had some fun translating error messages and putting in jokes about apple wine). But I worry that the OP may literally be running with his encoding set to "xy", which might have fascinating side effects, since I don't think the Unicode encoding routines will know how to handle it.
 
ralphbsz said:
I actually grew up about 30 km from Venlo, so I'm quite familiar with the "nl" language (known as Dutch here).
"Dat noemen wij Nederlands ;)" / "Nederlands, thank you very much" ;)

Maybe pushing my luck a little bit here but couldn't resist at this time :)

ralphbsz said:
But I worry that the OP may be literally running with his encoding set to "xy", which might have fascinating side effects, since I don't think the unicode encoding routines will know how to handle it.
And I think you're right, in fact; this is my mistake here.

I did go over the thread but overlooked the xy_XY mention as being literal. Ergo, I drew the wrong conclusions when I went over your message (and didn't bother to double-check against the OP's message).
 
Hi.

This is a zipped web page of a friend's, and I was migrating his hosting when I ran into this problem. The original web designer, in his vast wisdom, decided to use non-standard letters for the jpg file names. I'm nearing twenty years of telling people not to use anything other than English characters in their file names. It does not matter what OS; those characters somehow find a way to get themselves garbled, and they especially love archive files to make that magic happen. That said,
Do you mean many files are packed together into a single zip archive? Or do you mean they have been gzipped or bzipped? ... Which program did you use to unpack them?
I used tar to unzip. The file command for the zipped archive:
Code:
shouse.zip: application/zip charset=binary, Zip archive data, v2.0 to extract
The encoding on my system is the same as the language used to name the files.
But I worry that the OP may be literally running with his encoding set to "xy", which might have fascinating side effects
Srysly? I'm dumb but not that dumb. Let me give you an example..

As stated above, I manually fixed the names of the jpg files using original charset, but when listing folder contents I would see this:
Code:
-rw-r--r--  1 me wheel 2220 Feb 19 15:03 insta-xyYX.png (invalid encoding)
It turns out that was Nautilus' doing. When I manually changed the names, it took the liberty of tacking the "(invalid encoding)" bit of text onto the tail of the file name. Nice. Since it was quite late, this possibility did not occur to me, and I assumed it was a modified file property setting. pcmanfm, for example, does not append this text for the same operation. See? Isn't this one of the dumbest things you've heard?

Since the previous files have been modified, this was run on a fresh untar of the original zip file, and since the list is long, just a sample of the problem files should suffice:
find . -name "fsquare*" -type f -print0 | hexdump -C
Code:
00000000  2e 2f 66 73 71 75 61 72  65 2d 9f 69 9f 68 6f 75  |./fsquare-.i.hou|
00000010  73 65 2d 32 33 32 78 35  30 2e 70 6e 67 00 2e 2f  |se-232x50.png../|
00000020  66 73 71 75 61 72 65 2d  9f 69 9f 68 6f 75 73 65  |fsquare-.i.house|
00000030  2d 31 35 30 78 35 30 2e  70 6e 67 00 2e 2f 66 73  |-150x50.png../fs|
00000040  71 75 61 72 65 2d 9f 69  9f 68 6f 75 73 65 2d 32  |quare-.i.house-2|
00000050  32 36 78 35 30 2e 70 6e  67 00 2e 2f 66 73 71 75  |26x50.png../fsqu|
00000060  61 72 65 2d 9f 69 9f 68  6f 75 73 65 2d 31 30 30  |are-.i.house-100|
00000070  78 35 30 2e 70 6e 67 00  2e 2f 66 73 71 75 61 72  |x50.png../fsquar|
00000080  65 2d 9f 69 9f 68 6f 75  73 65 2e 70 6e 67 00     |e-.i.house.png.|
0000008f

I'd still like to find out how to unzip the blasted files without borking the original encoding in the first place...
 
Beeblebrox said:
I'm nearing twenty years of telling people not to use anything other than English characters in their file names.
You and me both. I think our innocence was lost when we (as a Unix community) allowed spaces, pipe characters, and pound signs in file names. After that, the descent into 8-bit hell was inevitable.

I used tar to unzip. The file command for the zipped archive:
Code:
shouse.zip: application/zip charset=binary, Zip archive data, v2.0 to extract

Good. It seems that tar did the job well: it took whatever binary bytes were in the file names inside the .zip archive and put them into the file names on disk (more on that below). You might try unzip instead of tar, but I bet it will make no difference.

The encoding on my system is the same as the language used to name the files.
But I worry that the OP may be literally running with his encoding set to "xy", which might have fascinating side effects
Srysly? I'm dumb but not that dumb.
Good. Sorry about accusing you of running in the (so far unknown) xy language, but you never know.

I manually fixed the names of the jpg files using original charset, but when listing folder contents I would see this:
Code:
-rw-r--r--  1 me wheel 2220 Feb 19 15:03 insta-xyYX.png (invalid encoding)
It turns out that was Nautilus' doing. When I manually changed the names, it took the liberty of tacking the "(invalid encoding)" bit of text onto the tail of the file name. Nice.

Actually, from its (demented) viewpoint that even makes some sense. You are running in a UTF-8 based locale (xy_XY.UTF-8). Nautilus (like many other programs) thinks that file names are strings, and therefore must be handled by Unicode-based encoding routines. Sadly, they are not "strings" in the sense of normal string processing, since you have no idea what encoding they are in. So Nautilus' internal string conversion routines do the next best thing.
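For what it's worth, here is a sketch (in Python, not Nautilus' actual code) of the kind of "next best thing" such routines can do: the surrogateescape error handler smuggles undecodable bytes through as placeholder code points, so the program can still work with the name.

```python
# How a program can keep working with file names that are not valid in the
# current (UTF-8) locale: with "surrogateescape", undecodable bytes become
# lone surrogate code points, which round-trip back to the original bytes.
name_bytes = b"insta-\x9f.png"  # 0x9f is not valid UTF-8

as_str = name_bytes.decode("utf-8", "surrogateescape")
print(repr(as_str))  # the 0x9f shows up as '\udc9f'

# The round trip is lossless, so the file can still be renamed or deleted:
assert as_str.encode("utf-8", "surrogateescape") == name_bytes
```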

Since the previous files have been modified, this was run on a fresh untar of the original zip file, and since the list is long, just a sample of the problem files should suffice:
find . -name "fsquare*" -type f -print0 | hexdump -C
Code:
00000000  2e 2f 66 73 71 75 61 72  65 2d 9f 69 9f 68 6f 75  |./fsquare-.i.hou|
...
0000008f
And here is the problem: I see the hyphen in fsquare- (hex 2D), and the next character after it is hex 9F. Unfortunately, that is neither a valid 8859-1 character nor a valid UTF-8 encoding. I just tried the Python UTF-8 decoder on "9f 69", and it blows up on the 9f, saying "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 0: invalid start byte". So we know that the string is encoded in something which is not UTF-8.

Here's my suggestion: the real problem is that Nautilus thinks that all strings must be xy_XY.UTF-8 encoded, and it gets confused by the hex 9F. So try just running with your locale set to "C". Nautilus might then leave the 0x9F alone as a raw byte (which will probably not display anything sensible; on my Mac, iTerm shows it as a black diamond with a question mark inside), but it might work well enough to let you rename the file.

If that fails: Get a shell window. Make sure all locale-specific environment variables are unset (LANG, LC_COLLATE, and so on; I typically wipe out all shell variables in the output of set that begin with "L", except for LINES and LOGNAME). At that point, you should be able to rename the offending file using a glob pattern: mv fsquare*house-150x50.png foo1.png. The nice thing about the mv command and the shell is that they are quite stupid and don't try any smart tricks with string management.

Hope this helps. If it doesn't, we'll have to take drastic measures: take the output of find . -type f -print0, pipe it into the scripting language of your choice, and inside the script carefully assemble the file names and feed them straight into the rename() system call.
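A minimal sketch of that drastic measure in Python; the replacement names (file000.png, file001.png, and so on) are arbitrary placeholders of my choosing, and it assumes all the files live in one flat directory:

```python
# Read NUL-separated file names (as produced by `find . -type f -print0`)
# and rename any file whose name contains non-printable or non-ASCII bytes.
# The byte-string names bypass string decoding and go straight to rename(2).
import os

def sanitize_renames(null_separated: bytes, dry_run: bool = True):
    names = [n for n in null_separated.split(b"\0") if n]
    for i, name in enumerate(names):
        if all(0x20 <= b < 0x7f for b in name):
            continue  # already displayable 7-bit ASCII; leave it alone
        ext = os.path.splitext(name)[1].decode("ascii", "ignore")
        new_name = f"file{i:03d}{ext}".encode("ascii")
        if dry_run:
            print(f"would rename {name!r} -> {new_name!r}")
        else:
            # Note: renames into the current directory; fine for a flat dir.
            os.rename(name, new_name)

# To use it: sanitize_renames(sys.stdin.buffer.read(), dry_run=False)
# fed from:  find . -type f -print0 | python3 sanitize.py
```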
 
I've basically slogged my way through the immediate problem (the first question) and it's not really important at this point. But I would very much like to find a way to circumvent this in the future. So it comes to this:
Is there a way to unzip the compressed file correctly without messing up the file names?
Let's assume that there is a way to find out what the original encoding was, and that it's some horrible Microsoft codepage of the 125* family (or whatever your favorite nightmare is). The solution should be as simple as something of the sort: iconv -f <nightmare> -t UTF-8 infile > outfile
I have not tried this for an archive file, but I think it could work. The reason I have not yet tried is below.

I also tried converters/convmv, but did not get results. I now understand (from your hex dump explanation) that I had assumed convmv would magically detect the input file encoding, which it cannot do if the system has no record of it. Maybe the approach could be:
  • Hexdump the problematic file names and identify the hex codes of the unrecognized characters.
  • Search a database (online or installed) for the encoding matching all the hex codes.
  • Install the missing charset library (if necessary).
  • Use iconv, convmv or a similar program to bulk-convert, specifying the convert-from charset manually (assuming the operator is smarter than the PC).
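The trial-and-error part of those steps can be sketched in Python. The raw bytes are the ones from the hexdump above; the candidate list is purely illustrative, so extend it with whatever charsets match the files' language:

```python
# Trial-decode a problematic file name (raw bytes taken from the hexdump)
# against a list of candidate encodings, and show what each one produces.
raw = bytes.fromhex("66 73 71 75 61 72 65 2d 9f 69 9f 68 6f 75 73 65")

for enc in ("utf-8", "iso-8859-1", "cp1250", "cp1252", "cp1254", "cp857"):
    try:
        print(f"{enc:11s} -> {raw.decode(enc)!r}")
    except UnicodeDecodeError as e:
        print(f"{enc:11s} -> undecodable ({e.reason})")
```

A human still has to eyeball the candidates and pick the one that produces sensible words; once found, something like convmv -f <charset> -t utf-8 --notest can do the bulk rename.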

A list of all charsets: http://www.iana.org/assignments/charact ... sets.xhtml
 
This is probably a shortcoming of the ZIP compressor/decompressor implementation and the ZIP file format itself. The zip file should have the information about the character encoding of the file names but is it stored in it at all? I don't see how on earth the encoding could be deduced if it's not contained in the archive itself. Especially with UTF-8 it's not possible in many cases to tell apart UTF-8 and some ISO-8859-* encodings if the strings are very short, there has to be external information to tell that the strings are in fact UTF-8.
 
I just looked up the 9f hex code for Windows-125* family. It corresponds to decimal 159 slot. The windows-125* encoding is most likely the cause of the problem.
I don't see how on earth the encoding could be deduced if it's not contained in the archive itself
Why not? I know what language the files are in, all I have to do is guess at (or by trial-and-error) and force to the archive the various encodings that correspond to that language - be it a Windows variant or some other.
 
What I meant is that in general there's no way to guess the encoding by an automated process. If there's a human involved, it's a different matter altogether.
 
What I forgot to mention above: I had tried the 0x9F character against ISO 8859-1, but the ISO encodings seem to leave the second control-character space (0x80–0x9F) free. And I looked at Windows-1252 (which I think is the most common character set used in Windows, but I'm not sure, not being a Windows person), and there 0x9F corresponds to a capital Y umlaut (which AFAIK is only used in Dutch, as an abbreviation for "IJ", as in the IJsselmeer inland sea). And a file name of "fsquare-IjiIjhouse-232x50.png" seems mighty unlikely.

kpa said:
This is probably a shortcoming of the ZIP compressor/decompressor implementation and the ZIP file format itself. The zip file should have the information about the character encoding of the file names but is it stored in it at all? I don't see how on earth the encoding could be deduced if it's not contained in the archive itself. Especially with UTF-8 it's not possible in many cases to tell apart UTF-8 and some ISO-8859-* encodings if the strings are very short, there has to be external information to tell that the strings are in fact UTF-8.
Well, the ZIP format shares that problem with Posix file systems. They also have no idea what encoding the file name is in. Which is why I advocate (rather obnoxiously) that people use only strings that are completely unambiguous as file names. The best is to restrict yourself to displayable 7-bit ASCII characters. If that is unacceptable, the next best thing is to religiously standardize on Unicode in UTF-8, and make sure all your programs normalize the strings before creating files. Unfortunately, there is no enforcement mechanism in file systems (and the current Posix standard prevents such an enforcement mechanism from being viable).
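To make the normalization point concrete, here is a Python illustration of two names that render identically but differ on the byte level:

```python
# Two Unicode spellings of "café.txt" that render identically but are
# different code-point sequences, hence different byte strings on disk.
import unicodedata

nfc = "caf\u00e9.txt"   # precomposed: U+00E9
nfd = "cafe\u0301.txt"  # decomposed: 'e' + combining acute accent U+0301

print(nfc == nfd)                                # False
print(nfc.encode("utf-8"))                       # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))                       # b'cafe\xcc\x81.txt'
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

A byte-oriented file system will happily store both names in the same directory, which is exactly why programs should normalize before creating files.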

People have actually discussed creating file systems that automatically detect the encoding used by the user-space program (that's easy, by inspecting environment variables) and then automatically encode/decode the file names on the way in/out of the kernel. This leads to bizarre edge cases, though: you may have two files in a directory that seem to have the same name (since character set transformations tend not to be information-preserving), and you may have files whose names are displayable but that you cannot operate on, because their names violate the rules of your user encoding. These problems already occur today when writing clustered file systems or file servers (such as CIFS, NFS or AFP), because you can end up with files whose names are not valid to Windows clients. Today, this area is handled by ad-hockery.

In some cases, as Beeblebrox suggests, the encoding could be guessed, by looking at the language of text files. The problem is that this won't always work (for example it fails in a directory full of .jpg files). For a user-space tool, guessing and heuristics (perhaps with human user confirmation or guidance for ambiguous cases) can work, but it is unsuitable for a file system.

It's a mess.
 
file name of "fsquare-IjiIjhouse-232x50.png" seems mighty unlikely
I know what the file name is; I've already modified it back to the original name. Also, AFAIK all of the Windows-125* family (1250 through 1258) have that particular slot occupied by a character, which means 9 possible letters.

What I don't understand is why the file encoding is correct (no file name problems) when viewed on the Linux server, but becomes corrupted either
  • a) when zipped on the Linux server (highly unlikely), or
  • b) when unzipped on FreeBSD.
I'm going to have to guess that maybe FreeBSD is missing some of the libraries available on Linux?
 
Beeblebrox said:
What I don't understand is why the file encoding is correct (no file name problems) when viewed on the Linux server, but becomes corrupted either
  • a) when zipped on the Linux server (highly unlikely), or
  • b) when unzipped on FreeBSD.
I'm going to have to guess that maybe FreeBSD is missing some of the libraries available on Linux?

What locale settings is the Linux machine running with? Also UTF-8? Because the binary string you had above (the one with 9F 68 and 9F 69 in it) is not valid UTF-8. And on the Linux machine, if you log in from a shell and look at the file name with find and hexdump, do you see the same binary pattern?

Or maybe some part of the Linux machine (for example, the display routines) is just very forgiving about wrongly encoded strings and makes a good heuristic guess. Sloppy but polite of it.
 