This is something funny I just ran into.
Computers only process 0s and 1s. But since individual bits are laborious and impractical to handle, people agreed to group them into larger entities. There were different approaches, but with the spread of (usually 8-bit) microprocessors, the octet, or byte, became the generally established grouping scheme.
This was all fine as long as the characters of the commonly used language, i.e. English, would fit into those 8 bits. One only needed to agree on some mapping table that assigns numbers to the characters - like ASCII or EBCDIC - and everything was fine. Nobody needed to care what the bits might actually represent - deciding that was entirely left to the final addressee.
But now things are different: now we use UTF-8 for the characters of the spoken language, and a byte is no longer a character.
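A quick illustration in Ruby (which is where the error further down comes from): a single character may now occupy several bytes, and a lone byte taken out of a string need not be a character at all.

    s = "é"
    s.length     # => 1   one character...
    s.bytesize   # => 2   ...but two bytes in UTF-8
    s.bytes      # => [195, 169]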
So, what happened: I tried to store away some cipher material. And in the age of web applications, the best place to store some lounging-around cipher material is the cookie. Only, this is not so easy, and this is where the fun starts. Cipher material is by its nature random (or at least it should look as random as possible, because that is the whole point of it). Now, if you try to find out what the allowed character set for a cookie actually is, that can become entertaining[1] - but that is not really the problem, because the application should already take care of that.
What the application does is this: it encrypts the cookie data, adds a signature hash, and then most likely base64-encodes the result, because of the aforementioned character set issues. So I thought it should be just fine to throw my cipher material at it, and it should be well cared for.
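Roughly this kind of pipeline - here only as a minimal sketch with Ruby's standard library; the keys and the helper name are made up for illustration, the real application of course uses its own framework machinery:

    require "openssl"
    require "base64"

    # Hypothetical keys - in a real application these come from the framework's secrets.
    enc_key = OpenSSL::Random.random_bytes(32)
    mac_key = OpenSSL::Random.random_bytes(32)

    def encrypt_and_sign(data, enc_key, mac_key)
      cipher = OpenSSL::Cipher.new("aes-256-cbc").encrypt
      cipher.key = enc_key
      iv = cipher.random_iv
      ciphertext = cipher.update(data) + cipher.final

      payload = Base64.strict_encode64(iv + ciphertext)           # cookie-safe characters
      mac = OpenSSL::HMAC.hexdigest("SHA256", mac_key, payload)   # the signature hash
      "#{payload}--#{mac}"
    end

    cookie_value = encrypt_and_sign("some cookie data", enc_key, mac_key)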
But that wasn't the case; I got a complaint: UndefinedConversionError "\xAD" from ASCII-8BIT to UTF-8 (so now You know one of the bytes of my cipher material).
As usual, I looked into the code to see what was happening there, and it seems the application does not only encrypt&sign the data; as a safety measure it first converts it to JSON. And JSON, it seems, has no notion whatsoever of what a byte might be; it only knows data as language text, and language text must have a correct encoding.
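The complaint is easy to reproduce. Somewhere along the JSON path the data effectively gets re-encoded to UTF-8, and for arbitrary bytes that cannot work (the byte values below are placeholders, not my actual material):

    # A "string" that is really just bytes: Ruby tags it ASCII-8BIT, i.e. binary.
    cipher_material = "\xAD\x01\xFF".b
    cipher_material.encoding      # => #<Encoding:ASCII-8BIT (BINARY)>

    # JSON only knows text, and text has to be valid UTF-8 - bytes are not text.
    cipher_material.encode("UTF-8")
    # => Encoding::UndefinedConversionError: "\xAD" from ASCII-8BIT to UTF-8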
The surprising thing is: nobody ever worried about that - nobody seems to need, use or know a byte anymore.
As proof, I give You this example. Here it is stated: "All stored data, even ASCII, has an encoding."[2]
I would think ASCII is not "stored data"; ASCII is an encoding. And I am very sure cipher material can be "stored data" and does not have a character set encoding. Neither does machine code. Nor music. Nor pictures. Etc. etc. In fact nothing, except the special case of language text, has a character encoding.
So what seems to have happened here is this: people learned in the ASCII days that, for most practical purposes, byte strings and language text can be treated the same. And now they apply that same wisdom to the UTF-8 world: anything that is not UTF-8 simply doesn't exist.
The common workaround is that, in the modern world of web applications, all cipher material and similar delicate stuff must be run through base64 before being handed to the application. Which in this case means: do base64, then encrypt&sign, then do base64 again - and repeat as often as necessary. Luckily we have ample bandwidth and get ever more of it - except in a cookie, where there is not infinite space.
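As a sketch (SecureRandom here only stands in for whatever actually produces the cipher material), the dance looks roughly like this:

    require "base64"
    require "securerandom"

    cipher_material = SecureRandom.random_bytes(32)       # raw bytes, not valid UTF-8

    # Step 1: turn the bytes into plain ASCII so JSON, and thus the application, accepts them.
    cookie_payload = Base64.strict_encode64(cipher_material)

    # Step 2: hand cookie_payload to the application, which will JSON it, encrypt & sign it,
    # and base64 the result once more before it finally lands in the cookie.

    # Getting the bytes back later is just the reverse:
    Base64.strict_decode64(cookie_payload) == cipher_material   # => true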
Cheerio.
[1] I don't recommend doing that, but if You do, You may come to a bit more of an understanding of why I am saying HTTP is a tremendous misconception; it was never intended for, and is utterly unsuited to, what we are nowadays doing with it, i.e. running 75% of the world's business through it.
[2] From here You might get a little bit of a clue about where my occasional comments concerning the developers' ivory towers derive from.