FreeBSD 'tr' utility doesn't work as it should.

$ dd if=/dev/zero ibs=1k count=32 | tr "\000" "\377" > file_ff.bin

In theory, the output file should be 32 KB of FF bytes. In fact, we got this output:

00000000 C3 BF C3 BF │ C3 BF C3 BF │ C3 BF C3 BF │ C3 BF C3 BF │ C3 BF C3 BF │ C3 BF C3 BF │ C3 BF C3 BF ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ ÿ

And the file size is 65536 bytes instead of the expected 32768.

Any ideas why it happens?

P.S. This is an example of how it should look after running the command:
00000000 FF FF FF FF │ FF FF FF FF │ FF FF FF FF │ FF FF FF FF │ FF FF FF FF │ FF FF FF FF │ FF FF FF FF
 
Characters are treated as UTF by the looks of it. You might want to set LC_ALL to C before running this command.
 
Text encoding issue: Your locale is probably using UTF-8, and C3 BF is the UTF-8 encoding of U+00FF, the Unicode character "ÿ". See the man page for tr: "The LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of tr as described in environ(7)."

Try "dd if=/dev/zero ibs=1k count=32 | LANG=C tr "\000" "\377" ..."
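As a minimal sketch of that suggestion (the function name make_ff is made up for illustration, and bs=1k is used in place of ibs=1k only for brevity), the locale override can be written directly in front of tr so that it applies to tr alone. LC_ALL is used here because it takes precedence over LANG and LC_CTYPE:

```shell
# make_ff is a hypothetical helper: write N KiB of 0xFF bytes to stdout.
# LC_ALL=C is set only for tr, so tr treats \377 as the single byte 0xFF
# rather than as a character to be encoded in the current locale.
make_ff() {
    dd if=/dev/zero bs=1k count="$1" 2>/dev/null | LC_ALL=C tr '\000' '\377'
}

# Count the bytes produced for 32 KiB of input; this should report 32768.
make_ff 32 | wc -c
```

Redirecting the function's output to a file (for example make_ff 32 > file_ff.bin) should give the 32768-byte file the original command was meant to produce.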

Edited to add: Good morning, Happy New Year, Sir Dice beat me by a few seconds.
 
Yes, of course. My locale is UTF-8, which is the default locale for FreeBSD. But it should work without this LANG=C patch because the command itself tr "\000" "\377" doesn't contain any UTF characters.


Code:
\octal
         Octal sequences can be used to represent characters with specific
         coded values. An octal sequence consists of a backslash followed
         by the longest sequence of one-, two-, or three-octal-digit
         characters (01234567). The sequence causes the character whose
         encoding is represented by the one-, two- or three-digit octal
         integer to be placed into the array. Multi-byte characters require
         multiple, concatenated escape sequences of this type, including the
         leading \ for each byte.
 
But it should work without this LANG=C patch because the command itself tr "\000" "\377" doesn't contain any UTF characters.
They're treated as UTF due to your locale(1) settings.

Code:
     LC_CTYPE         Locale to be used for character classification (letter,
                      space, digit, etc.) and for interpreting byte sequences
                      as multibyte characters.
 
It's not a bug, it just isn't what you expected to happen.
 
Whereas I agree with everything ralphbsz and SirDice said, this does in fact work for me, even with LANG and LC_ALL unset.

Code:
$ export LC_ALL=""; export LANG=""
$ dd if=/dev/zero ibs=1k count=32 | tr "\000" "\377" > foo
32+0 records in
64+0 records out
32768 bytes transferred in 0.000142 secs (230275687 bytes/sec)
$ ls -l foo
-rw-r--r--  1 jose  jose  32768 Jan  2 09:04 foo
$ file foo
foo: ISO-8859 text, with very long lines, with no line terminators
$ hexdump -C foo
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00008000
$ head -c 40 foo
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

Edit: Export env variables, formatting problem.
 
Disagree. It should work out of the box without any reference to the locale. If the manual says nothing about the locale, then it should accept the command as described in the documentation. As a user, I shouldn't have to be concerned about this.
 
I would agree if I had entered regular characters, but not for an octal code, which is expected to be accepted as an octal code in accordance with the documentation, whatever the locale on the server.
 
tr(1) operates on characters, that's what the tool does.

Code:
     \octal     A backslash followed by 1, 2 or 3 octal digits represents a
                character with that encoded value.  To follow an octal
                sequence with a digit as a character, left zero-pad the octal
                sequence to the full 3 octal digits.
Note how it says "represents a character with that encoded value". Nowhere does it say it should be treated as a "byte" which is what you were expecting. The UTF character representation of the decimal value 255 (octal 377) is C3 BF.
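To see where C3 BF comes from, here is a small sketch, assuming iconv(1) and od(1) are available. It takes the single byte 0xFF, reads it as ISO-8859-1 (where it is the character "ÿ" with value 255), and re-encodes that character as UTF-8. The function name is made up for illustration.

```shell
# latin1_ff_as_utf8 is a hypothetical helper for illustration only.
# printf '\377' emits the single byte 0xFF; iconv reinterprets it as the
# ISO-8859-1 character with value 255 and writes that character as UTF-8.
latin1_ff_as_utf8() {
    printf '\377' | iconv -f ISO-8859-1 -t UTF-8
}

# Dump the result byte by byte; the two bytes shown should be c3 and bf.
latin1_ff_as_utf8 | od -An -tx1
```

One character in, two bytes out: this is the same doubling seen in the 65536-byte file.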
 
The bug is in the POSIX standard then. Take it up with them.
Tell this to the Debian developers. Debian also has a UTF-8 default locale, but there is no problem there with tr(1). It works there as expected and as explained in the manual.
 
Note how it says "represents a character with that encoded value". Nowhere does it say it should be treated as a "byte" which is what you were expecting. The UTF character representation of the decimal value 255 (octal 377) is C3 BF.
Thanks for the bug confirmation. Decimal value 255 = FF in hex and = 377 in octal. This is how it is supposed to work.
 
I don't think it's a UTF problem but rather a locale one.
Code:
# env|grep L[AC]
LANG=sk_SK-UTF-8
LC_CTYPE=sk_SK-UTF-8
LC_ALL=sk_SK-UTF-8
# dd if=/dev/zero ibs=1k count=32 | tr "\000" "\377"|hd
32+0 records in
64+0 records out
32768 bytes transferred in 0.000422 secs (77670271 bytes/sec)
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00008000
#
# env|grep L[AC]
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_ALL=en_US.UTF-8
# dd if=/dev/zero ibs=1k count=32 | tr "\000" "\377"|hd
32+0 records in
64+0 records out
32768 bytes transferred in 0.000349 secs (93990494 bytes/sec)
00000000  c3 bf c3 bf c3 bf c3 bf  c3 bf c3 bf c3 bf c3 bf  |................|
*
00010000
I'd expect a UTF-8 character to be encoded the same way under the EN and SK locales. So it does sound fishy.

Edit: but looking at my dd output from the SK locale I see it's in English. I'd expect the output to be in Slovak to follow the locale. I checked truss and found out:
Code:
 1274: open("/usr/share/locale/sk_SK-UTF-8/LC_CTYPE",O_RDONLY|O_CLOEXEC,013720646057) ERR#2 'No such file or directory'
So my example might not be the best. I hate locales and won't push this further; I edited my post for clarity that it may not be showing what I wanted to show.
 
Yup, I can reproduce using the en_US.UTF-8 locale
Code:
$ export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8
$ dd if=/dev/zero ibs=1k count=32 | tr "\000" "\377" > foo
32+0 records in
64+0 records out
32768 bytes transferred in 0.000126 secs (259037621 bytes/sec)
$ ls -l foo
-rw-r--r--  1 jose  jose  65536 Jan  2 12:16 foo
$ file foo
foo: UTF-8 Unicode text, with very long lines, with no line terminators
$ hd foo
00000000  c3 bf c3 bf c3 bf c3 bf  c3 bf c3 bf c3 bf c3 bf  |................|
*
00010000
$ head -c 40 foo
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
This is the expected behavior according to the POSIX spec. Don't expect good results when you use tools meant to handle character streams to handle binary data.
 
Thanks for the bug confirmation.
I'm not confirming anything. I was trying to explain where your expectations went the wrong way.
Decimal value 255 = FF in hex and = 377 in octal. This is how it is supposed to work.
IF you assume a 'character' is equal to 1 'byte'. That assumption is wrong.
 
The bug here is that tr does not return EILSEQ for \377 right from the start. Also, "tell Debian developers" is kind of funny; GNU utilities are known to invent their own twisted interpretation of every other standard out there.
 
Where is the bug? Try the following on Debian:
Code:
> dd if=/dev/zero ibs=1k count=32 | LC_ALL=en_US.utf-8 tr "\000" "ÿ" | hexdump -C
00000000  c3 c3 c3 c3 c3 c3 c3 c3  c3 c3 c3 c3 c3 c3 c3 c3  |................|
*
00008000

That's broken. It is supposed to output 32768 characters, each of those with the character value 0xFF = 0377 = "ÿ" in UTF-8 encoding, which means 65536 bytes with the pattern C3 BF. It does not. It instead outputs ... I don't know, either the first half of the UTF-8 encoded character, or perhaps the "capital A with twiddle over it" character in 8859-1, but whatever it is, it does NOT resemble the desired character.

The fundamental problem is this. Text-based tools (such as more, awk, tr and so on) all had to learn how to deal with Unicode characters. Unicode characters have to be encoded, and after encoding they may use more than one byte. With the advent of Unicode, text-based tools (such as tr) have become difficult to use for binary data that is structured as a stream of bytes. The confusing part: if the input string to tr is specified as a hex or octal number, what does it mean? The real question is: why does the Debian version of tr treat the two strings "ÿ" and "\377" differently?

To the question whether this is a bug or not: Please go and read the relevant standards, they may give clear guidance. They may also be ambiguous, in which case (a) implementations are free to pick a sensible thing to do, and (b) the next release of the standard should be tightened.
 
LC_ALL=en_US.utf-8 dd if=/dev/zero ibs=1k count=32 | tr "\000" "ÿ"
That only changes the locale for the first command in the pipeline; try moving LC_ALL before tr.
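A quick way to see this scoping, as a sketch (DEMO_VAR and show_scope are made-up names): a variable assignment written in front of a command is placed in the environment of that command only, not of the other stages of the pipeline.

```shell
# Make sure the demo variable is not already set in this shell.
unset DEMO_VAR

# show_scope is a hypothetical helper: the DEMO_VAR=set prefix applies only
# to the first sh, so the second sh on the right of the pipe does not see it.
show_scope() {
    DEMO_VAR=set sh -c 'echo "left: ${DEMO_VAR:-unset}"' |
        sh -c 'cat; echo "right: ${DEMO_VAR:-unset}"'
}

# Prints "left: set" followed by "right: unset".
show_scope
```

The same holds for LC_ALL: to affect tr, the assignment has to sit directly in front of tr.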

The real "bug" (or rather UB, even if it's "implementation choice") here is that \377 should NOT be treated as U+FF (and rather tr should fail with EILSEQ), octal style sequences can only encode 256 characters (per definition), and if U+ style characters are desired, it should be implemented as \u sequences.
 
That only changes the locale for the first command in the pipeline; try moving LC_ALL before tr.
You are absolutely correct! Although it makes no difference to the result I posted (because that's my default anyway). Edited the command above, so if someone does cut-and-paste it, they do the right thing.

The real "bug" (or rather UB, even if it's "implementation choice") here is that \377 should NOT be treated as U+FF (and rather tr should fail with EILSEQ), octal style sequences can only encode 256 characters (per definition), and if U+ style characters are desired, it should be implemented as \u sequences.
I agree with your reasoning. I wonder whether implementing that would break lots of existing scripts.
 