Backspace doesn't remove utf-8 multibyte characters on console

aeifn · Jan 24, 2020

In FreeBSD 12.1, backspace in a terminal deletes only one byte of a multibyte character.
For example:

Code:

# cat > /tmp/test
tы<backspace>t
# cat /tmp/test
t�t
# hexdump -C /tmp/test
74 d1 74 0a

Here 'ы' is a Russian letter (U+044B, encoded in UTF-8 as the two bytes D1 8B). As the hexdump shows, backspace deleted only the second byte of the multibyte character, leaving a stray D1 in the file.
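
The result above is what one would expect if the erase key simply drops the last byte of the pending input line. A minimal sketch of that behaviour follows. It is a simulation, not the actual FreeBSD tty code; the function name canonical_input and the choice of 0x7F as the erase byte are assumptions made for illustration.

```python
def canonical_input(keys: bytes, erase: int = 0x7F) -> bytes:
    """Simulate a line buffer whose erase key removes exactly one byte.

    Hypothetical helper for illustration; 0x7F (DEL) is assumed to be
    the erase character.
    """
    line = bytearray()
    for b in keys:
        if b == erase:
            if line:
                line.pop()  # drops one byte, not one character
        else:
            line.append(b)
    return bytes(line)


# 't', 'ы' (D1 8B in UTF-8), backspace, 't', newline
typed = "tы".encode("utf-8") + b"\x7f" + b"t\n"
result = canonical_input(typed)
print(" ".join(f"{b:02x}" for b in result))  # -> 74 d1 74 0a
```

The lone D1 left behind is an incomplete UTF-8 sequence, which is why cat shows a replacement character between the two 't's.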

yuripv · Jan 25, 2020

What is the output of locale? What is your shell? Also, define "terminal": X11 terminal, ssh session, or system console?

aeifn · Jan 25, 2020

Thank you for the answer!

Code:

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

This behavior is observed on the vt(4) console and also under xterm(1).

I know that on Linux this is solved by the iutf8 terminal flag, but I don't know of an answer for FreeBSD.

yuripv · Jan 25, 2020

Interesting. Yes, I see the problem both when ssh'ing from Windows using PuTTY and when using xterm locally. I don't have Linux installed anywhere, but I can't reproduce it in iTerm2 on my MBP. Something to look into.

aeifn · Jan 25, 2020

I found a freebsd-bugs bug report from June 2017 (with no answer) describing the same problem.

memreflect · Jan 26, 2020

The IUTF8 terminal input flag is arguably a hack, but it does solve the problem at least...for UTF-8. Unfortunately, that still leaves other multibyte encodings like GB18030, EUC-KR, and Shift JIS that suffer from the same trouble as well (try U+6F22 U+8A9E 漢語; it's made up of two 3-byte UTF-8 sequences--E6 BC A2 and E8 AA 9E).

It's a chicken-and-egg problem really: the terminal is the one buffering the data before handing it off to read(2) (called by functions like getwchar(3) and fgets(3)), yet the terminal can't delete the entire character sequence because it can't possibly know what you're doing with it. Things like the IUTF8 hack are perhaps a step in the right direction, but multibyte encodings are simply not an easy thing to deal with for terminals, especially when you consider the many control sequences, termios(4) support, and such to contend with.

It's not that it can't be done; it's just a daunting amount of work that very few, if any, people have put in. xterm currently uses luit(1) for transforming characters from UTF-8 to the native encoding and vice-versa, but it still isn't a perfect solution.
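
To illustrate why an IUTF8-style erase works for UTF-8 in particular: every UTF-8 continuation byte has the bit pattern 10xxxxxx, so the last character can be found by scanning backwards over those bytes to the lead byte. The following is a rough sketch of that idea, not the code of any real kernel; the function name erase_char_utf8 is made up for this example.

```python
def erase_char_utf8(line: bytearray) -> None:
    """Remove the last UTF-8 character from a pending input line.

    Hypothetical helper: pops trailing continuation bytes (10xxxxxx),
    then the lead byte (or ASCII byte) that precedes them.
    """
    while line and (line[-1] & 0xC0) == 0x80:
        line.pop()
    if line:
        line.pop()


line = bytearray("t漢語".encode("utf-8"))  # 74, E6 BC A2, E8 AA 9E
erase_char_utf8(line)
print(line.decode("utf-8"))  # -> t漢
erase_char_utf8(line)
print(line.decode("utf-8"))  # -> t
```

The bit test is specific to UTF-8. In encodings such as Shift JIS a trailing byte can fall in the ASCII range, so the same backward scan cannot tell where a character starts, which is why a per-encoding flag does not generalize.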

aeifn · Jan 27, 2020

I found a practical workaround using rlwrap(1), which handles line editing itself before passing the finished line on.
So I can use:

Code:

rlwrap cat > test

I also found another discussion of the underlying iutf8 issue in the FreeBSD terminal, in the Debian kFreeBSD community.

Oleg_NYC · Oct 10, 2023

Even though we have this commit https://cgit.freebsd.org/src/commit/?id=128f63cedc14ae21b35f74e11e2fe1a5659c58e8 and this commit https://cgit.freebsd.org/src/commit/?id=9e589b0938579f3f4d89fa5c051f845bf754184d , this problem with deleting Russian utf-8 characters still persists. I have no idea why. Maybe you can tell me why it persists.

Oleg_NYC · Oct 10, 2023

Never mind. The author of these patches spoke to me about this.