Backspace doesn't remove utf-8 multibyte characters on console

aeifn

New Member

Reaction score: 2
Messages: 13

In FreeBSD 12.1, backspace in the terminal deletes only one byte of a multibyte character.
For example:

# cat > /tmp/test
tы<backspace>t
# cat /tmp/test
t�t
# hexdump -C /tmp/test
74 d1 74 0a


Here 'ы' is a Russian letter (U+044B, encoded in UTF-8 as the two bytes D1 8B). You can see that backspace deleted only the second byte of the multibyte character, leaving the lead byte D1 behind.
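
To make the hexdump above easier to follow, here is a minimal Python sketch of a byte-oriented erase. It is only a simulation under the assumption that the line buffer removes exactly one byte per backspace; it is not FreeBSD code, and the name `line` is made up for illustration.

```python
# Simulate a line buffer that is not UTF-8 aware:
# each backspace removes exactly one byte.
line = bytearray("tы".encode("utf-8"))   # bytes 74 d1 8b
del line[-1:]                             # backspace drops only 8b
line += b"t\n"                            # the rest of the typed input

# The stray lead byte d1 is now followed by an ASCII byte,
# so the sequence is no longer valid UTF-8.
print(line.hex(" "))
print(line.decode("utf-8", errors="replace"), end="")
```

The resulting bytes match the hexdump in the post, and decoding them with replacement yields the same broken character that `cat` displays.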
 

yuripv

Well-Known Member

Reaction score: 128
Messages: 285

What is locale output? What is your shell? Also, define "terminal" -- X11 terminal, ssh session, system console?
 

aeifn

Thank you for the answer!
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=


This behavior is observed on the vt(4) console and also under xterm(1).

I know that on Linux this is solved by the iutf8 terminal flag, but I don't know of an answer for FreeBSD.
 

yuripv

Interesting. Yes, I see the problem, both when ssh'ing from Windows using PuTTY and when using xterm locally. I don't have Linux installed anywhere, but I can't reproduce it in iTerm2 on my MBP. Something to look into.
 

memreflect

Active Member

Reaction score: 192
Messages: 219

The IUTF8 terminal input flag is arguably a hack, but it does solve the problem, at least for UTF-8. Unfortunately, that still leaves other multibyte encodings like GB18030, EUC-KR, and Shift JIS, which suffer from the same trouble (try U+6F22 U+8A9E 漢語; in UTF-8 it's made up of two 3-byte sequences, E6 BC A2 and E8 AA 9E).
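
As a rough illustration of the idea behind a UTF-8-aware erase, the sketch below removes trailing continuation bytes (those of the form 10xxxxxx) and then the byte that starts the character. This is an assumption about the general technique, written in Python for readability; `erase_utf8_char` is a hypothetical helper, not an actual kernel or termios function.

```python
def erase_utf8_char(buf: bytearray) -> None:
    """Remove the last UTF-8 character from buf, in place."""
    # Drop trailing continuation bytes (bit pattern 10xxxxxx).
    while buf and (buf[-1] & 0xC0) == 0x80:
        buf.pop()
    # Then drop the lead byte (or a plain ASCII byte).
    if buf:
        buf.pop()


# Usage with the example from the post: 't' followed by two 3-byte characters.
line = bytearray("t漢語".encode("utf-8"))   # 74 e6 bc a2 e8 aa 9e
erase_utf8_char(line)                        # removes e8 aa 9e
after_first_erase = bytes(line)
erase_utf8_char(line)                        # removes e6 bc a2
after_second_erase = bytes(line)
```

This only works because UTF-8 continuation bytes can never be mistaken for lead bytes or ASCII, which is exactly the property the other encodings mentioned above do not share.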

It's a chicken-and-egg problem really: the terminal is the one buffering the data before handing it off to read(2) (called by functions like getwchar(3) and fgets(3)), yet the terminal can't delete the entire character sequence because it can't possibly know what you're doing with it. Things like the IUTF8 hack are perhaps a step in the right direction, but multibyte encodings are simply not an easy thing to deal with for terminals, especially when you consider the many control sequences, termios(4) support, and such to contend with.
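
One way to see why other encodings are harder: in Shift JIS, the second byte of a two-byte character can fall in the ASCII range, so scanning backward from the end of the buffer cannot tell where the last character starts. The Python sketch below shows this with the katakana 'ソ' and then erases by decoding the whole line from the start. It assumes the eraser already knows the encoding, which is precisely what the terminal does not know; `erase_char` is a hypothetical helper for illustration only.

```python
def erase_char(buf: bytes, encoding: str) -> bytes:
    """Drop the last character by decoding the whole line first."""
    text = buf.decode(encoding)
    return text[:-1].encode(encoding)


sjis = "aソ".encode("shift_jis")          # bytes 61 83 5c

# The trail byte 0x5c is in the ASCII range (it is the backslash code),
# so looking at the last byte alone suggests a one-byte character.
trail_looks_ascii = sjis[-1] < 0x80

# A byte-oriented erase leaves a dangling lead byte 0x83.
naive = sjis[:-1]

# Decoding from the start of the line finds the real boundary.
correct = erase_char(sjis, "shift_jis")
```

Decoding the full line on every erase is simple but needs the encoding as an input, so it only moves the problem: something still has to tell the line buffer which encoding is in use.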

It's not that it can't be done; it's just a daunting amount of work that very few people, if any, have put in. xterm currently uses luit(1) for transforming characters from UTF-8 to the native encoding and vice versa, but it still isn't a perfect solution.
 