How to install UTF-8 on FreeBSD

bugzeo · Jul 31, 2021

This is my Virtualbox guest FreeBSD, browser Mozilla Firefox:

This is my Windows 10, browser Firefox or Chrome:

How can I properly install UTF8 on my FreeBSD?

scottro · Jul 31, 2021

I think you need proper fonts for whatever language you're trying to use. For example, on occasion, I need Japanese. If I just install fcitx-mozc, then try to type Japanese, I'll get characters similar to the ones you just showed. However, if I install some Japanese fonts, everything is fine. The question may be how to get Cyrillic or Asian characters to work, rather than UTF-8.

bugzeo · Jul 31, 2021

Isn't there a port that covers everything, emojis included?

memreflect · Aug 1, 2021

There is a "meta-port" (port that depends on other ports) named x11-fonts/noto that installs a bunch of fonts. However, I must warn you that there's a lot of disk space required, which also means a large download size (~2 GiB/Go on my system). You will definitely want to explore the x11-fonts category if you need something easier on your bandwidth (and easier on your CPU and HDD/SSD if you compile from ports):

x11-fonts/noto-basic - Basic: Latin, Greek, Cyrillic, mathematical symbols, probably some others
x11-fonts/noto-emoji - Emoji
x11-fonts/noto-extra - more type faces for the fonts in the "Basic" set, as well as Arabic, Bengali, Devanagari, Hebrew, Lao, Telugu, Thai, and more
x11-fonts/noto-hk - Chinese (Traditional - Hong Kong)
x11-fonts/noto-jp - Japanese
x11-fonts/noto-kr - Korean
x11-fonts/noto-sc - Chinese (Simplified)
x11-fonts/noto-tc - Chinese (Traditional)

Of course, you can simply install the ones you need rather than all of them. While I only need the "Basic" and "Emoji" fonts, those boxes are definitely annoying enough for me to have the others installed in case I need them.
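To decide which of the ports listed above a given piece of text probably needs, a rough sketch follows. It guesses the script from the first word of each character's Unicode name and looks it up in a small table written by hand from the descriptions in this post. The table, the function names `guess_port` and `ports_for`, and the name-prefix trick are all assumptions made for illustration; nothing here queries the actual font files, and Han ideographs are left unmapped on purpose because several of the CJK ports could supply them.

```python
import unicodedata

# Hand-written, incomplete table derived from the port descriptions above.
# Keys are the first word of the Unicode character name (a heuristic only).
SCRIPT_TO_PORT = {
    "LATIN": "x11-fonts/noto-basic",
    "GREEK": "x11-fonts/noto-basic",
    "CYRILLIC": "x11-fonts/noto-basic",
    "ARABIC": "x11-fonts/noto-extra",
    "BENGALI": "x11-fonts/noto-extra",
    "DEVANAGARI": "x11-fonts/noto-extra",
    "HEBREW": "x11-fonts/noto-extra",
    "LAO": "x11-fonts/noto-extra",
    "TELUGU": "x11-fonts/noto-extra",
    "THAI": "x11-fonts/noto-extra",
    "HIRAGANA": "x11-fonts/noto-jp",
    "KATAKANA": "x11-fonts/noto-jp",
    "HANGUL": "x11-fonts/noto-kr",
}


def guess_port(ch):
    """Return a likely port for one character, or None if the table has no entry."""
    name = unicodedata.name(ch, "")
    first_word = name.split(" ")[0]
    return SCRIPT_TO_PORT.get(first_word)


def ports_for(text):
    """Return the sorted set of ports guessed for every character in text."""
    return sorted({port for port in map(guess_port, text) if port})


if __name__ == "__main__":
    for sample in ("Привет", "ह", "한", "あ"):
        print(sample, ports_for(sample))
```

Characters whose names do not start with a script word (digits, punctuation, currency signs, Han ideographs) simply return None, so treat an empty result as "check by hand", not as "no font needed".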

ralphbsz · Aug 1, 2021

The term "all" you used in your question is time dependent. I just looked it up: the most recent update to Unicode, in 2020 (about a year ago), added 5,930 characters, for a total of 143,859. In addition, the emoji part of the fonts has far more possible rendering combinations (you can make more emojis by combining Unicode characters, tens of thousands more). FreeBSD relies on volunteers, and I don't know exactly where the fonts come from, but you have to expect that it will lag behind the major platforms.

My favorite new emoji is accordion.
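As a small illustration of "making more emojis by combining Unicode characters", here is a sketch that takes one well-known sequence apart. The choice of sequence (woman, zero width joiner, personal computer) is mine, not something from this thread; a font that knows the sequence draws it as one glyph, while a font that does not will show the parts separately or as boxes.

```python
import unicodedata

# WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER.
# An emoji font that supports this sequence renders a single glyph.
sequence = "\U0001F469\u200D\U0001F4BB"

# List the individual code points that make up the sequence.
for ch in sequence:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")

code_points = len(sequence)
utf8_bytes = len(sequence.encode("utf-8"))
print(code_points, "code points,", utf8_bytes, "bytes in UTF-8")
```

UTF-8 encodes all of this without any trouble; whether you see one picture, two pictures, or boxes depends entirely on the installed emoji font.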

Vull · Aug 1, 2021

bugzeo said:
This is my Virtualbox guest FreeBSD, browser Mozilla Firefox:

This is my Windows 10, browser Firefox or Chrome:

How can I properly install UTF8 on my FreeBSD?

This is my Firefox browser on 13.0-RELEASE-p3, after installing x11-fonts/noto, at memreflect's suggestion.

Criosphinx · Aug 1, 2021

I don't know about Noto, but in my case, for Japanese characters, I installed japanese/font-std/, which is less than 60 MB.

Are those symbols Chinese or Korean characters?

chinese/font-std/ is in the chinese ports category, and there are several fonts in the korean category as well.

Vull · Aug 1, 2021

Criosphinx said:
...
Are those symbols Chinese or Korean characters?

I don't know. They're just samples of UTF-8 characters I copy/pasted from Wikipedia several months ago.

bugzeo · Aug 1, 2021

ralphbsz said:
The term "all" you used in your question is time dependent. I just looked it up: the most recent update to Unicode, in 2020 (about a year ago), added 5,930 characters, for a total of 143,859. In addition, the emoji part of the fonts has far more possible rendering combinations (you can make more emojis by combining Unicode characters, tens of thousands more). FreeBSD relies on volunteers, and I don't know exactly where the fonts come from, but you have to expect that it will lag behind the major platforms.

My favorite new emoji is accordion.

I know it's made by volunteers, thank you.

memreflect · Aug 1, 2021

Vull said:
I don't know. They're just samples of UTF-8 characters I copy/pasted from Wikipedia several months ago.

Thanks for mentioning where you got those characters. Here's what I get when I inspect the fonts used by Firefox:

$ <U+0024>, ¢ <U+00A2>, and € <U+20AC> all use the Noto Sans Regular font in my case, which comes with the "Basic" package.
ह <U+0939> uses the Noto Sans Devanagari Regular font, which comes with the "Extra" package.
𐍈 <U+10348> uses the Noto Sans Gothic Regular font, which also comes with the "Extra" package.
한 <U+D55C> uses the Noto Sans CJK SC font, which comes with the "Chinese (Simplified)" package. Adding a lang="ko" attribute to the element to reflect Korean language will change the font to Noto Sans CJK KR font, which comes with the "Korean" package, but that's possibly because I manually set the fonts in Firefox to work that way. However, the glyph appears to be identical in both fonts, so it doesn't matter in this case.

Note that without correct language information, the correct glyph may not display because the web browser loads the wrong font, usually because there isn't any language information conveyed by the web page. For example, below is a sample screenshot from a Japanese web site. Initially, the web page did not have any language information, so the font that was used was Noto Sans CJK SC. Once I added the lang="ja" attribute to the <html> tag using Firefox's DOM Inspector, the glyph changed because the Noto Sans CJK JP font was used. This may or may not work for you because I manually assigned the locale-specific Noto CJK fonts to the corresponding languages.
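For anyone wondering how these code points relate to UTF-8 itself, here is a short sketch that prints the UTF-8 byte sequence for each of the six sample characters discussed above. The helper name `utf8_hex` is made up for this example. The point it illustrates is that the encoding works regardless of fonts: the bytes are the same on every system, and only the glyph lookup depends on what is installed.

```python
# Code points of the sample characters: $, ¢, €, ह, 𐍈, 한
samples = [0x0024, 0x00A2, 0x20AC, 0x0939, 0x10348, 0xD55C]


def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))


for cp in samples:
    ch = chr(cp)
    encoded = ch.encode("utf-8")
    print(f"U+{cp:04X}  {len(encoded)} byte(s)  {utf8_hex(ch)}")
```

The byte count grows from one to four as the code point value increases, which is why the ASCII dollar sign never causes trouble while the other five can.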

Vull · Aug 1, 2021

memreflect said:
Thanks for mentioning where you got those characters. Here's what I get when I inspect the fonts used by Firefox:

$ <U+0024>, ¢ <U+00A2>, and € <U+20AC> all use the Noto Sans Regular font in my case, which comes with the "Basic" package.

ह <U+0939> uses the Noto Sans Devanagari Regular font, which comes with the "Extra" package.

𐍈 <U+10348> uses the Noto Sans Gothic Regular font, which also comes with the "Extra" package.

한 <U+D55C> uses the Noto Sans CJK SC font, which comes with the "Chinese (Simplified)" package. Adding a lang="ko" attribute to the element to reflect Korean language will change the font to Noto Sans CJK KR font, which comes with the "Korean" package, but that's possibly because I manually set the fonts in Firefox to work that way. However, the glyph appears to be identical in both fonts, so it doesn't matter in this case.

Note that without correct language information, the correct glyph may not display because the web browser loads the wrong font, usually because there isn't any language information conveyed by the web page. For example, below is a sample screenshot from a Japanese web site. Initially, the web page did not have any language information, so the font that was used was Noto Sans CJK SC. Once I added the lang="ja" attribute to the <html> tag using Firefox's DOM Inspector, the glyph changed because the Noto Sans CJK JP font was used. This may or may not work for you because I manually assigned the locale-specific Noto CJK fonts to the corresponding languages.
View attachment 10827

I use

Code:

<html lang="en-US">

and also throw in

Code:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Don't know if this meta tag is still necessary since HTML5, but it was needed 10 years ago, and it doesn't seem to hurt anything.

Characters like "¢" <U+00A2> and "€" <U+20AC> are non-ASCII UTF-8 characters that do come up often in English-speaking contexts.

bugzeo · Aug 1, 2021

Vull said:
This is my Firefox browser on 13.0-RELEASE-p3, after installing x11-fonts/noto, at memreflect's suggestion.

View attachment 10821

Working for me as well.

Erichans · Aug 1, 2021

Vull said:
Code:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Don't know if this meta tag is still necessary since HTML5, but it was needed 10 years ago, and it doesn't seem to hurt anything.

The meta element with its charset attribute is necessary.

Although encodings other than UTF-8 are allowed for HTML, they are (strongly) discouraged; the following applies to UTF-8. Specifying the character encoding only in the HTTP header is possible, but the preferred way is to use the meta element to specify the encoding of the HTML document (within the first 1024 bytes of the file; therefore, put it immediately after the opening tag of the head element): How should I declare the encoding of my HTML file?

When no encoding specifications are present* you can run into trouble, specifically with non-ASCII UTF-8 characters. The obvious problem is characters being displayed incorrectly on screen. At least as important are the problems that arise when the HTML file is processed by other software.

From the HTML 5 spec, § 4.2.5.4 Specifying the document's character encoding:

Note:
A character encoding declaration is required (either in the Content-Type metadata or explicitly in the file) even when all characters are in the ASCII range, because a character encoding is needed to process non-ASCII characters entered by the user in forms, in URLs generated by scripts, and so forth. [...]
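To check whether a file declares its encoding early enough, something like the following sketch can help. It is a deliberately simplified regular-expression check over the first 1024 bytes, not the prescan algorithm from the HTML spec, and the names `META_CHARSET` and `declared_charset` are invented for this example. It accepts both the short `<meta charset=...>` form and the older http-equiv form.

```python
import re

# Simplified pattern: "charset=<label>" somewhere inside a <meta ...> tag.
# The real browser prescan is more involved; this is only a quick check.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def declared_charset(html_bytes, limit=1024):
    """Return the lowercased charset label found in the first `limit` bytes, or None."""
    match = META_CHARSET.search(html_bytes[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    good = b'<!DOCTYPE html><html lang="en-US"><head><meta charset="utf-8"></head></html>'
    late = b"<!DOCTYPE html><html><head><!--" + b"x" * 2000 + b'--><meta charset="utf-8"></head></html>'
    print(declared_charset(good))
    print(declared_charset(late))
```

The second sample has a valid declaration, but it sits behind a long comment and therefore falls outside the 1024-byte window, so the check reports nothing for it.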

____
Edit:
* An encoding specification in the HTTP header is external to the HTML file, obviously. Within the HTML file itself, the encoding can, apart from the charset attribute of the meta element, be specified by a Byte Order Mark (BOM) at the start of the file; see also: The byte-order mark (BOM) in HTML. Whether the BOM counts as an encoding specification is determined by the BOM sniff algorithm. A BOM is not the preferred mechanism for specifying the encoding of UTF-8 encoded HTML files.
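For reference, BOM sniffing boils down to comparing the first two or three bytes of the file against three known signatures. The sketch below follows that idea using the constants from Python's standard `codecs` module; the function name `sniff_bom` is an assumption for this example, and the code is an illustration of the check, not a copy of any browser's implementation.

```python
import codecs


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is no BOM."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16le"
    return None


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")
    without_bom = "<!DOCTYPE html>".encode("utf-8")
    print(sniff_bom(with_bom))
    print(sniff_bom(without_bom))
```

A file without a BOM returns None here, which is the normal and preferred case for UTF-8 HTML; the meta element or the HTTP header then carries the declaration.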