How to install UTF-8 on FreeBSD

bugzeo

Member

Reaction score: 18
Messages: 95

This is my Virtualbox guest FreeBSD, browser Mozilla Firefox:

X36URAc.png


This is my Windows 10, browser Firefox or Chrome:

E3myIKa.png



How can I properly install UTF8 on my FreeBSD?
 

scottro

Daemon

Reaction score: 866
Messages: 2,019

I think you need proper fonts for whatever language you're trying to use. For example, on occasion, I need Japanese. If I just install fcitx-mozc, then try to type Japanese, I'll get characters similar to the ones you just showed. However, if I install some Japanese fonts, everything is fine. The question may be how to get Cyrillic or Asian characters to work, rather than UTF-8.
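One way to check whether a missing font is the cause is to ask fontconfig which installed fonts cover a given character. The following is only a minimal sketch, assuming the fontconfig `fc-list` utility is installed and on the PATH; the helper names `fc_list_command` and `fonts_covering` are made up for illustration.

```python
import shutil
import subprocess


def fc_list_command(ch):
    """Build an fc-list query for fonts whose charset contains ch."""
    # fontconfig patterns accept a charset property given as a hex code point.
    return ["fc-list", f":charset={ord(ch):x}", "family"]


def fonts_covering(ch):
    """Return sorted family names covering ch, or [] if fc-list is missing."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        fc_list_command(ch),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


if __name__ == "__main__":
    # Euro sign, Hiragana A, Hangul syllable HAN
    for ch in ["\u20AC", "\u3042", "\uD55C"]:
        families = fonts_covering(ch)
        print(f"U+{ord(ch):04X}: {len(families)} font families found")
```

If a character reports zero families, no installed font has a glyph for it, which would explain the boxes regardless of the encoding.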
 

memreflect

Well-Known Member

Reaction score: 219
Messages: 256

There is a "meta-port" (a port that depends on other ports) named x11-fonts/noto that installs a bunch of fonts. However, I must warn you that it requires a lot of disk space, which also means a large download (~2 GiB on my system). You will definitely want to explore the x11-fonts category if you need something easier on your bandwidth (and easier on your CPU and HDD/SSD if you compile from ports).

Of course, you can simply install the ones you need rather than all of them. While I only need the "Basic" and "Emoji" fonts, those boxes are definitely annoying enough for me to have the others installed in case I need them.
 

ralphbsz

Son of Beastie

Reaction score: 2,299
Messages: 3,207

The term "all" you used in your question is time dependent. I just looked it up: the most recent update to Unicode happened in 2020 (about a year ago) and added 5,930 characters, for a total of 143,859. In addition, the emoji part of the fonts has far more possible rendering combinations (you can make more emojis by combining Unicode characters, tens of thousands more). FreeBSD relies on volunteers, and I don't know exactly where the fonts come from, but you have to expect that it will lag behind the major platforms.
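To illustrate the point about combining characters, here is a small sketch (not tied to any particular font) that builds one common emoji sequence from three emoji joined by ZERO WIDTH JOINER. Whether it renders as a single glyph or as three separate ones depends entirely on the installed emoji font; the variable names are illustrative only.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Man + ZWJ + Woman + ZWJ + Girl: one "family" emoji where the font supports it.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467"])

if __name__ == "__main__":
    print("Unicode data version in this Python:", unicodedata.unidata_version)
    for ch in family:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print("code points:", len(family))
    print("UTF-8 bytes:", len(family.encode("utf-8")))
```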

My favorite new emoji is accordion.
 

Criosphinx

Active Member

Reaction score: 50
Messages: 107

I don't know about noto, but in my case, for Japanese characters, I installed japanese/font-std/, which is less than 60 MB.

Are those symbols Chinese or Korean characters?

chinese/font-std/ is in the chinese ports category, and there are several fonts in the korean category as well.
 

bugzeo

Member

Reaction score: 18
Messages: 95

:) I know it's made by volunteers, thank you.
 

memreflect

Well-Known Member

Reaction score: 219
Messages: 256

I don't know. They're just samples of UTF-8 characters I copy/pasted from Wikipedia several months ago.
Thanks for mentioning where you got those characters. Here's what I get when I inspect the fonts used by Firefox:
  • $ <U+0024>, ¢ <U+00A2>, and € <U+20AC> all use the Noto Sans Regular font in my case, which comes with the "Basic" package.
  • ह <U+0939> uses the Noto Sans Devanagari Regular font, which comes with the "Extra" package.
  • 𐍈 <U+10348> uses the Noto Sans Gothic Regular font, which also comes with the "Extra" package.
  • 한 <U+D55C> uses the Noto Sans CJK SC font, which comes with the "Chinese (Simplified)" package. Adding a lang="ko" attribute to the element to reflect Korean language will change the font to Noto Sans CJK KR font, which comes with the "Korean" package, but that's possibly because I manually set the fonts in Firefox to work that way. However, the glyph appears to be identical in both fonts, so it doesn't matter in this case.
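
For reference, the sample characters listed above cover every UTF-8 sequence length from one to four bytes. The sketch below simply prints each code point with its UTF-8 bytes; the helper names `utf8_hex` and `describe` are illustrative.

```python
# Sample characters from the list above, written as escapes so the file
# itself stays ASCII-only.
SAMPLES = ["\u0024", "\u00A2", "\u20AC", "\u0939", "\uD55C", "\U00010348"]


def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))


def describe(ch):
    """Return (code point label, UTF-8 bytes in hex, byte count)."""
    return f"U+{ord(ch):04X}", utf8_hex(ch), len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        label, hex_bytes, size = describe(ch)
        print(f"{label:<8} {hex_bytes:<12} {size} byte(s)")
```

All of these are valid UTF-8 whether or not a glyph is shown, so a box on screen points to a missing font rather than a broken encoding.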

Note that without correct language information, the correct glyph may not display because the web browser loads the wrong font, usually because there isn't any language information conveyed by the web page. For example, below is a sample screenshot from a Japanese web site. Initially, the web page did not have any language information, so the font that was used was Noto Sans CJK SC. Once I added the lang="ja" attribute to the <html> tag using Firefox's DOM Inspector, the glyph changed because the Noto Sans CJK JP font was used. This may or may not work for you because I manually assigned the locale-specific Noto CJK fonts to the corresponding languages.
font_diff.png
 

Vull

Aspiring Daemon

Reaction score: 363
Messages: 636

I use
Code:
<html lang="en-US">
and also throw in
Code:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Don't know if this meta tag is still necessary since HTML5, but it was needed 10 years ago, and it doesn't seem to hurt anything.

Characters like "¢" <U+00A2> and "€" <U+20AC> are non-ASCII UTF-8 characters which do come up often in English-speaking contexts.
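
As a rough way to check what a page actually declares, the following sketch uses Python's standard html.parser module to read the lang attribute of the html element and any charset given in a meta element. The class name `DeclarationFinder` and the sample `PAGE` are made up for illustration, and this is not how a browser determines the encoding.

```python
from html.parser import HTMLParser


class DeclarationFinder(HTMLParser):
    """Collect the lang attribute of <html> and any declared charset."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.lang = attrs.get("lang")
        elif tag == "meta":
            if "charset" in attrs:
                # Short form: <meta charset="utf-8">
                self.charset = attrs["charset"]
            elif (attrs.get("http-equiv") or "").lower() == "content-type":
                # Long form: content="text/html; charset=UTF-8"
                for part in (attrs.get("content") or "").split(";"):
                    name, _, value = part.strip().partition("=")
                    if name.lower() == "charset":
                        self.charset = value.strip()


PAGE = (
    '<html lang="en-US"><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
    "</head><body>sample</body></html>"
)

finder = DeclarationFinder()
finder.feed(PAGE)

if __name__ == "__main__":
    print("lang:", finder.lang)
    print("charset:", finder.charset)
```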
 

Erichans

Member

Reaction score: 19
Messages: 28


A character encoding declaration is necessary, and the meta element with its charset attribute is the preferred way to give one.

Although encodings other than UTF-8 are allowed for HTML, they are (strongly) discouraged; the following applies to UTF-8. Specifying the character encoding only in the HTTP header is possible, but the preferred way is to use the meta element to specify the encoding of the HTML document. It has to appear within the first 1024 bytes of the file, so put it immediately after the opening tag of the head element. See: How should I declare the encoding of my HTML file?

When no encoding specification is present*, you can run into trouble, specifically with non-ASCII UTF-8 characters. Characters displayed incorrectly on screen are the obvious problem. At least as important are the problems that arise when the HTML file is processed by other software.

From the HTML 5 spec, § 4.2.5.4 Specifying the document's character encoding:
Note:
A character encoding declaration is required (either in the Content-Type metadata or explicitly in the file) even when all characters are in the ASCII range, because a character encoding is needed to process non-ASCII characters entered by the user in forms, in URLs generated by scripts, and so forth. [...]

____
Edit:
* An encoding specification in the HTTP header is external to the HTML file, obviously. Inside the HTML file itself, the encoding can be specified not only by the charset attribute of the meta element but also by a Byte Order Mark (BOM) at the start of the file; see also: The byte-order mark (BOM) in HTML. Whether the BOM is treated as an encoding specification is determined by the BOM sniffing algorithm. A BOM is not the preferred mechanism for specifying the encoding of UTF-8 encoded HTML files.
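
The two in-file mechanisms mentioned here can be checked with a short script. This is a deliberately simplified sketch, not the prescan algorithm from the HTML spec: it only tests whether the raw bytes start with a UTF-8 BOM and whether some meta tag mentioning charset begins within the first 1024 bytes. The function name `check_declaration` and the sample documents are hypothetical.

```python
import codecs
import re

PRESCAN_LIMIT = 1024  # byte window mentioned in the post above

# Simplified pattern: any <meta ...> tag that mentions "charset".
META_CHARSET = re.compile(rb"<meta\b[^>]*\bcharset\b", re.IGNORECASE)


def check_declaration(raw):
    """Report which in-file encoding hints are present in raw HTML bytes."""
    return {
        "utf8_bom": raw.startswith(codecs.BOM_UTF8),
        "meta_in_window": META_CHARSET.search(raw[:PRESCAN_LIMIT]) is not None,
    }


DOC = b'<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>'
# Same document, but the meta element is pushed past the 1024-byte window.
LATE_DOC = b"<!-- " + b"x" * 2000 + b" -->" + DOC

if __name__ == "__main__":
    print(check_declaration(DOC))
    print(check_declaration(LATE_DOC))
    print(check_declaration(codecs.BOM_UTF8 + DOC))
```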
 