The computer keeps crashing during the compilation process of chromium-123.0.6312.58_1, but never crashed when I was building version 123.0.6312.58!

Otherwise it becomes a game of numbers, trying to figure out at what point thermal throttling of the CPU will happen.
Every time I typed sysctl -a | grep cpu.*.freq: while I was compiling those ports, I kept seeing the same number: 3700. Does it mean there was no thermal throttling going on?

Was there any subsequent kernel panic and if so, can you provide details? Thanks.
There were no more panics after I disabled Turbo Boost and "Overclock TVB" in the UEFI. I don't know whether that means I resolved the issue or I'm just lucky not to have experienced another panic yet.
 
Every time I typed sysctl -a | grep cpu.*.freq: while I was compiling those ports, I kept seeing the same number: 3700. Does it mean there was no thermal throttling going on?
Thermal throttling is not the same thing as setting a specific frequency for the CPU to operate at.

Why would changing the number from 3700 to 3701 cause such a huge increase in temperature?
3701 means Turbo Boost is enabled (and the actual working frequency is quite a bit higher).
That's how thermal throttling works: if the CPU gets too hot, the cores are told to slow down. I don't think FreeBSD has controls for that implemented... you'll need to get into the BIOS for details.
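If you want to see this from FreeBSD rather than the BIOS, here is a small sketch (assuming the cpufreq driver exposes the usual dev.cpu sysctls on your box):
Code:
# current frequency setting plus the levels the driver offers;
# a topmost level such as "3701/..." is typically how the turbo range shows up
$ sysctl dev.cpu.0.freq dev.cpu.0.freq_levels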
 
Thermal throttling is not the same thing as setting a specific frequency for the CPU to operate at.

That's how thermal throttling works: if the CPU gets too hot, the cores are told to slow down. I don't think FreeBSD has controls for that implemented... you'll need to get into the BIOS for details.

So, the thermal throttling mechanism might reduce the clock frequency behind the scenes, even when sysctl -a | grep cpu.*.freq: keeps outputting the same number?
 
That's how thermal throttling works: if the CPU gets too hot, the cores are told to slow down. I don't think FreeBSD has controls for that implemented... you'll need to get into the BIOS for details.
Code:
$ sysctl dev.cpu | grep thrott
dev.cpu.3.coretemp.throttle_log: 0
dev.cpu.2.coretemp.throttle_log: 0
dev.cpu.1.coretemp.throttle_log: 0
dev.cpu.0.coretemp.throttle_log: 0

When Tjmax is exceeded, this becomes 1 for the respective core.

I don't know to what extent FreeBSD is involved here; one would need to look into the code. I just tested it once to see that it works.
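On the same theme, a hedged sketch of the related read-outs coretemp(4) provides (assuming the module is loaded; exact node names can vary by driver):
Code:
# load the driver if needed; it provides the dev.cpu.*.temperature and coretemp nodes
# kldload coretemp
$ sysctl dev.cpu.0.temperature dev.cpu.0.coretemp.tjmax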
 
So, the thermal throttling mechanism might reduce the clock frequency behind the scenes, even when sysctl -a | grep cpu.*.freq: keeps outputting the same number?
Pretty much. When you enable Turbo Boost (by using sysctl to go from 3700 to 3701), thermal throttling gets turned off and you're at the mercy of the cooler you have installed. You can use the sysctl to adjust the CPU's operating frequency, true, but 3700 seems to be the hard limit; beyond that it stops making sense to tell the cores to run slower and cooler. That's where CPU coolers start being really important. If you go back down to 3700 or even 3699, thermal throttling will kick back in.
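For reference, a minimal sketch of that sysctl adjustment (assuming no powerd/powerd++ is running to override it; needs root):
Code:
# pin the frequency setting to 3700 MHz, i.e. stay out of the turbo range
$ sysctl dev.cpu.0.freq=3700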
 
Okay, with the help of mathematics, can we deduce whether some thermal throttling was involved if I finished compiling chromium after 3 hours and 40 minutes? The clock frequency was set to 3.7 GHz; powerd++, ccache, and Turbo Boost were disabled; 20 make jobs were enabled in the settings.
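For context, if those 20 make jobs were configured through the ports framework, the setting may have looked something like this in /etc/make.conf (just a guess at how it was set):
Code:
# /etc/make.conf -- one possible way the job count was configured
MAKE_JOBS_NUMBER=20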
 
Okay, with the help of mathematics, can we deduce whether some thermal throttling was involved if I finished compiling chromium after 3 hours and 40 minutes? The clock frequency was set to 3.7 GHz; powerd++, ccache, and Turbo Boost were disabled; 20 make jobs were enabled in the settings.
You'll have to figure out how much time you'll save (specifically on the chromium compilation task, with all other settings the same) if you set the CPU to 4 GHz vs., say, 3.8 GHz.

For the sake of a simple example (most likely with incorrect actual values), let's assume that it will take you 2.5 hours to compile chromium at 4 GHz, but 2 hours and 45 minutes at 3.8 GHz. Yeah, it takes 15 minutes longer. How much cooler are the cores when they run slower (at 3.8 GHz, as opposed to 4 GHz)?

After those measurements are taken, you can plug in the numbers to figure out at what speed the processor would have to run so that the compilation finishes in 3 hours and 40 minutes. Let that be value a. This is the point at which you should take the temperature of the cores; that will be value b.

Now pretend there's a linear correlation between the processor's actual operating speed and its temperature. Say, 80 °C when it's running at 4 GHz, 72 °C when running at 3.8 GHz, and b when running at speed a. (There's a reason I pick data points in the realm of Turbo Boost values.) Basically, the faster the processor runs, the hotter it runs.

But... with the (assumed) positive linear correlation between temperatures and running speeds, if our data point (a, b) falls below the un-throttled line, that means thermal throttling is happening. Above the line: no thermal throttling.
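A rough sketch of that check as a shell calculation. All numbers here are made-up placeholders: f1/t1 and f2/t2 are the two measured (frequency, temperature) points, and a/b is the point being tested:
Code:
#!/bin/sh
f1=4.0 ; t1=80       # measured: core temperature at 4.0 GHz
f2=3.8 ; t2=72       # measured: core temperature at 3.8 GHz
a=3.9  ; b=74        # deduced speed and the temperature measured at it

# temperature the un-throttled line predicts at speed a
expected=$(echo "$t1 + ($a - $f1) * ($t2 - $t1) / ($f2 - $f1)" | bc -l)
echo "expected temperature on the un-throttled line: $expected C"

if [ "$(echo "$b < $expected" | bc -l)" -eq 1 ]; then
    echo "point (a, b) is below the line -> throttling suspected"
else
    echo "point (a, b) is on or above the line -> no throttling"
fi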
 
Now pretend there's a linear correlation between the processor's actual operating speed and its temperature.
There isn't. There is a runaway correlation. That means that at a certain temperature the core will get hotter just because it is hot, without any increase in speed.
You can see that with sysutils/pcm, which shows the actual energy consumed; that increases with temperature at a constant frequency.

That's why some people try to cool with liquid air to run the cores faster.
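In case it helps, installing and running it is roughly this (assuming the port installs a pcm binary and that it reads the MSRs via cpuctl(4); run as root):
Code:
# pkg install pcm
# kldload cpuctl        # for MSR access, if not already loaded
# pcm                   # sample the counters, including package energy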
 
There isn't. There is a runaway correlation. That means that at a certain temperature the core will get hotter just because it is hot, without any increase in speed.
You can see that with sysutils/pcm, which shows the actual energy consumed; that increases with temperature at a constant frequency.

That's why some people try to cool with liquid air to run the cores faster.
I have a quad-core processor that runs at 4.7 GHz and is rated at 65 W (Ryzen 5 1400), and an air cooler (OEM AMD Wraith) that is rated the same. It worked with no problem for me. It would take a LONG time to compile Chromium though, like 24 hours easy, with 32 GB of RAM... 😅
 
I have no experience with AMD chips. I tuned my Xeon, measured the airflow with thermocouples, and put failsafes in place, so that I can go travelling and leave it running unattended.
 
Oh, wow, look at what I found after almost 12 hours of testing the memory. As you can see, there are 2 errors. But I don't understand something: on the one hand, it says there have been 4 passes in total, but on the other hand, it seems to imply that one error was found after one of those memory locations previously had 2 passes, and the other error was found after one of those memory locations previously had 3 passes:

 
Oh, wow, look at what I found after almost 12 hours of testing the memory. As you can see, there are 2 errors. But I don't understand something: on the one hand, it says there have been 4 passes in total, but on the other hand, it seems to imply that one error was found after one of those memory locations previously had 2 passes, and the other error was found after one of those memory locations previously had 3 passes:

Maybe try posting a smaller image? Even setting the phone to its lowest quality will still be plenty clear enough to convey the info in this case... This one requires a 75" 8K QHD TV to see properly. ;)
 
I edited my post... Please explain to me what I don't understand. Maybe memtest86+ uses two different meanings of the word "pass"? One meaning as opposed to "fail", and the other as in completing a run? Because otherwise I don't understand the apparent contradiction: 4 "passes" in total and yet two errors?
 
[attached screenshot of the memtest86+ results: 1711820140192.png]
 
I am begging you, guys, please explain to me what it means: "Pass: 4" below "Time:", but Pass 2 for the first error and Pass 3 for the second error.
 
It's very confusing. Does it mean that one error was detected during Pass 2, but not during Pass 3 and 4, and the other error was detected during Pass 3, but not during Pass 4?
 
Pass is the number of passes (rounds) that have run; one pass includes all the tests selected.
Pass 2, Test 6: one error detected.
Pass 3, Test 6: one error detected.
Pass 4 is still ongoing in the picture you posted; hard to say whether it will detect any errors.
 
I am begging you, guys, please explain to me what it means: "Pass: 4" below "Time:", but Pass 2 for the first error and Pass 3 for the second error.
The screen shows the total time the test has been running so far, which is 11 hours, 46 minutes and 7 seconds. You're currently on pass 4, which is still running (passes 1, 2, and 3 have completed). Pass 1 discovered no errors. Pass 2 discovered an error at one memory address. Pass 3 discovered an error at a different address.

This is less complicated than getting a mathematical handle on thermal throttling, but I'm beginning to wonder why you want to compile www/chromium at all. Nothing wrong with it, but I'm just curious - are you trying to turn on Makefile options that are off by default, trying to track down different possible hardware failures that can happen during a long compilation process, trying to decide if it makes sense to switch to SSDs, or something else? There's probably a reason for sinking so much time into this, and probably better ways to accomplish the end goal that you're shooting for. :)
 
Well, the test stresses the memory bus, and the memory heats up. That makes certain values drift (like the dielectric constant)... There is a reason high-performance VLSI design is called "black magic".

But I also would like to know what is the reason you are using an off the shelf system with bling for a task it is not meant to do.
 
Well, the test stresses the memory bus, and the memory heats up. That makes certain values drift (like the dielectric constant)... There is a reason high-performance VLSI design is called "black magic".

If there is a high probability that certain values would drift, how can we even trust that tools such as memtest86+ will find real errors? It's just so confusing that the error that was found while Pass 2 was running wasn't found again. Instead, a different error was found while Pass 3 was running.

My goal is to simply figure out what caused crashes while I was compiling stuff.
 
Oh, these are real errors. When you look at the level semiconductors are working at today, you enter the world of quantum mechanics and probability. The chips are (presumably, in your case) running in a state that takes the chance of a flipped bit from one in some years (or so; no idea where they are now) to one every hour.

Heck, avionics systems need to be rad-hardened because of cosmic rays at that altitude. Intel found out the hard way that certain ceramics are simply too radioactive to use as a chip case; the radiation would flip bits like crazy. IBM has manufacturing processes that cannot be rolled out at scale because the probability of an electron tunneling in the wrong place is so high at that scale that the CPU would be messed up several times a second. For all normal folks (those without a background in VLSI) this IS black magic. For us, it's advanced voodoo. You are running technology that is at the edge of what is possible. And you run it outside its design parameters. No wonder shit happens; I only wonder that it does not happen more often.
 
My goal is to simply figure out what caused crashes while I was compiling stuff.
Usually crashes are caused by old or inadequate hardware. An easy way to verify that is to find out what hardware was used for successful compilations and compare your setup against it. For example, if a successful compilation requires a Threadripper or a Xeon setup that costs in excess of $10k USD, then it's easier to accept that the home hardware is just not up to the task and download a precompiled copy of Chromium.

Linux is developed on a Threadripper setup, BTW. Not cheap to reproduce. Torvalds and his team would have a keen interest in figuring out WTF even happened if a monster setup like a Threadripper crashed during compilation. And even these guys would probably look at their makefile options, at the code, and maybe move to some newer hardware to help with compilation, rather than analyze existing hardware to death.
 
You are running technology that is at the edge of what is possible.
It's just a desktop computer. I haven't experienced crashes since I disabled the possibility of the clock frequency going above 3.7 GHz. But I am still not entirely sure I found the cause of these crashes. Can I trust that memtest86+ found an actual memory error?
 
The refresh requirements of volatile memory (as a standard metric, about once every 64 milliseconds) intensify the risk. “As you raise the temperature over about 85°C, you need to refresh that charge on the capacitors more often,” said Greenberg. “So, you’ll start moving to a more frequent refresh cycle to account for the fact that the charge is leaking out of those capacitors faster because the device is getting hotter. Unfortunately, the operation of refreshing that charge also is a current-intensive operation, which generates heat inside the DRAM. The hotter it gets, the more you have to refresh it, but then you’re going to continue to make it hotter, and the whole thing kind of falls apart.”
Not saying this is the cause of your problem, Oleg_NYC, but it's something to be aware of. I don't know if FreeBSD has any support for measuring the temperature of other components such as DRAM. But if you're having random crashes, one thing to try is to reduce the CPU frequency and see if the problem goes away.
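If you'd rather do that from the OS than the BIOS, one hedged example is capping the maximum frequency with powerd(8) (powerd++ has a comparable option; check its man page):
Code:
# /etc/rc.conf -- keep the frequency governor below the turbo range
powerd_enable="YES"
powerd_flags="-M 3700"    # maximum frequency in MHz, same units as dev.cpu.0.freq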
 
It's just a desktop computer. I haven't experienced crashes since I disabled the possibility of the clock frequency going above 3.7 GHz. But I am still not entirely sure I found the cause of these crashes. Can I trust that memtest86+ found an actual memory error?
I think you can safely ignore those RAM errors - unless you enjoy hunting them down.
 