The computer keeps crashing during the compilation process of chromium-123.0.6312.58_1, but never crashed when I was building version 123.0.6312.58!

it printed PASS in big letters
The point people are trying to make that you are missing is that memory can PASS but still have issues and fail on a subsequent test run.

Hence the suggestion that you let it run overnight and make sure you consistently get PASS for several tests in a row.

But given you've got CPU cores hitting 100 deg. C I think you have found the problem.
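
To illustrate why a single PASS is weak evidence, here is a minimal sketch. It assumes an intermittent fault that any one pass catches with some fixed probability, and that passes are independent. Both are assumptions for illustration; the 20% figure is hypothetical and not a property of memtest86+.

```python
# Hypothetical illustration: p_detect is an assumed per-pass detection
# probability, not a measured property of memtest86+ or of any real fault.

def chance_all_passes_clean(p_detect: float, passes: int) -> float:
    """Probability that an intermittent fault goes unnoticed in every pass,
    assuming each pass detects it independently with probability p_detect."""
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must not be negative")
    return (1.0 - p_detect) ** passes


if __name__ == "__main__":
    for passes in (1, 4, 8, 16):
        missed = chance_all_passes_clean(0.2, passes)
        print(f"{passes:2d} passes: {missed:.1%} chance the fault is still undetected")
```

Under these assumptions, the chance of a false PASS shrinks with every additional pass, which is the reasoning behind running the test overnight.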
 
But given you've got CPU cores hitting 100 deg. C I think you have found the problem.
Didn't Crivens imply it's normal for CPUs to reach temperatures of 120 or even 140 degrees Celsius?.. Thirty minutes ago, I started running memtest86+ again.
 
100 °C on mere compilation? (Which is a much milder task than the aforementioned stress tests.) What's the cooler and is it actually attached?
 
The temperature readings changed "recently": the sensors are now placed closer in, inside the actual cores. Since then, higher temperatures are considered normal. I don't know what is considered overheating right now.
 
It wouldn't hurt to just check whether everything is physically in place. I also have no idea what would be the norm, but the computer obviously shouldn't crash. My 5600X stays around 70-73 °C, I think. However, it's a lower mid-range CPU and only moderately "hot".
 
After each crash, I reboot and immediately see values such as these:

Code:
dev.cpu.19.temperature: 33.0C
dev.cpu.17.temperature: 34.0C
dev.cpu.15.temperature: 32.0C
dev.cpu.13.temperature: 33.0C
dev.cpu.11.temperature: 34.0C
dev.cpu.9.temperature: 33.0C
dev.cpu.7.temperature: 33.0C
dev.cpu.5.temperature: 31.0C
dev.cpu.3.temperature: 34.0C
dev.cpu.1.temperature: 34.0C
dev.cpu.18.temperature: 32.0C
dev.cpu.16.temperature: 34.0C
dev.cpu.14.temperature: 31.0C
dev.cpu.12.temperature: 32.0C
dev.cpu.10.temperature: 34.0C
dev.cpu.8.temperature: 33.0C
dev.cpu.6.temperature: 33.0C
dev.cpu.4.temperature: 31.0C
dev.cpu.2.temperature: 33.0C
dev.cpu.0.temperature: 34.0C

So, it's normal for cores to cool off so quickly after compilation stops?
 
I still find mprime confusing. I chose #16 in the menu and the default answers to each question:
Code:
        16.  Options/Torture Test
        17.  Options/Benchmark
        18.  Help/About
        19.  Help/About PrimeNet Server
Your choice: 16

Number of cores to torture test (10):
Use hyperthreading (more stressful) (Y):
Choose a type of torture test to run.
  1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
  2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
  3 = Large FFTs (stresses memory controller and RAM).
  4 = Blend (tests all of the above).
Blend is the default.  NOTE: if you fail the blend test but pass the
smaller FFT tests then your problem is likely bad memory or bad memory
controller.
Type of torture test to run (4):
Customize settings (N):
Run a weaker torture test (not recommended) (N):

Accept the answers above? (Y):
It started running its test, but since I didn't know how to interpret this program's textual output, I stopped it a couple of minutes later. It created results.txt that contained this:
Code:
Self-test 384K passed!
Self-test 384K passed!
Self-test 384K passed!
While mprime was running, I saw this:
Code:
dev.cpu.19.temperature: 82.0C
dev.cpu.17.temperature: 82.0C
dev.cpu.15.temperature: 83.0C
dev.cpu.13.temperature: 99.0C
dev.cpu.11.temperature: 90.0C
dev.cpu.9.temperature: 98.0C
dev.cpu.7.temperature: 85.0C
dev.cpu.5.temperature: 99.0C
dev.cpu.3.temperature: 88.0C
dev.cpu.1.temperature: 97.0C
dev.cpu.18.temperature: 82.0C
dev.cpu.16.temperature: 82.0C
dev.cpu.14.temperature: 83.0C
dev.cpu.12.temperature: 99.0C
dev.cpu.10.temperature: 90.0C
dev.cpu.8.temperature: 98.0C
dev.cpu.6.temperature: 85.0C
dev.cpu.4.temperature: 100.0C
dev.cpu.2.temperature: 87.0C
dev.cpu.0.temperature: 97.0C

I don't know what any of this means and how to interpret this.
 
Hmmm, you use a newer version of mprime than I do. Let me try yours and get back to you.

A real test runs for a long time, e.g. overnight.

It is odd that only some of your cores are at max temp.
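
To see which cores run hot, the sysctl readings posted above can be grouped per physical core. This is only a sketch: it assumes adjacent logical CPUs (0/1, 2/3, ...) are the two hyperthreads of one core, which is an assumption about this machine's numbering. The 100 °C limit and the 10-degree margin are the figures discussed in this thread.

```python
import re
from collections import defaultdict

TJMAX_C = 100.0  # limit discussed in the thread; check the data sheet for your CPU

# Readings posted while mprime was running.
SAMPLE = """\
dev.cpu.19.temperature: 82.0C
dev.cpu.17.temperature: 82.0C
dev.cpu.15.temperature: 83.0C
dev.cpu.13.temperature: 99.0C
dev.cpu.11.temperature: 90.0C
dev.cpu.9.temperature: 98.0C
dev.cpu.7.temperature: 85.0C
dev.cpu.5.temperature: 99.0C
dev.cpu.3.temperature: 88.0C
dev.cpu.1.temperature: 97.0C
dev.cpu.18.temperature: 82.0C
dev.cpu.16.temperature: 82.0C
dev.cpu.14.temperature: 83.0C
dev.cpu.12.temperature: 99.0C
dev.cpu.10.temperature: 90.0C
dev.cpu.8.temperature: 98.0C
dev.cpu.6.temperature: 85.0C
dev.cpu.4.temperature: 100.0C
dev.cpu.2.temperature: 87.0C
dev.cpu.0.temperature: 97.0C
"""

LINE_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C$")


def parse_temperatures(text):
    """Map logical CPU number -> temperature in degrees Celsius."""
    temps = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def group_by_physical_core(temps, threads_per_core=2):
    """Hottest reading per physical core.
    Assumption: logical CPUs 2k and 2k+1 share physical core k."""
    cores = defaultdict(list)
    for cpu, temp in temps.items():
        cores[cpu // threads_per_core].append(temp)
    return {core: max(values) for core, values in sorted(cores.items())}


def hot_cores(core_temps, tjmax=TJMAX_C, margin=10.0):
    """Physical cores whose reading is closer than `margin` degrees to tjmax."""
    return [core for core, temp in core_temps.items() if temp > tjmax - margin]


if __name__ == "__main__":
    per_core = group_by_physical_core(parse_temperatures(SAMPLE))
    for core, temp in per_core.items():
        print(f"core {core}: {temp:.0f} C, headroom {TJMAX_C - temp:.0f} C")
    print("within 10 C of the limit:", hot_cores(per_core))
```

On a live system the same text could be captured from `sysctl dev.cpu | grep temperature` instead of the pasted sample.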
 
Didn't Crivens imply it's normal for CPUs to reach temperatures of 120 or even 140 degrees Celsius?..
The max allowed is in the data sheets for your CPU. It depends, but the consensus here seems to be that 100 is a bit high. I keep the Ryzen in my laptops at 60 tops, for the sake of the batteries. Check the data sheet for your actual CPU, take a bit off from that number, and you should be good. And it never hurts to check the cooling. Many pre-built systems are a nightmare in that regard.
 
Try to keep the CPU temperature below 70-80 °C even under load. Typically, at 95-100 °C or so a CPU will start throttling. Persistently higher temperatures will have other negative effects.
Yesterday, I was compiling chromium while i915kms wasn't loaded. (It was always loaded today). Maybe it sounds silly, but is it possible that yesterday, I succeeded in compiling chromium because that module wasn't loaded?
This could be a contributing factor if this means you are using a graphical desktop. How much real memory do you have? In general, if you are swapping, things will slow down massively. Reduce the number of concurrent jobs, watch swap usage, watch how much memory each process uses, etc. to gain an understanding.

There may still be an issue if the system is crashing but that too may be related to i915kms.
 
I don't know what any of this means and how to interpret this.

There is not much to interpret. Your chip is rated for Tjmax 100C, and you hit 100C. So your case is not able to sufficiently cool it at full system power.
The reason may be an insufficient cpu cooler, or more likely insufficient airflow through the case as a whole.

This in itself should not lead to panics, but over time the heat might spread to the PCH or the memory or whatever, and that may then lead to panics. In any case it is not an advisable mode of operation.
The solution is to keep the temperature some 10 Kelvin away from Tjmax with something like powerdxx.
 
The solution is to keep the temperature some 10 Kelvin away from Tjmax with something like powerdxx,
I'm repeating myself here, but the "solution" IMHO is to fix the cooling (as long as that's possible of course, might be problematic with many notebooks), so persistent "full power" won't cause overheating any more. Check for mechanical issues, check for dirt/dust, check for cables and stuff blocking airflow, etc... and if nothing else helps, start replacing components (e.g. a stronger case fan ...)
 
Mprime ran for more than an hour, and it made my cores hot. Compiling chromium earlier had not made my cores as hot. But running mprime, unlike compiling chromium, didn't crash the computer. I wonder why.

bakul said:
How much real memory do you have? In general if you are swapping, things will slow down massively.
I have 64 GB of RAM and I was never swapping while I was compiling chromium. I always keep PARALLEL_JOBS=1 in poudriere.conf and keep changing the value of MAKE_JOBS_NUMBER. Even when the value of the latter was as low as 4, the computer crashed when I was compiling chromium. (i915kms wasn't loaded at that time). MAKE_JOBS_UNSAFE=yes never crashed my computer yet. But, as I mentioned earlier, there was one time the computer didn't crash even when MAKE_JOBS_NUMBER's value was 20: ccache wasn't involved, and I finished compiling chromium after less than 3 hours of running the compilation process. It happened on 14-STABLE.

grahamperrin said:
I solved the problem of not being able to save core.txt files into /var/crash
I don't know. As I mentioned earlier, I switched to 15-CURRENT. After I did it, this particular problem was gone.

zirias@ said:
might be problematic with many notebooks
It's a desktop tower. Maybe one day I'll open it; I am just a bit afraid of doing it.

Crivens said:
Check the data sheet for your actual CPU, take a bit off from that number
Before I attempt to open the computer tower, maybe I should play with powerdxx for a little bit: something like powerdxx_flags="-H 90:91 -a max" and then check if my computer still crashes when compiling something.
 
4 hours?.. I added memtest64.efi to the boot menu using the command efibootmgr -c -a -l /mnt/memtest64.efi -L memtest and after rebooting, I selected memtest from the boot menu. After I did it, the testing program didn't ask me any questions; it immediately started testing the memory and after less than 1 hour and 30 minutes of testing it, it printed PASS in big letters.
Sometimes it needs more passes to find the errors. YMMV.
 
Yesterday, I opened the desktop tower and noticed there were 5 fans in it. It looks like the CPU utilizes a Sahara air cooler. (I guess Sahara is the name of a brand). There wasn't much dust sitting on the Sahara fan's blades or below the blades. Some of the other fans were dustier.

Below the Thermaltake power supply, there was a square thing with a net in it. The power supply was inside the tower, and the square thing sat below it, outside the tower. I guess it's called a dust filter, but I am not completely sure that's the right word. The filter had collected a lot of dust, and I removed the dust from it. But since it was below the power supply, this thing doesn't participate in lowering the CPU's temperature, right? Or maybe it does?.. I don't know...

I didn't remove any fans from the tower; I tried to clean them using alcohol wipes. I don't know if I did a good job. I wasn't able to reach one of the fans and clean it because it was next to the hard drive cages.

After I was done cleaning, I turned the computer on and tried to compile chromium, and the crash issue didn't go away. I noticed that some of the cores still reached 100 degrees during the compilation process. Are you absolutely sure that when compiling something, CPUs are not supposed to reach the temperature of 100 degrees?... Are you absolutely sure these crashes happen because of overheating, and not because of something else?
 
I can recommend using canned air...
the straw can reach in places where alcohol cannot easily go. Best of luck sneezing your way out of those compilation crashes! ;)
 
Are you absolutely sure that when compiling something, CPUs are not supposed to reach the temperature of 100 degrees?... Are you absolutely sure these crashes happen because of overheating, and not because of something else?

Tjunction is 100 degrees Celsius. https://ark.intel.com/content/www/u...1900k-processor-16m-cache-up-to-5-30-ghz.html

So 100 C is "legal".

Since the crashes appeared the same way with the case open, another source of instability is more likely than overheating.
 
Yesterday, I opened the desktop tower and noticed there were 5 fans in it. It looks like the CPU utilizes a Sahara air cooler. (I guess Sahara is the name of a brand).
This https://www.amazon.co.uk/Ghost-Sahara-Iced-Rainbow-cooler/dp/B01LXP33NZ? I specifically said to check the attachment. Is the heat sink cold in operation? (It should not be, considering how tiny it is.)

Are you absolutely sure these crashes happen because of overheating, and not because of something else?
The obvious test would be to disable powerd and set a lower operating frequency (dev.cpu.0.freq) to get lower temps.
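
A rough sketch of how one might pick that lower frequency: the available steps are listed by `sysctl dev.cpu.0.freq_levels` as MHz/milliwatt pairs. The sample string below is made up for illustration (it is not from the machine in this thread), and the 75% cap is an arbitrary starting point.

```python
# Hypothetical sample of `sysctl -n dev.cpu.0.freq_levels` output.
# Real values differ per CPU; paste your own output here.
SAMPLE_FREQ_LEVELS = (
    "3701/125000 3700/125000 3400/110000 3000/92000 "
    "2600/76000 2200/61000 1800/47000 1200/30000 800/20000"
)


def parse_freq_levels(raw):
    """Return the available frequencies in MHz, highest first.
    Each token looks like 'MHz/milliwatts'; the power part is ignored."""
    levels = []
    for token in raw.split():
        mhz, _, _power = token.partition("/")
        levels.append(int(mhz))
    return sorted(set(levels), reverse=True)


def pick_capped_level(levels, fraction):
    """Highest available level not above `fraction` of the top level.
    Falls back to the lowest level if none qualifies."""
    if not levels:
        raise ValueError("no frequency levels given")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    target = max(levels) * fraction
    candidates = [freq for freq in levels if freq <= target]
    return max(candidates) if candidates else min(levels)


if __name__ == "__main__":
    levels = parse_freq_levels(SAMPLE_FREQ_LEVELS)
    capped = pick_capped_level(levels, 0.75)
    # powerd (or powerdxx) must be stopped first, or it will change the value again.
    print(f"sysctl dev.cpu.0.freq={capped}")
```

If the build survives at the capped frequency with clearly lower temperatures, heat becomes a more likely culprit; if it still crashes, look elsewhere.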
 
You never gave us a description of your computer. Could you state specifically what CPU you have, how much RAM, etc.? Ideally, post your dmesg.
 
I still find mprime confusing. I chose #16 in the menu and the default answers to each question [...] It started running its test, but since I didn't know how to interpret this program's textual output, I stopped it a couple of minutes later.

Did it maybe stop with a message "killed"?
 