ports update gives Unrecoverable machine check exception (OVH Dedicated Server)

Hello everyone. I am having a very particular problem with FreeBSD but more probably with OVH.
This is a fresh install of FreeBSD from an OVH template. Every time I run "portsnap fetch extract", the console shows: "panic: Unrecoverable machine check exception".
Proof:
0EwIelj.png

I contacted them four times, by ticket and by phone, and this is what they told me:
-first time: the server pings normally (duh)
-second time: temperatures are ok (who asked)
-third time: changed CPU/Motherboard/MAC Address
-fourth time: "We did not detect any default (no reboot or freez). The problem would be with your OS."
I left them another ticket pointing out that the OS is a freaking fresh install of FreeBSD 11 from their own templates and that an MCE is a hardware-related problem.
So, I am wondering, is FreeBSD 11 really at fault here? Can it not run on a Ryzen 5600X?

Someone might tell me "just use the package manager", but that shouldn't be the solution, and it really isn't. After just installing monit, the server rebooted itself twice in one day (I checked M/Monit, which notified me that the monit service had stopped twice, and my SSH session, with keepalive enabled, was dropped).

Funny thing is, the same server on their webpage doesn't list FreeBSD anymore as a possible distribution.

Full specs:
CPU: AMD Ryzen 5600X
RAM: ECC 34GB DDR4 2666MHz
MotherBoard: AsRock X470D4U2-2T
Drives: 2x500GB NVMe SoftRAID
 
FreeBSD 11 is EOL in a couple of weeks, so might be worth trying 12.2 or 13 anyway?

I don't mean that in a naggy way, it's up to you what you run, but out of interest - if 12.2 or 13.0 run on the same hardware (I think it's hardware rather than VM?) without any issue then maybe there was something in FreeBSD 11 that tickles an issue.

If 12.2 and/or 13.0 work without issue (which being newer versions, they might), then yes, it would seem there was a problem in 11.4 (but I imagine you are unlikely to get any help with a soon-to-be-EOL version.)

If 12.2 and 13.0 both have exactly the same issue, then you know it's either FreeBSD in general or something at OVH.
 
When I first got this machine I updated it to FreeBSD 13 and had the same issues (and yes, it's real hardware, not a VM). I then reset everything to FreeBSD 11, thinking "eh, maybe I should stick with the OS they provided me". Boy was I wrong.

Did they change the memory module?
No, despite my telling them that when I run either a CPU test or a memtest in rescue mode, the notification says: "Your server hasn't reacted for at least 20 seconds. It is probably down. You can try to refresh the page. If the server crashed while doing a CPU test, it is possible that the cpu is faulty."
That is when they changed the CPU, more specifically:

Motherboard replacement
Date 2021-09-12 10:45:33 CEST (UTC +02:00), Motherboard replacement:
Diagnosis:
HS motherboard

Actions:
Replacing the motherboard.
Updating the MAC address for DHCP.
Server restart.

An issue has been detected with the CPU.

We have replaced the CPU.

result:
DHCP OK. Boot OK. Server on login screen. Ping OK, services started.
I even asked them to try for themselves (to update ports) but I guess they ignored me/can't do that.


Edit:
I was looking at that thread, didn't know the existence of sysutils/mcelog, it sounds awesome.
If it doesn't get stuck at "Updating FreeBSD repository catalogue..." (again, this server is legit trash or probably OVH repos just hang the package manager for a while) maybe before the end of the day I can look for additional errors.

Update: mcelog, run as a daemon, doesn't give me anything. I may be doing something wrong, but as a last resort I'll contact them again and tell them to test the memory more thoroughly.
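
In case it helps anyone checking the same thing without mcelog, here is a minimal Python sketch that scans saved log text for machine-check lines. It assumes the kernel logs such records on lines containing "MCA:" and that the panic text matches the one quoted in this thread; the function name `find_mca_lines` and the default log path are only illustrative choices, not part of any tool.

```python
import re
import sys

# Assumption: machine-check records appear on lines containing "MCA:",
# and the panic message contains the words "machine check".
MCA_PATTERN = re.compile(r"MCA:|machine check", re.IGNORECASE)


def find_mca_lines(log_text):
    """Return the log lines that look machine-check related, in order."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if MCA_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Default path is an assumption; pass a saved dmesg dump as argv[1] instead.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        for line in find_mca_lines(fh.read()):
            print(line)
```

If this prints nothing after a crash, that does not prove the hardware is fine: an unrecoverable MCE can panic the machine before anything reaches the log file.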

Update2 (from OVH, translated from Italian): "In any case, the logs show that after the intervention was carried out, the operating system was installed correctly. Moreover, the reboots do not show up at the infrastructure level, a sign that the cause is not the server and that they are triggered at the software level."
I've done a memory test -> "Your server hasn't reacted for at least 20 seconds. It is probably down. You can try to refresh the page. If the server crashed while doing a cpu test, it is possible that the cpu is faulty." three times, same with CPU test. I am gonna reply to them again about this error and underline the fact that "panic: Unrecoverable machine check exception" is NOT a software error.
BTW the level of competence from OVH technicians is baffling to me.
 
New happy updates: so, I remember having problems with FreeBSD 13 as well, but, at this point, I may be wrong. Anyway, I updated to FreeBSD 12.2-RELEASE and I can finally update ports.
Therefore OVH should stop listing FreeBSD 11 for servers that are unable to run it (I mean, the server I rented does not list it anymore, but it did before!).
 
Thanks for the update, but I wouldn't put anything too important on this server for a while until you've given it a bit of a thrashing to see if you can trigger the fault again. Maybe certain usage in 12.x will tickle the same issue - so it's hidden rather than resolved.

And given the response you are getting from them - maybe they re-seated the RAM or something when you weren't looking - so might not be FreeBSD 12 that's the fix. Not likely, but any mysteries around servers are troubling.

Means you are stuck on 12.x - but if I read the table correctly, that takes you to mid-2024 so that's some time away.
 
Yeah, I am going to run some memtest, compile some stuff or anything cpu heavy trying to trigger the error again.
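
A rough sketch of the kind of userland load I mean, in Python. This is not a substitute for a real memtest or a long ports build, and both function names are made up for illustration. It fills a buffer with a known byte and counts mismatches, then hashes the same payload repeatedly and checks that the digest never changes. On a machine with this fault, the more likely outcome is simply another panic, which is itself the signal.

```python
import hashlib


def memory_pattern_pass(size_bytes, pattern=0xA5):
    """Fill a buffer with one byte value, read it back, count mismatches."""
    buf = bytearray([pattern]) * size_bytes
    # Every byte should still equal the pattern; anything else is corruption.
    return size_bytes - buf.count(pattern)


def cpu_hash_consistency(rounds, payload=b"freebsd-mce-check" * 4096):
    """Hash the same payload repeatedly; a differing digest means corruption."""
    expected = hashlib.sha256(payload).hexdigest()
    return all(
        hashlib.sha256(payload).hexdigest() == expected
        for _ in range(rounds)
    )


if __name__ == "__main__":
    # 256 MiB and 2000 rounds are arbitrary starting points, not tuned values.
    print("memory mismatches:", memory_pattern_pass(256 * 1024 * 1024))
    print("hash results consistent:", cpu_hash_consistency(2000))
```

Running several copies in parallel, one per core, would load the machine harder; the sizes above are only placeholders.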

Also I am not quite sure they changed the RAM. On the Interventions tab I can only see a Motherboard change (so not even the CPU change is listed, which is very odd).

EDIT: Nvm, there's "Component replacement". But no, after the replacement I tried again on FreeBSD11 and I still had the same error.
 
I was going to suggest trying the microcode updates - but you need to get something from ports to do that (if you want to use ports). Not sure if there are updates for AMD/Ryzen - is it just Intel updates?
It says both: https://www.freshports.org/sysutils/devcpu-data/

Or you could install it from packages and see whether portsnap fetch still has issues after the microcode update.

You could try gitup to get the ports on FreeBSD 11 - does that trigger the same issue? It still seems, though, that even if gitup doesn't trigger it, that might just be because you don't hit a particular code path at this point. You might still hit the issue in three months' time via something else tickling it.

I hate these confidence-denting issues on machines at the initial set-up - sometimes you can never quite know if a one-off glitch/solar flare or something that will come back and bite you hard when you least need it. Servers and mysteries aren't a happy combination.

Good luck!
 
Yes, I have already tried devcpu-data (reading this: https://lists.freebsd.org/pipermail/freebsd-current/2018-June/069799.html) and that did not work out, unfortunately. I don't know, right now the server seems completely stable with FreeBSD 12.2 but I am gonna change it ASAP because I am basically constantly thinking about "when is it gonna crash again?".
 
Better safe than sorry.

If they provision another similar server (if you try OVH again) then it will be interesting to see if FreeBSD 11.x works on there. But you might go to a different provider or a non-AMD CPU (if it's anything to do with that) so it won't be easy to prove anything. And all you really want is a stable server.
 
I highly doubt it's AMD, at least not EPYC (I have a Contabo VPS based on an AMD EPYC running FreeBSD 12.2, super smooth, never had a single issue). Then again, the Ryzen 5xxx CPUs were probably just not compatible with FreeBSD 11, but I wanted to upgrade anyway, so..
 