FreeBSD 9 on ESXi 5, clock stops

Hi, I have a FreeBSD 9.0 machine with the GENERIC kernel on which the clock stops ticking after a seemingly random amount of time (days, weeks, minutes, ...). The system doesn't hang, and most services appear to keep working fine; it's just that the clock doesn't move forward, so things like cron stop working. Running date returns the same string over and over. I also can't reboot; it just hangs when I try.

I've disabled ntpd, both on the host and guest. There's another VM running on the host, a Windows server, and it does not have this issue.

A quick Google search suggests other people are having this problem too, but I didn't see a solution (or a cause, for that matter).

Does anyone have any ideas?
 
I am also having this problem, on different types of hardware, different types of storage, on different clusters.

I am also seeing this problem on 8.1, 8.2 and 8.3, so it is not limited to 9.0.
 
Hi,

I have the same issue. My system is 9.0-STABLE, last updated on March 13. I found a suggestion to set the sysctl variable kern.eventtimer.periodic=1 and tried it. Unfortunately, it didn't solve the problem.
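
In case anyone else wants to try it, this is the standard way to set that sysctl at runtime and keep it across reboots (nothing specific to this bug):

Code:
# set at runtime
sysctl kern.eventtimer.periodic=1
# persist across reboots
echo 'kern.eventtimer.periodic=1' >> /etc/sysctl.conf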
 
Hi,

I can also confirm this problem with FreeBSD 8.2-RELEASE.
The only thing that seems relevant to the time freeze is the following set of lines from the VMware log, which appear before each freeze:

Code:
2012-04-20T06:08:53.612Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-04-20T06:09:08.612Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-04-20T06:09:08.612Z| vmx| GuestRpc: app toolbox's second ping timeout; assuming app is down
2012-04-20T06:09:08.613Z| vmx| GuestRpc: Reinitializing Channel 0(toolbox)
2012-04-20T06:09:08.613Z| vmx| GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2012-04-20T06:09:08.613Z| vmx| GuestRpc: Channel 0 reinitialized.
2012-04-20T06:09:08.613Z| vmx| GuestRpc: Channel 0 reinitialized.
2012-04-20T06:12:08.616Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.
2012-04-20T06:12:08.616Z| vmx| Vix: [4537798 guestCommands.c:2194]: Error VIX_E_TOOLS_NOT_RUNNING in VMAutomationTranslateGuestRpcError(): VMware Tools are not running in the guest
 
By the way, ntpd is not running; I've enabled timesync with ESXi via open-vm-tools (to be clear, that hasn't solved the problem either).
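
For reference, assuming your open-vm-tools build ships the vmware-toolbox-cmd utility, timesync is toggled like this:

Code:
vmware-toolbox-cmd timesync status
vmware-toolbox-cmd timesync enable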
 
Try setting up a test VM, verify the problem exists in this VM, then recompile the kernel with the 4BSD scheduler and see if it helps.

The reason I'm suggesting this is that I've had issues with random processes not being assigned any CPU cycles when running FreeBSD 9 under ESXi 5 with the default kernel (ULE scheduler), and recompiling with the 4BSD scheduler solved that issue.

/usr/src/sys/<arch>/conf/VMWARETEST
Code:
include         GENERIC
ident           VMWARETEST

nooptions       SCHED_ULE
options         SCHED_4BSD
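
Then build and install it the usual way (assuming your sources live in /usr/src):

Code:
cd /usr/src
make buildkernel KERNCONF=VMWARETEST
make installkernel KERNCONF=VMWARETEST
shutdown -r now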
 
FWIW, I have a large set of virtual machines running FreeBSD 8.2 amd64 on ESX 4.1 and I'm not seeing this.

Every VM is configured with 1 vCPU and 3GB RAM. ntpd is running. The official VMware Tools package is installed (no open-vm-tools). Kernel is GENERIC, no special sysctls or kern.hz configuration.
 
joel@ said:
FWIW, I have a large set of virtual machines running FreeBSD 8.2 amd64 on ESX 4.1 and I'm not seeing this.

Every VM is configured with 1 vCPU and 3GB RAM. ntpd is running. The official VMware Tools package is installed (no open-vm-tools). Kernel is GENERIC, no special sysctls or kern.hz configuration.

I can confirm the same: the problem only seems to occur on ESXi 5.0 and 5.0U1, and it is easy to reproduce.
 
I can also confirm the problem occurs on FreeBSD 7.2 amd64.

The triggers seem to be heavy CPU & disk I/O, and/or creating snapshots under VMware.
 
Has anyone been able to trigger this on ESX 4? I'd like to be sure that the problem only affects ESXi 5.
 
npl said:
I can also confirm the problem occurs on FreeBSD 7.2 amd64.

The triggers seem to be heavy CPU & disk I/O, and/or creating snapshots under VMware.

I have just tried to recreate that bug with a simple test: about 200 background jobs in bash to load the disk (simple dd: 100 with bs=1 constantly copying small files, and 100 with the default bs copying big files) and about 200 background jobs to load the CPU (simple infinite loops), running for a few hours. No success; the clock still ticks. Does anyone have a reliable method to recreate this bug?
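
For reference, the load generator looked roughly like this (a sketch; the input files are just examples):

Code:
#!/bin/sh
# 100 loops copying a small file with bs=1 (lots of tiny I/O)
i=0
while [ $i -lt 100 ]; do
    ( while :; do dd if=/etc/motd of=/tmp/small.$i bs=1 2>/dev/null; done ) &
    i=$((i + 1))
done
# 100 loops copying a big file with the default block size
i=0
while [ $i -lt 100 ]; do
    ( while :; do dd if=/boot/kernel/kernel of=/tmp/big.$i 2>/dev/null; done ) &
    i=$((i + 1))
done
# 200 pure CPU spinners
i=0
while [ $i -lt 200 ]; do
    ( while :; do :; done ) &
    i=$((i + 1))
done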
 
We are also experiencing this behaviour on ESXi 5 Update 1 with FreeBSD 8.2-RELEASE. It occurs with both SCHED_ULE and SCHED_4BSD, across several VMs on two different ESXi hosts. The VMware VM logs show the same errors that xzkto [post 4] mentioned.

When we were still on ESXi 4.1 we never witnessed this behaviour.

Some VMs run Squid/DansGuardian and others Postfix/MailScanner; both setups have exhibited the issue, but it is more prevalent on those running Squid.
 
My suggestion would be for people with valid support contracts to contact VMware support and file a bug report.
 
For those of you experiencing the problem, please provide:

  1. A description of the VMware host, including system manufacturer and model (if applicable), CPU model and count, HyperThreading/SMT state, physical RAM configuration, and a summary of the storage configuration;
  2. Version of VMware ESXi installed, including the build number (e.g., 5.0.0 623860);
  3. Summary of the VM configuration(s) that have experienced lockups, including CPU/core count, memory size, VM hardware version, and the OS type selected in the configuration;
  4. Whether VMware Tools or open-vm-tools is installed and running in the VMs experiencing lockups;
  5. Whether vMotion is deployed at your site;
  6. Any special tuning applied to the FreeBSD VMs having problems, including anything in loader.conf and sysctl.conf and any special kernel options if not running GENERIC (a sketch for collecting most of this follows below).
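
For the guest-side items, something like this gathers most of it from inside the VM (on the ESXi host itself, vmware -v should report the version and build):

Code:
uname -a
sysctl hw.model hw.ncpu hw.physmem
sysctl kern.timecounter.choice kern.timecounter.hardware
cat /boot/loader.conf /etc/sysctl.conf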

I have the ability to run test cases but I need to know if my test hardware is representative of the systems having issues.

Thanks for any information you can provide.
 
Hi,
  1. Fujitsu BX920 S2, 2 x Intel Xeon X5650, HyperThreading/SMT active, 128 GB RAM. Storage configuration: LSI 1064E + NetApp v3240 NAS via NFSv4.
  2. ESXi 5.0.0 515841
  3. 8 vCPUs (2 virtual sockets with 4 cores per socket), 32 GB RAM, VM hardware version 8. OS type selected: FreeBSD 64-bit.
  4. open-vm-tools-nox11-471268_1 is installed and running (we've experienced problems with and without open-vm-tools).
  5. vMotion is deployed.
  6. Code:
    /boot/loader.conf:
    zfs_load="YES"
    vfs.root.mountfrom="zfs:zroot"
    tmpfs_load="YES"
    kern.hz="100"
    vfs.zfs.txg.timeout="5"
    vfs.zfs.vdev.min_pending="1"
    vfs.zfs.vdev.max_pending="32"
    vfs.zfs.vdev.cache.size="64m"
    vfs.zfs.cache_flush_disable="1"
    vfs.zfs.arc_max="1G"
    
    /etc/sysctl.conf:
    kern.ipc.somaxconn=1024
    net.inet.ip.intr_queue_maxlen=1000
    kern.maxvnodes=250000
    kern.maxfiles=65536
    kern.eventtimer.periodic=1
    
    Kernel options:
    options         VFS_AIO
    options         ZERO_COPY_SOCKETS
    options         DIRECTIO
Let me know if you need more information.
Thanks.
 
ixdwhite said:
For those of you experiencing the problem, please provide:
....

Hi,

I'm not sure about some of the things you asked, but here is some info:
  1. Supermicro X8DTL, 2 x Intel Xeon E5620 2.4 GHz (4 cores per socket), HyperThreading active (I don't know how to check SMT), 32 GB RAM. Storage configuration: local 1 TB SATA-II 300 Western Digital RE3 WD1002FB HDD with VMFS 5.54.
  2. ESXi 5.0.0 469512
  3. 8 vCPUs (4 virtual sockets with 2 cores per socket), 20 GB RAM, VM hardware version 8. OS type selected: FreeBSD 64-bit.
  4. No.
  5. No.
  6. cat /boot/loader.conf:
    Code:
    vmxnet_load="YES"
    vmxnet3_load="YES"

    cat /etc/sysctl.conf:
    Code:
    net.inet.carp.allow=1
    net.inet.carp.preempt=1
    net.inet.carp.log=1
    net.inet.carp.arpbalance=1

    diff GENERIC PF_CARP_CONFIG:
    Code:
    device carp
    device pf
    device pflog
    device pfsync
    
    options         ALTQ
    options         ALTQ_CBQ
    options         ALTQ_RED
    options         ALTQ_RIO
    options         ALTQ_HFSC
    options         ALTQ_PRIQ
    options         ALTQ_NOPCC
If you need any more information - feel free to ask.
 
Time has just stopped again on one of our virtual servers; other virtual servers with the same configuration are still working, even on the same physical host. This time (no pun intended) I decided to do some experiments instead of restarting.

I checked the timecounters in sysctl and found that kern.timecounter.tc.HPET.counter is not changing anymore. So I switched the active timecounter (kern.timecounter.hardware) from HPET (the default) to ACPI-safe, and time went on again.
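
For anyone who wants to check the same thing, roughly:

Code:
# list the available timecounters and the one in use
sysctl kern.timecounter.choice
sysctl kern.timecounter.hardware
# read the HPET counter twice; if the value doesn't move, it's frozen
sysctl kern.timecounter.tc.HPET.counter
sysctl kern.timecounter.tc.HPET.counter
# switch the active timecounter at runtime
sysctl kern.timecounter.hardware=ACPI-safe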

I searched all the logs I could find (/var/log/*, dmesg and more) and found nothing interesting, except that vmware-tools didn't really die (contrary to the VMware logs I posted before); it just went to sleep. When time started ticking again, vmware-tools resumed working, with nothing interesting in the debug log.

Now I have a virtual server with one frozen timer (HPET); the other timers (i8254, ACPI-safe, TSC) are still working (most are disabled, but their counters are changing). Does anyone know if it is safe to do what I did? Can this lead to a more serious crash than one more time freeze?

P.S. If anyone wants to run some tests on this frozen timer, post your suggestions here. I will try them, but I will not do anything dangerous; it is a production server.
 
Changing the timecounter at runtime is safe. Good to know that it seems to be related to HPET. That suggests the problems people are having with other FreeBSD releases are a different issue, since 8.x does not typically use HPET as a timecounter.

Odd that ACPI-safe was the choice for you. On my VMs, ACPI-fast shows up as the next preferred clock. Is your VM server particularly busy?

Here is the kern.timecounter.choice on my test VMs for this issue:

kern.timecounter.choice: TSC(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)

The kernel chooses the highest-scored timecounter at boot unless overridden.
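
To override it, set the timecounter explicitly, e.g. in /etc/sysctl.conf:

Code:
kern.timecounter.hardware=ACPI-fast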

Are you (or anyone else experiencing the problem) using pmcstat(8) or similar tools at runtime? These use the HPET hardware as well and could be conflicting with it or triggering a bug somewhere.
 
ixdwhite said:
That probably means the problems people are having with other FreeBSD releases are likely a different problem since 8.x does not typically use HPET as a timecounter.

I didn't understand that: we are using FreeBSD 8.2-RELEASE with minimal kernel tuning (CARP, PF, etc.) and HPET is the default on all our virtual servers.

I have no idea why we have ACPI-safe instead of ACPI-fast; it seems to be the default on our hardware, and even a fresh FreeBSD GENERIC install on the same server uses the same counters. Our servers are not usually busy, but they experience huge spikes a few hours a day. Our counters are TSC(-100) i8254(0) ACPI-safe(850) HPET(900) dummy(-1000000), if I remember correctly. Sorry, maybe I didn't get this either; I'm not really an administrator, just a programmer.

We are not using pmcstat(8) directly, but we do use Zabbix for monitoring, and pmcstat looks like something Zabbix might call to generate some of its statistics. I will check on Monday; I'm off for the weekend now.

Thank you for the ideas. I will try to identify all the software we are using and check whether any of it does low-level access to the timecounters, but I really doubt it.

By the way, is it possible that this bug is somehow related to CPU P- or S-states (ESXi seems to have the P-state option turned on by default)? The last time the timecounter stopped, our server was nearly idle (but maybe I missed a spike just before it did). I have no idea how it could be related; I just remember reading something like that.
 
Hi,

We have been experiencing the same problem for a few days now. We upgraded to ESXi 5 two months ago, and it was rather odd that two machines suddenly started behaving this way. Changing the timer to ACPI-safe worked for us as well. It is interesting to note that 60% of our machines still use VMware virtual hardware version 7, and those VMs (we use FreeBSD 7.4 and 8.1) all selected ACPI-safe by default. All VMs with version 8 hardware defaulted to HPET. The clock stopped on only two machines, and both were hardware version 8 VMs.
 
I spot-checked several 8.x machines on real hardware here and they are all using ACPI-fast with HPET as the second choice.

It wouldn't surprise me if the ACPI timecounter stability test (which decides whether to use ACPI-fast or ACPI-safe) was confused on heavily loaded systems, concluded the method was unstable, and fell back to ACPI-safe. Since ACPI-safe has a low base score, HPET would win out in those instances.

The VMware ESXi 5 HPET doesn't seem to implement actual performance counters, so I suspect the HPET emulation on ESXi 5 is not entirely stable, and shifting to another timecounter is the appropriate workaround. Various places in FreeBSD already know if they are running on VMware; the acpi_hpet driver might need a similar check to drop its score when running in that environment.
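
(As an aside, the usual way such code detects VMware is via the SMBIOS strings; from a shell you can see what a driver check would key on, assuming a standard ESXi guest:)

Code:
kenv smbios.system.product
# expected to print "VMware Virtual Platform" on an ESXi guest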

I ran a buildworld with pmcstat enabled in one of the test VMs and it at least finished, though the counters all returned 0 for 'instructions', which should work on everything (it did on real hardware).

If you have a VMware support contract, it would be useful to open a bug report, just in case VMware is already working on this or doesn't know there is a problem.
 