FreeBSD 9 on ESXi 5, clock stops

eezzeee · May 19, 2012

We have opened a ticket for this, and the issue has already been passed up the chain to engineering. From what I understand, VMware has been informed of this issue by other customers as well, and they are aware of this thread.

duncan2386 · May 22, 2012

We have a support contract, so we raised a ticket. I had this reply from VMware today:

I just wanted to get in touch with you to let you know that I've reviewed the logs and information you have provided. I've sent the details on to our Engineering team - it appears other customers are experiencing this issue and a case was only opened with Engineering last week regarding this issue. The same workaround you found (manually force the guest OS to use the ACPI-safe source) appears to be working for other customers as well.

We are in the process of drafting a KB article for this issue while Engineering work on a fix.

They have been very helpful and responsive throughout which is nice!

bach · May 23, 2012

Good news.
Thanks.

throAU · May 25, 2012

joel@ said:
FWIW, I have a large set of virtual machines running FreeBSD 8.2 amd64 on ESX 4.1 and I'm not seeing this.

Every VM is configured with 1 vCPU and 3GB RAM. ntpd is running. The official VMware Tools package is installed (no open-vm-tools). Kernel is GENERIC, no special sysctls or kern.hz configuration.

I have a few VMs running under ESX 4.1 at the moment - FreeBSD 7.4 x86, FreeBSD 8.2 x64, and FreeBSD 8.1 x64.

None have experienced this issue, and my uptime is generally >180 days or so between reboots. All kernels on them are GENERIC.

I am following this thread with interest, as I'm likely a month or so off upgrading to vSphere 5 here myself.

edit:
The 7.4 machine is running open-vm-tools, the 8.x machines are running VMware tools.

frijsdijk · Jun 10, 2012

I just had the same issue.

ESXi 5.0
FreeBSD 9.0, 64bit, GENERIC kernel. open-vm-tools-471268_1 installed.

Code:

[root@srv03 /home/admin]# kldstat
Id Refs Address            Size     Name
 1   25 0xffffffff80200000 11cd9b0  kernel
 2    1 0xffffffff813ce000 203d70   zfs.ko
 3    2 0xffffffff815d2000 5c50     opensolaris.ko
 4    1 0xffffffff815d8000 a80      accf_data.ko
 5    1 0xffffffff815d9000 17d8     accf_http.ko
 6    1 0xffffffff81812000 159f     vmmemctl.ko
 7    1 0xffffffff81814000 c16e     ipfw.ko
 8    1 0xffffffff81821000 6dda     ipmi.ko
 9    1 0xffffffff81828000 889      smbus.ko

[root@srv03 /home/admin]# cat /boot/loader.conf
accf_http_load="YES"
accf_data_load="YES"
zfs_load="YES"

[root@srv03 /home/admin]# cat /etc/sysctl.conf
# $FreeBSD: release/9.0.0/etc/sysctl.conf 112200 2003-03-13 18:43:50Z mux $
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0
net.inet.ip.fw.dyn_buckets=65536
net.inet.ip.fw.dyn_max=65536
net.inet.ip.fw.dyn_ack_lifetime=120
vm.pmap.shpgperproc=1000

The clock (HPET timecounter) stopped ticking. I can log in, but the server is not serving requests, and ntpd is using 100% CPU.

Code:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
72566 root        1 102    0 22332K  2308K CPU2    2   4:53 100.00% ntpd
72396 www         1  21    0   297M 49968K select  0   0:02  0.98% httpd

Code:

[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:12 CEST 2012
[root@srv03 /home/admin]# date

Code:

[root@srv03 /home/admin]# sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.HPET.counter: 1392653989
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 2995577
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 17227
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 1427630916
kern.timecounter.tc.TSC.frequency: 2266747000
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
[root@srv03 /home/admin]# sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.HPET.counter: 1392653989
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 8039395
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 60099
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 326655140
kern.timecounter.tc.TSC.frequency: 2266747000
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1

So the HPET counter is stuck: it reads 1392653989 in both snapshots, while the ACPI-fast, i8254 and TSC counters keep changing.
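The check done by hand above (two `sysctl kern.timecounter` snapshots, compare the counter) can be scripted. This is a minimal sketch; `tc_is_stuck` is a hypothetical helper name, not part of FreeBSD, and it assumes the `kern.timecounter.hardware` and `kern.timecounter.tc.<name>.counter` OIDs shown in the listing. Because it compares two back-to-back reads, a narrow counter such as the 16-bit i8254 could, in rare cases, match by coincidence.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): check whether the active
# timecounter has stopped advancing.
# Exit status: 0 = stuck, 1 = advancing, 2 = sysctl(8) lookup failed.
tc_is_stuck () {
        # Name of the timecounter the kernel is using, e.g. HPET.
        hw=$(sysctl -n kern.timecounter.hardware) || return 2
        oid="kern.timecounter.tc.${hw}.counter"
        # Two back-to-back reads; a running counter (MHz to GHz) differs.
        first=$(sysctl -n "$oid") || return 2
        second=$(sysctl -n "$oid") || return 2
        if [ "$first" = "$second" ]; then
                echo "timecounter $hw appears stuck at $first"
                return 0
        fi
        return 1
}
```

For example, `tc_is_stuck && echo "switch timecounters"` only prints when the active counter did not move between the two reads.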

When I do:

Code:

[root@srv03 /home/admin]# sysctl kern.timecounter.hardware=ACPI-fast
kern.timecounter.hardware: HPET -> ACPI-fast
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:16 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:16 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:16 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:17 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:17 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:18 CEST 2012
[root@srv03 /home/admin]# date
Sun Jun 10 08:44:18 CEST 2012

...the clock starts ticking again. I then used ntp to resync the time, and all is normal (Nagios recovers as well).
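The same switch can be scripted. Posters in this thread see either `ACPI-fast` or `ACPI-safe`, so this sketch picks whichever one appears in `kern.timecounter.choice` before setting `kern.timecounter.hardware`. `pick_acpi_tc` and `switch_to_acpi_tc` are hypothetical names; the change applies to the running kernel only (it needs root) and is lost on reboot.

```shell
#!/bin/sh
# Hypothetical helper: print the first ACPI timecounter offered by the kernel.
# kern.timecounter.choice looks like:
#   TSC(-100) i8254(0) ACPI-fast(900) HPET(950) dummy(-1000000)
pick_acpi_tc () {
        choice=$(sysctl -n kern.timecounter.choice) || return 2
        for tc in ACPI-fast ACPI-safe; do
                # Pad with spaces so each entry is matched as "<name>(".
                case " $choice " in
                        *" $tc("*) echo "$tc"; return 0 ;;
                esac
        done
        return 1
}

# Hypothetical helper: switch the running kernel to that timecounter.
switch_to_acpi_tc () {
        tc=$(pick_acpi_tc) || { echo "no ACPI timecounter available" >&2; return 1; }
        sysctl kern.timecounter.hardware="$tc"
}
```

After switching, the clock still needs to be resynced once (for example with ntp, as above), because the time lost while the counter was frozen is not recovered automatically.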

joel@ · Jun 21, 2012

Any news?

Riplakish · Jul 20, 2012

Any update to this? I still have multiple VMs hanging after switching the clock to ACPI-fast...

xzkto · Jul 23, 2012

Riplakish said:
Any update to this? I still have multiple VMs hanging after switching the clock to ACPI-fast...

Strange. After we changed our timecounter to ACPI (we have ACPI-safe), we have had close to two months of uptime without any problems. Our VMs never hung either; only the time stopped. Are you sure your problem is with the timecounter?

frijsdijk · Jul 24, 2012

No problems here either anymore with ACPI-fast.

The most recent VM I deployed (AMD64, 9.0) selected "kern.timecounter.hardware: TSC" by itself.

bach · Jul 24, 2012

I can confirm that ACPI-fast has solved this problem for us. We have a huge farm of FreeBSD VMs and haven't seen any hangs for a long time.

grahamb413 · Aug 11, 2012

Setting kern.timecounter.hardware to ACPI-fast also fixed this for me. However, after a reboot the value goes back to what it was before (HPET). How do I make this fix permanent?

ESXi 5.0.0 (Dell image)
FreeNAS 8.2.0 Release (Running on FreeBSD)

joel@ · Aug 12, 2012

grahamb413 said:
Setting kern.timecounter.hardware to ACPI-fast also fixed this for me. However, after a reboot the value goes back to what it was before (HPET). How do I make this fix permanent?

Add kern.timecounter.hardware=ACPI-fast to /etc/sysctl.conf.
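A minimal sketch of doing that idempotently, using a hypothetical `persist_tc` helper: it drops any existing `kern.timecounter.hardware=` line and appends the new one, so running it twice still leaves a single entry. It writes the result back with `cat` rather than `mv`, so the existing file keeps its owner and permissions.

```shell
#!/bin/sh
# Hypothetical helper: persist the timecounter choice in a sysctl.conf(5)
# file, leaving exactly one kern.timecounter.hardware line.
# Usage: persist_tc /etc/sysctl.conf ACPI-fast   (needs root for /etc)
persist_tc () {
        conf=$1
        tc=$2
        tmp=$(mktemp "${conf}.XXXXXX") || return 1
        if [ -f "$conf" ]; then
                # Keep every line except an existing kern.timecounter.hardware=...
                # grep exits 1 when no lines remain (fine), 2 on a real error.
                rc=0
                grep -v '^[[:space:]]*kern\.timecounter\.hardware[[:space:]]*=' \
                        "$conf" > "$tmp" || rc=$?
                if [ "$rc" -gt 1 ]; then
                        rm -f "$tmp"
                        return 1
                fi
        fi
        echo "kern.timecounter.hardware=$tc" >> "$tmp"
        # Copy back over the original so its owner and mode are preserved.
        cat "$tmp" > "$conf" && rm -f "$tmp"
}
```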

Bado · Aug 16, 2012

FWIW, I have been experiencing this issue as well, only not with ESXi 5: I've been having the problem with VMware Fusion on my Mac. This has been happening with FreeBSD 9.0 and 9.1. The 9.1 VM was a brand-new instance that I had just created and barely set up. It ran for a few days, then the date froze. I don't recall ever having the issue on Fusion with 8.x instances, but I haven't run any at home for a year or more.

I'm trying kern.timecounter.hardware=ACPI-fast now to see whether the problem goes away.

Of note, I have FBSD 8.0, 8.1, 8.2, 9.0, and 9.1 instances running in ESXi 4 (or 4.1?) at the office and none of them has ever had this problem (in 1.5 years of running multiple virts here).

joel@ · Aug 17, 2012

Bado said:
Of note, I have FBSD 8.0, 8.1, 8.2, 9.0, and 9.1 instances running in ESXi 4 (or 4.1?) at the office and none of them has ever had this problem (in 1.5 years of running multiple virts here)

Same here. ESX 4.x works really well with FreeBSD 7-10.

glocke · Sep 5, 2012

Any news about that KB article? A quick search turned up nothing (only the timekeeping PDF for ESX 4.0 and the timekeeping best practices for Linux guests).

edit:
Forgot to mention: today a FreeBSD 7.1 guest (yeah, kinda old...) with ACPI-safe as its timecounter, open-vm-tools-nox11-148847 installed and timesync enabled fell several minutes behind. It runs under ESXi 5.0.0, build 768111. There was heavy disk I/O, probably partly from other guests on the same SAN volume. ntpd is not running. (By the way, when did VMware change its opinion on running ntpd in a *NIX guest? Some years ago the recommendation was not to use ntpd but to sync via VMware Tools instead...)

joel@ · Sep 12, 2012

Has anyone tried ESXi 5.1 with FreeBSD 9.0 yet? It's the first ESXi release to officially support FreeBSD 9.0.

I'd be really interested in hearing about any test results.

throAU · Sep 13, 2012

joel@ said:
Has anyone tried ESXi 5.1 with FreeBSD 9.0 yet? It's the first ESXi release to officially support FreeBSD 9.0.

I'd be really interested in hearing about any test results.

I'm building a test lab at the moment which will likely have ESXi 5.1 on it.

Hopefully I'll get it done tomorrow or early next week, and then I'll spin up a VM.

joel@ · Sep 18, 2012

Any updates?

throAU · Sep 19, 2012

Sorry, test lab has been held up slightly, but I should be able to install this week and leave it running for a bit.

I've had the noob working through our build documentation to build the test lab, and we uncovered a few documentation problems that held things up.

edit:
Just confirmed: our lab was built with ESXi 5.0. I'll get it upgraded to 5.1 soon.

throAU · Sep 20, 2012

Upgraded the test lab to 5.1 this morning, going to give FreeBSD 9 a go and see how things are.

Will install VMware tools, as I'm guessing most would run them in their VMs, and turn time sync (NTP and Tools) OFF.

This should replicate the problem, if it exists, yes?

edit:
Installed a VM (FreeBSD 9.0 release, latest VMware tools installed), will check it out in a couple of hours and see if the clock is still running...

Bado · Sep 21, 2012

My home virts (the ones having problems) do not have tools installed, and are running with NTP turned on. I install a basic system from ISO, then just install ports here and there as I need them.

At the office (no problems; fbsd9 on esxi 4.x), we do have vmware-guestd and vmware-kmod installed from ports.

So, I'd be interested in seeing any issues without tools installed, and NTP on.

Bado · Sep 21, 2012

I stand corrected: my home virts also have VMware Tools installed; however, NTP is on.

throAU · Sep 21, 2012

Just on the VMware Tools: you really want them installed for the balloon driver to work, if nothing else. Essentially, the balloon driver forces the VM to page (as it sees fit) when the host is under memory pressure, and the host reclaims that real memory. Without it, the host swaps parts of the running VM out to disk that may actually be active, potentially causing massive performance issues and page thrashing on the host.

throAU · Sep 21, 2012

So, it's been 24+ hours.

GENERIC kernel, FreeBSD 9.0 Release, ESXi 5.1 (downloaded on Tuesday), VMware Tools installed, no NTP or VMware time sync turned on.

So far, my clock is still working fine.

Obviously the VM is under no load (vanilla install with no additional services running), but so far so good.

I'm out of the office as of this evening until Thursday next week, but hopefully I'll be able to confirm whether or not it is still ticking when I get back.

glocke · Sep 24, 2012

I have a test environment running just now:
ESXi 5.1.0, build 799733, with FreeBSD 9.0, official VMware Tools installed, no clock sync enabled, no ntpd running. A small script runs sysbench to simulate some I/O:

Code:

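#!/bin/sh
# Loop over the sysbench fileio test modes forever to generate disk I/O.

# SIGKILL cannot be caught, so trap only signals the script can handle;
# TERM is what a plain kill(1) sends.
trap bail HUP INT QUIT TERM

bail () {
        echo "caught signal"
        # Remove the test files sysbench created before exiting.
        sysbench --test=fileio cleanup
        exit 1
}

while true; do
        for m in seqwr seqrewr seqrd rndrd rndwr rndrw; do
                sysbench --test=fileio prepare
                sysbench --test=fileio --file-test-mode="$m" run
                sysbench --test=fileio cleanup
                sleep 30
        done
done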

I will let it run for 24h and report back any issues with the clock.
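To catch a stalled clock during a long run like this, one option is to check the guest from a separate, healthy machine, since an in-guest check that relies on `sleep` or cron may stall along with the guest's own clock. This sketch assumes passwordless `ssh` access to the guest; the helper names and the 5-second tolerance are illustrative choices, not anything VMware or FreeBSD prescribes.

```shell
#!/bin/sh
# Hypothetical check, run from another host: compare the guest's clock
# with the local one over ssh and flag large offsets.

# skew_too_large LOCAL REMOTE MAX: succeed if |REMOTE - LOCAL| > MAX seconds.
skew_too_large () {
        skew=$(( $2 - $1 ))
        if [ "$skew" -lt 0 ]; then
                skew=$(( -skew ))
        fi
        [ "$skew" -gt "$3" ]
}

# check_guest_clock HOST [MAX]: exit 0 if in tolerance, 1 if off, 2 if ssh failed.
check_guest_clock () {
        host=$1
        max=${2:-5}                     # tolerance in seconds (assumed)
        remote=$(ssh "$host" date +%s) || return 2
        local_now=$(date +%s)
        if skew_too_large "$local_now" "$remote" "$max"; then
                echo "WARNING: $host clock is off by more than ${max}s (remote $remote, local $local_now)"
                return 1
        fi
        return 0
}
```

Run periodically (for example `check_guest_clock srv03`), a frozen guest shows up as an offset that keeps growing between checks, and a slowly drifting one like the 7.1 guest above shows up once it crosses the tolerance.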