ESXi 4.1 - mpt0 errors

Ladies/Gents, just recently I have noticed the following errors in two of my FreeBSD 8.3 VMs (a name server and an MX server - both running as ESXi 4.1 guests). I'll post output from the MX server as it appears more often (and obviously does more I/O).

Example:

Code:
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: request 0xffffff80002378b0:9223 timed out for ccb 0xffffff000198b000 (req->ccb 0xffffff000198b000)
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: attempting to abort req 0xffffff80002378b0:9223 function 0
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: completing timedout/aborted req 0xffffff80002378b0:9223
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: abort of req 0xffffff80002378b0:0 completed
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: request 0xffffff80002324e0:9224 timed out for ccb 0xffffff003eaca000 (req->ccb 0xffffff003eaca000)
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: attempting to abort req 0xffffff80002324e0:9224 function 0
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: completing timedout/aborted req 0xffffff80002324e0:9224
Jan 16 01:25:42 <kern.crit> mx2 kernel: mpt0: abort of req 0xffffff80002324e0:0 completed

Any ideas what may be causing this?

I am running the GENERIC kernel and have VMware Tools installed. The system is kept up to date with freebsd-update.

Physical hardware is a Cisco B200 blade in an 8-slot UCS chassis; the physical storage is a NetApp FAS2240 connected via NFS over 10 Gb fibre through a Cisco 4507.

The virtual storage is just VMware-provided virtual disks, using LSI Logic Parallel emulation.

The NetApp is not running anywhere near flat out in terms of I/O, so I'm pretty sure it shouldn't be timing out due to I/O throttling. All our user ports on the 4507 are running at 100 Mb PoE (plugged into old phones, which are limited to 100 Mb), with only eight 10 Gb ports in use and roughly 36 ports running at 1 Gb. It has dual Sup 7s with SSO, so there should be no problem there either.

I'm not seeing storage errors on anything else.


Any idea where to start looking to track this down? The machine had 188 days of uptime at that point and, stupidly, I rebooted it.

However, my name server (also exhibiting the issue, to a lesser extent due to less I/O) has not yet been rebooted (it also has 188 days of uptime). If there are any diagnostics I should perform prior to rebooting, I can run them on that machine.
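
For reference, this is roughly what I was planning to capture beforehand - just my own guess at a useful baseline, so happy to run anything else people suggest:

Code:
# vmstat -i
# camcontrol devlist -v
# camcontrol tags da0 -v
# sysctl kern.cam
# iostat -x 1 10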


Cheers
 
This was probably caused by some firmware issue, but I can't confirm it.

Please show the output of these commands:
Code:
# camcontrol tags [device id]
# camcontrol devlist
 
I gather you mean the device ID for the system (only) virtual disk (da0)?

Code:
mx2# camcontrol tags da0
(pass0:mpt0:0:0:0): device openings: 127
mx2# camcontrol devlist
<VMware Virtual disk 1.0>          at scbus0 target 0 lun 0 (pass0,da0)
mx2#
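
For what it's worth, if this turns out to be queue-depth related, I understand the tag count can be lowered with camcontrol - a sketch of what I assume the syntax would be (going by the camcontrol man page, untested on these VMs):

Code:
# camcontrol tags da0 -N 32 -v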
 
Also - I have confirmed that I wasn't seeing these errors prior to last night, so perhaps it is an uptime-related thing. I'll leave the name server running and monitor the MX server for problems post-reboot.

Both machines have virtually identical virtual hardware, run on the same physical infrastructure at the same patch level, etc., and both started generating errors at roughly the same time.

I did not make any changes to the host environment in the last 2 weeks or so (last change would have been VMware ESXi updates on the hosts).
 
throAU said:
I gather you mean the device ID for the system (only) virtual disk (da0)?
Correct :)

Then keep watching for any further irregularities, and report back anyway so they can be analysed more calmly. Use SMART to check the disks attached to the controller; it would be useful if you posted the results.
 
SMART data? Isn't that only valid for the actual physical drives? They are several levels of abstraction away and in RAID-DP (i.e., Netapp's variant of RAID6). I have zero failed drives in the Netapp and no alarms.

Even if the Netapp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?
 
throAU said:
SMART data? Isn't that only valid for the actual physical drives? They are several levels of abstraction away and in RAID-DP (i.e., Netapp's variant of RAID6). I have zero failed drives in the Netapp and no alarms.

Even if the Netapp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?

Sure, the guest can only see the virtual hardware, not the physical hardware. Check the physical drives. Reading the VirtualBox troubleshooting documentation is also recommended.
 
The physical drives are shared with many other VMs that are not showing errors - the NetApp is an enterprise-grade SAN with 48 drives, none of which are showing errors.

The NetApp will proactively fail disks that have SMART errors (it does a nightly scrub); as far as the clients are concerned, they won't see any change in service.

I actually had a drive failure in the FAS some months ago and did not see these errors in my logs then.
 
I have plenty of 8.X VMs running in my ESX 4.1 environment, also using NetApp for storage. My ESX machines are all running with the latest patches from VMware. I do not see these errors.

Uptime varies a lot, but several VMs have around 200 - 300 days of uptime.

I suggest you send a mail to stable@freebsd.org
 
joel@ said:
I have plenty of 8.X VMs running in my ESX 4.1 environment, also using NetApp for storage. My ESX machines are all running with the latest patches from VMware. I do not see these errors.

Uptime varies a lot, but several VMs have around 200 - 300 days of uptime.

I suggest you send a mail to stable@freebsd.org


Cheers,

I will monitor both VMs over the next couple of days to see if the problem persists, and will email stable@ as appropriate.

I've been running FreeBSD on ESX for about six years now myself (these exact VMs for two to three years) and have never seen this before either, so it is quite peculiar.
 
throAU said:
SMART data? Isn't that only valid for the actual physical drives? They are several levels of abstraction away and in RAID-DP (i.e., Netapp's variant of RAID6). I have zero failed drives in the Netapp and no alarms.

Even if the Netapp had several failed disks, ESXi shouldn't even know, as it is connected via NFS?

Some Dell drives attached to a SAS 6/iR controller allow SMART to be used on the virtual disks to read the raw values. Linux exposes the physical disks attached to the virtual controller as /dev/sgX devices, so it is possible to use smartmontools on them.

Tested on Linux.

FreeBSD should implement this option too.
 
cpu82 said:
Some Dell drives attached to a SAS 6/iR controller allow SMART to be used on the virtual disks to read the raw values. Linux exposes the physical disks attached to the virtual controller as /dev/sgX devices, so it is possible to use smartmontools on them.

Tested on Linux.

FreeBSD should implement this option too.



The disks are not in the same physical machine... the ESXi host is running VMs off a datastore served via NFS from the NetApp.

This setup is based on the fully VMware-supported Cisco/NetApp/VMware FlexPod architecture.
 
You are right, but I suppose it is worth looking at how this could be implemented, although it would not be easy on a different architecture. Some limited pass-through functionality would have to exist before any code could be written.

From the smartmontools FAQ:
Do smartctl and smartd run on a virtual machine guest OS?

Yes and no. Smartctl and smartd run on a virtual machine guest OS without problems. But this isn't very useful because the virtual disks do not support SMART. If a guest OS disk is configured as a raw disk, this only means that its sectors are mapped transparently to the underlying physical disk. This does not imply the ATA or SCSI pass-through access required to access the SMART info of the physical disk. Even the disk's identity is typically not exposed to the guest OS.
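
On a physical host that does have direct access to the controller, the pass-through device types in smartctl are what the FAQ is referring to - a rough sketch, with the device names only as examples:

Code:
# smartctl -a /dev/ada0
# smartctl -a -d sat /dev/da0

The first is a directly attached SATA disk; the second uses SAT pass-through for a SATA disk sitting behind a SAS HBA. Neither will work from inside the guest, for exactly the reason the FAQ gives.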
 
throAU said:
Update: errors have not recurred after a reboot of only one of the VMs.

Weird.

Even though the mpt0 errors haven't occurred again, I would recommend keeping a detailed record of the I/O symptoms that disappeared from your MX server and reporting them anyway, to avoid future headaches.
 
Same symptoms here with a Samba file server on FreeBSD 8.3-p3 on VMware ESXi 4.1 build 348481 - the server has a 3ware 9690SA controller configured as RAID5 with 4 x 1 TB Samsung Spinpoint F1 (HE103UJ) drives, so it is a certified controller with certified HDDs. Neither the CPU nor the disks were anywhere near 100%, and incoming traffic was about 15 MB/s - the same as the disk usage in MB/s. Any idea what is behind this problem? Yesterday I rebooted the VM, checked the virtual disks in the VM and checked the physical disks in the RAID - everything was OK.
 
I saw that FreeBSD 8.3 is not supported on ESXi 4.1U3, but it is supported on 5.1. So I upgraded the hypervisor to 5.1 (with the e1000 and e1000e drivers from 5.0U1 - the older ones don't have the bug), upgraded vmtools-freebsd in the VM, and added the following to /boot/loader.conf:
Code:
hw.pci.enable_msi="0"
hw.pci.enable_msix="0"
which addresses the issue when IRQs are shared among mpt0/em0/emX.

The source is here. We will see how it performs tomorrow and through the end of the week; I will post my experience here. ;)
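
To check whether the change actually helps, I am assuming the interrupt assignments can be inspected from inside the guest - shared lines show up as multiple device names against the same irq entry:

Code:
# vmstat -i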

throAU said:
Ladies/Gents, just recently I have noticed the following errors in two of my FreeBSD 8.3 VMs (a name server and an MX server - both running as ESXi 4.1 guests). I'll post output from the MX server as it appears more often (and obviously does more I/O).

 
My name server now has 238 days of uptime and has not seen this issue recur.

However, one thing I have just remembered - NetApp recommend installing some "guest OS" tools that modify certain I/O time-out settings for Windows and Linux. This is for when the FAS has a controller failover (during a firmware upgrade, after a hardware failure, etc.).

I'll see if I can dig out more info on what those time-outs are and how they could be set on FreeBSD.

As far as I am aware - I've never had a NetApp HA failover. But maybe this happened and caused the issue...
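
If the FreeBSD equivalent turns out to be the CAM disk time-out, I'd guess at something along these lines - purely my assumption, not from NetApp documentation, and 190 seconds is just the figure I seem to recall their Linux/Windows tools using:

Code:
# in /boot/loader.conf (or via sysctl at runtime):
kern.cam.da.default_timeout="190"

I'll confirm against the actual NetApp guidance before changing anything.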
 
Mine is on its second day running without issue. FreeBSD 8.3 is unsupported on ESXi 4.1 - in my case I strongly suspect that was the problem; even ESXi 4.1U3 crashed under high I/O (just a simple background fsck_ufs on a 1 TB /home). It does not happen any more on ESXi 5.1, so I think they have probably fixed it.

throAU said:
My name server now has 238 days of uptime and has not seen this issue recur.

 
Hmm.

I've never had any crashes of ESXi or of the guests. I also have not applied the loader tunables above yet.

What do those settings actually do?
 