This is my first post here on the forums, and I'm a relatively new FreeBSD sysadmin. We ordered a Dell R710 with two RAID controllers: a PERC H700 for the 6 internal drives (mfi0), and a PERC H800 for a 24-drive direct-attached storage enclosure (mfi1). I have a zpool on 4 of the internal drives on mfi0 and a zpool across all 24 drives in the mfi1 DAS.
From time to time I have noticed that reads and writes to the mfi1 DAS hang for a long time. While it is hung, timeout error messages show up in /var/log/messages, and after some seemingly arbitrary amount of time we're back to normal operation. Days may pass before it shows up in the logs again, and the reported hang can be anywhere from roughly 30 to 15,000 seconds; I have not yet discerned a pattern. Here is an example:
Code:
...
Jan 2 06:06:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 3959 SECONDS
Jan 2 06:06:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 3989 SECONDS
Jan 2 06:07:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4019 SECONDS
Jan 2 06:07:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4049 SECONDS
Jan 2 06:08:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4079 SECONDS
Jan 2 06:08:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4109 SECONDS
Jan 2 06:09:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4139 SECONDS
Jan 2 06:09:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4169 SECONDS
Jan 2 06:10:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4199 SECONDS
Jan 2 06:10:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4229 SECONDS
Jan 2 06:11:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4259 SECONDS
Jan 2 06:11:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4289 SECONDS
Jan 2 06:12:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4319 SECONDS
Jan 2 06:12:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4349 SECONDS
Jan 2 06:13:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4379 SECONDS
Jan 2 06:13:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4409 SECONDS
Jan 2 06:14:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4439 SECONDS
Jan 2 06:14:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4469 SECONDS
Jan 2 06:15:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4499 SECONDS
Jan 2 06:15:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4529 SECONDS
Jan 2 06:16:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4559 SECONDS
...
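The SECONDS value in those messages keeps climbing while a single command is stuck and resets once a new one gets wedged, so one quick way to measure how long each incident lasted is to watch for the counter dropping. A rough sketch, assuming every hang logs in exactly the format above:
Code:
# Print when each hang ended and how long it lasted. A drop in the
# SECONDS field marks the start of a new incident, so the previous
# line's value is the final duration of the incident before it.
grep 'mfi1: COMMAND' /var/log/messages | awk '
    { secs = $(NF-1) }            # "... TIMEOUT AFTER 3959 SECONDS"
    secs < prev { printf "hang ended near %s, lasted %s seconds\n", ts, prev }
    { prev = secs; ts = $1 " " $2 " " $3 }
    END { printf "last hang at %s seconds as of %s\n", prev, ts }
'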
I have a hunch that this occurs when copying large amounts of data (e.g. 0.5 TB) between the drive arrays. I've also seen it occur during heavy database activity on mfi1. That database is mostly used for data warehousing rather than as an online database, so I've waited a bit to attack the problem while other fires demanded my attention.
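To try to confirm the hunch, next time I kick off a big copy I plan to watch per-vdev activity on the DAS pool while it runs. A minimal sketch, where tank0/tank1 and the paths are placeholders for my actual pool names:
Code:
# Log pool-level and per-vdev I/O every 10 seconds while a large copy
# runs, so a hang shows up as throughput dropping to zero mid-copy.
zpool iostat -v tank1 10 | tee /var/tmp/tank1-iostat.log &
cp -R /tank0/bigdata /tank1/bigdata   # the ~0.5TB kind of copy that seems to trigger it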
This morning the server was unresponsive, with the following information in the logs and on screen. A hard shutdown and startup seems to have resolved the problem with no data loss (thanks, ZFS checksumming).
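For what it's worth, scrubbing both pools is a cheap way to confirm there really was no damage after the unclean shutdown, since it forces ZFS to re-read and verify every block. A sketch, with tank0/tank1 again standing in for the real pool names:
Code:
zpool scrub tank0
zpool scrub tank1
zpool status -v      # check for CKSUM errors once the scrubs complete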
Recovered from /var/log/messages; also shown on screen before the panic message below:
Code:
Jan 2 06:29:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5339 SECONDS
Jan 2 06:29:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5369 SECONDS
Jan 2 06:30:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5399 SECONDS
Jan 2 06:30:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5429 SECONDS
Jan 2 06:31:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5459 SECONDS
Jan 2 06:31:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5489 SECONDS
Jan 2 06:32:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5519 SECONDS
Jan 2 06:32:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5549 SECONDS
Jan 2 06:33:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5579 SECONDS
Jan 2 06:33:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5609 SECONDS
Jan 2 06:34:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5639 SECONDS
Jan 2 06:34:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5669 SECONDS
Transcribed from the screen before shutdown; this is not present in any logs.
Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 34
fault virtual address = 0x290
fault code = supervisor read data, page not present
instruction pointer = 0x20
stack pointer = 0x28
frame pointer = 0x28
code segment = base 0x0, limit 0xffff, type 0x16
DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 20376 (ruby)
trap number = 12
panic: page fault
cpuid = 6
Cannot dump. Device not defined or enabled.
I'm at a loss, but it appears to be related to the mfi1 timeouts. I'll add dumps of info that may be helpful.
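One thing the panic screen makes obvious: there is no dump device configured ("Cannot dump. Device not defined or enabled."), so the next panic will be just as hard to analyze. A sketch of what I intend to set up, where the swap device name is a guess for this box:
Code:
# /etc/rc.conf -- let the rc scripts pick the swap device and run savecore at boot
dumpdev="AUTO"

# enable immediately without a reboot; adjust to the actual swap partition
dumpon /dev/mfid0s1b
And since the trouble looks controller-side, the H800's own view of events seems worth pulling as well:
Code:
mfiutil -u 1 show adapter    # controller model and firmware revision
mfiutil -u 1 show events     # controller event log around the hang
mfiutil -u 1 show volumes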