FreeBSD crashes copying large files in VM

Hi,

I've been having a hard time figuring out what is happening to my server. It is a Xeon machine with 64 GB of RAM running FreeBSD 11.0 on ZFS. It has a few jails and four VMs, three Windows and one Linux. The VMs are installed on zvols with default settings (compression and all).

The issue is that when I copy a large folder of files or a large file (say, on my Windows 7 virtual machine), the transfer starts at high speed but continually slows down until the VM stops responding. When this happens, the whole system is unresponsive. It pings, and I can telnet to the ssh port, but ssh won't negotiate. Nothing works; not even the console.

When I reset the machine and look at the logs, I can't find any indication of a crash.

I did read about an issue regarding ARC memory size, so I restricted mine to 40 GB (the virtual machines consume 20 GB in total), but it happened again. I was watching the ZFS statistics and never saw the ARC go above 62% of its maximum size (up until it crashed...).
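In case it matters, I capped it via the vfs.zfs.arc_max loader tunable, roughly like this (value from memory, 40 GB expressed in bytes):
Code:
# /boot/loader.conf -- cap the ZFS ARC at 40 GB (value in bytes)
vfs.zfs.arc_max="42949672960"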

Has anyone seen this behaviour?

Thanks,
Tcn
 
When I reset the machine and look at the logs, I can't find any indication of a crash.
What makes you think it crashed? Sounds to me like "some things work, but anything to do with files doesn't". What virtualization are you using? What is the free space on your zfs pool?
 
Hi aribi,

I say the system crashed because I can't log onto it, even on the console. Nothing responds except pings and socket connections. I don't know how this could be if the computer has crashed, but since nothing works, I can't get any answers from the system... I left it there for a while and it doesn't seem to recover (a while being a few minutes; maybe 5).

The VMs are under bhyve and my pools are almost empty (less than 2% utilization). I'll have a look at the thread, but from your questions it doesn't seem like I have the conditions for this issue.

What troubles me is the way it crashes; it crashes slowly... Network bandwidth gets lower and lower until everything stops. I don't think it is disk activity; I suspected memory instead, but I haven't been able to capture it yet.

Regards,
Tcn
 
I say the system crashed because I can't log onto it, even on the console. Nothing responds except pings and socket connections. I don't know how this could be if the computer has crashed, but since nothing works, I can't get any answers from the system... I left it there for a while and it doesn't seem to recover (a while being a few minutes; maybe 5).
It's not crashed, it's locked up. There is a difference. Apparently anything that requires I/O just hangs (that's why pings still work). This typically happens with broken disks or controllers. One disk may be having issues causing the whole bus to "freeze" until the issue recovers or stops. Anything else that requires I/O would have to wait for this.

The last time I had "freezes" like this it turned out my Promise SATA controller was near death. Not broken enough to not be detected but broken enough to cause all sorts of I/O errors.
 
Hi SirDice,

I understand. But I do not know how I can diagnose this without any output from the system.
The server is very new (I set it up last December). The controller is onboard (SuperMicro X10DRH-IT motherboard), an LSI SAS3008 with 7 disks attached.

SMART does not give me any indication of failure on any disk.
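For what it's worth, this is roughly the check I ran (smartmontools from ports; adjust the device names to your own layout):
Code:
# Health summary and attribute table for each disk behind the SAS3008
for d in da0 da1 da2 da3 da4 da5 da6; do
    smartctl -H -A /dev/$d
done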

I can reproduce the crash; all I need is a way to detect what is happening......

regards,
tcn
 
It is using the mpr(4) driver.

I tried to enable more debug options but there is no hw.mpr in the sysctl...... There is hw.mfi though.... (?)
 
Hmm, the LSI cards in general tend to work really well. So I'm not expecting any issues there. You mentioned it's running on a ZFS pool? How big is that pool? And did you perhaps enable dedup?
 
no, no dedup and the pool is as follows (6 disks in raidz2):
Code:
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank      21.4T  3.85T  17.5T         -     5%    18%  1.00x  ONLINE  -
 
Ok, that looks good too. And 64 GB should be plenty for it. Again, not something I would expect any issues with.

I tried to enable more debug options but there is no hw.mpr in the sysctl...... There is hw.mfi though.... (?)

Are you sure mpr(4) is loaded? That should indeed be the correct driver; I'm not sure why there are hw.mfi sysctls. Maybe it's a bug with the driver.
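Something like this should tell you whether the driver actually attached, and where its sysctl nodes ended up (off the top of my head, so adjust as needed):
Code:
# Is mpr compiled into the kernel or loaded as a module?
kldstat -v | grep -i mpr
# Did the controller attach at boot?
dmesg | grep -i mpr
# Its sysctl nodes may live under dev.mpr rather than hw.mpr
sysctl dev.mpr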


Code:
DEBUGGING
     To enable debugging prints from the mpr driver, set the

           hw.mpr.X.debug_level

     tunable, either in loader.conf(5) or by using sysctl(8).  These bits have
     the described effects:

           0x0001 Enable informational prints (set by default).
           0x0002 Enable prints for driver faults (set by default).
           0x0004 Enable prints for controller events.
           0x0008 Enable prints for controller logging.
           0x0010 Enable prints for tracing recovery operations.
           0x0020 Enable prints for parameter errors and programming bugs.
           0x0040 Enable prints for system initialization operations.
           0x0080 Enable prints for more detailed information.
           0x0100 Enable prints for user-generated commands (IOCTL).
           0x0200 Enable prints for device mapping.
           0x0400 Enable prints for tracing through driver functions.

Try enabling one of those to see if that provides any clues.
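For example (X is the controller unit number, usually 0; I'm not sure off-hand whether the run-time node shows up under hw.mpr or dev.mpr on 11.0, so try both):
Code:
# 22 decimal = 0x0016 = faults (0x0002) + controller events (0x0004) + recovery tracing (0x0010)
sysctl dev.mpr.0.debug_level=22
# or as a boot-time tunable in /boot/loader.conf:
hw.mpr.0.debug_level="0x0016"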
 
Trying to load the module, I get a message saying it is already loaded. It must be built into the kernel, since I can't see it in the module list.

I found dev.mpr in the complete list. I enabled the debug bits from 0x01 up to 0x80 (basically 255), but got just too much output (read, write, synchronize cache, etc.). I removed the verbose bit (0x80) and now I don't get any messages.

I guess I'll have to make it stall again to see if I get more output on the serial console... I'll have to wait until tonight to do this though.
 
I think you'll need to drop the ARC limit even more - try 20GB and see if that helps (this will give the kernel itself some more overhead).
 
I am currently attempting to stall the machine; it is not completely stalled yet, as my top is still running in the console.
Nothing answers anymore; pings still work, but the ssh session is not responding, nor is the serial console.
If I type a character in the console, it freezes (I just did, and top is now hung).

From what I can see, ARC is low (about 7 GB), memory is fine, and swap is barely used. I guess ARC and memory are out of the equation.

There is something weird, though. Two VMs are using 100% of processing time, which is not right. Only one of them is in use right now, and during the copy it did go up to 100% processing time. I just can't explain why another VM would also take 100% processing time, as they usually fall under 10% when idle.

I am starting to think it is a VM issue...
 

Attachment: top.jpg (200.6 KB)
It might be a combination of factors. The VMs may be deadlocking trying to read something from disk causing them to consume 100% CPU. But to rule them out as a possible cause you could try testing the machine with the VMs turned off.

All in all the problems you're describing are what I would expect to see if there are disk or controller issues. As long as things are in memory they'll continue to work but anything that requires loading simply stalls and ends up waiting forever.

You already looked at the SMART status of the drives, but did you also run the long and short tests?
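Something along these lines, assuming smartmontools and /dev/daX device names:
Code:
# Short self-test (a couple of minutes, runs in the background)
smartctl -t short /dev/da1
# Long self-test (takes hours, reads the entire surface)
smartctl -t long /dev/da1
# Results show up in the self-test log afterwards
smartctl -l selftest /dev/da1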
 
With all debugging features enabled in the mpr(4) driver, nothing came up on the serial console, nor on the regular console.
I also tried copying a large file through NFS without issue (right before I stalled the system).

As for SMART tests, no, I haven't, because I don't know the implications of running a short or long test on a RAID... Isn't it like taking the disk offline from the pool for a while?

Could it be network instead of disks?
 
As for SMART tests, no, I haven't, because I don't know the implications of running a short or long test on a RAID... Isn't it like taking the disk offline from the pool for a while?
To be honest I don't know the details, but we have several machines running with RAID and smartd runs a daily short test and a weekly long test; it has never caused issues with the RAID sets. I've also triggered both long and short tests on-the-fly and never encountered problems (besides the disks themselves being broken or on the brink of total collapse).
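From memory our smartd.conf looks roughly like this; check smartd.conf(5) for the exact schedule syntax:
Code:
# /usr/local/etc/smartd.conf (sketch)
# Monitor all detected devices, short test daily at 02:00,
# long test every Saturday at 03:00
DEVICESCAN -a -s (S/../.././02|L/../../6/03)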

Could it be network instead of disks?
Not likely as you're also having those issues when trying to login on the console.

And of course, while checking, my home server started acting up in a similar fashion but not as hard as yours. Turns out one of my disks is near death.
Code:
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): Retrying command (per sense data)
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): READ(6). CDB: 08 00 00 00 01 00
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Jan 10 13:01:46 molly kernel: (da1:mps0:0:1:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)

Trying to read the disk's SMART data causes this:
Code:
Jan 11 11:31:40 molly kernel: (pass1:mps0:0:1:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 185 command timeout cm 0xfffffe0000c962d0 ccb 0xfffff8010e260000
Jan 11 11:31:40 molly kernel: (noperiph:mps0:0:4294967295:0): SMID 61 Aborting command 0xfffffe0000c962d0
Jan 11 11:31:40 molly kernel: mps0: Sending reset from mpssas_send_abort for target ID 1
Jan 11 11:31:44 molly kernel: mps0: Unfreezing devq for target ID 1
Jan 11 11:31:44 molly smartd[1210]: Device: /dev/da1 [SAT], failed to read SMART Attribute Data

Pool is now in a degraded state and I need to replace the disk a.s.a.p.
 
Not likely as you're also having those issues when trying to login on the console.
true; I tend to forget important details like that.....

I will start some SMART tests on the devices then and see what comes up.

Pool is now in a degraded state and I need to replace the disk a.s.a.p.
Things happen for a reason I guess.... While helping me, you found an issue before it became a problem ;)
 
I have an update; unfortunately, not one that closes this issue...

SMART tests (short and long) didn't report any error. I am still clueless.
 
Just thinking out loud:
- you have quite a lot of memory.
- the I/O you generate in the VM is probably less than the amount of free memory.
- somewhere along the line some resource seems to be depleted, stalling all I/O.
- your ZFS is on disk(s) using the mpr driver.
From reading mpr(4) it seems that some resources of this driver are dynamic, i.e. allocated on the fly. During high I/O load one of these might be exhausted. Perhaps you can do some monitoring of these tunables (where X is your controller).
Something along the lines of:
Code:
% while (1)
> sleep 1
> sysctl dev.mpr.X.io_cmds_highwater
> sysctl dev.mpr.X.io_cmds_active
> sysctl dev.mpr.X.chain_alloc_fail
> sysctl dev.mpr.X.chain_free_lowwater
> end
Modify for your shell of choice and run it on the console, then start bulk I/O in your VM, just to see if one of these values goes through the roof.
Obviously this snippet will also stall when the problem occurs, but it should still give you a hint on what maximum values to set.
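If you prefer plain sh, the same idea would look something like this:
Code:
# Poll the mpr resource counters once a second (replace X with your unit number)
while true; do
    sysctl dev.mpr.X.io_cmds_highwater \
           dev.mpr.X.io_cmds_active \
           dev.mpr.X.chain_alloc_fail \
           dev.mpr.X.chain_free_lowwater
    sleep 1
done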
 
Aribi,

This is why I like these forums; I never think about checking these things.

I just did a small test; I started a large transfer in the VM while running your small data logger. I didn't see anything wrong there... The high-water mark is at 136 and I didn't see anything go higher than 105.

This time, instead of letting it crash (because I started to see the network speed getting lower and lower), I decided to cancel the transfer. The VM crashed. I couldn't communicate with it and could not ping it either. I had to power it off and back on; the host system was still fine.

I have seen a couple of threads talking about random freezes like what I am experiencing: people with different hardware but the same behaviour under different circumstances... Something is amiss...

Tcn
 
So it's not the I/O buffers; how about checking the interrupt rates? What strikes me is that your network response suffers from disk I/O, and the interrupt system is common ground for the two. I recall similar problems (I/O stalling) in early implementations of MSI-X. This might help you view interrupt activity:
vmstat -i -C 60 | tee /tmp/vmstat-i.out
This gives you 60 seconds to trigger the problem within a VM and see which interrupts are being used.
Another one would be:
systat -vmstat 1
It will show the interrupt activity on the right side.
Depending on what you find (extremely high disk-io interrupt rate) you can try disabling msi-x as per mpr(4). This will throttle your disk system, and leave some time for other interrupts to occur.
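If memory serves, the tunable is named like in the sibling mps(4) driver, so double-check the exact name in mpr(4) before relying on it; something like:
Code:
# /boot/loader.conf -- fall back from MSI-X to legacy/MSI interrupts for the controller
# (tunable name assumed from the mps(4) driver; verify against mpr(4))
hw.mpr.disable_msix="1"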
 
Hold on... This makes me think of something... What does hw.intr_storm_threshold do, exactly? I have read FreeBSD optimization threads that recommended setting this value. I have it at 9000.

I will do the test as you recommended and come back with results.
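For reference, it's a plain sysctl, so it's easy to check and to put back while testing (1000 is the default, if I remember correctly):
Code:
# Current value
sysctl hw.intr_storm_threshold
# Set it back to the stock value for the duration of the test
sysctl hw.intr_storm_threshold=1000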
 
Ok; I got the VM to crash but managed to save the system again.
I can't attach the vmstat output; I always get an error, even when it's compressed.
I'm not sure what to look for... The highest number of interrupts seen on mpr0 is 6910.

This is the output of systat when the VM is gone.
I also have to mention that the VM is copying files from the server that hosts it, over CIFS (Samba).

One thing strikes me in the output below: da1 is at 94% while the others are more around 50%. I ran the test another time, looking more closely at the disks, and I don't find this alarming, as I have seen some disks at 100% while others were at 50%... No correlation there either.


Code:
    2 users    Load  6.28  4.00  2.05                  Jan 15 10:40
   Mem usage:  0k%Phy  3%Kmem
Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  22761k   31304  198919k   227172    1832  count     9           32   253
All  22779k   46348  199974k   827204          pages     9           32   506
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt     27 ioflt 27693 total
  2   4      1k  74  415k  53k 172k 3901  361 2539    169 cow         uart1 3
                                                      245 zfod        acpi0 9
18.7%Sys   0.0%Intr  3.9%User  0.0%Nice 77.4%Idle         ozfod       ehci0 ehci
|    |    |    |    |    |    |    |    |    |           %ozfod  1260 cpu0:timer
=========>>                                            29 daefr  3895 mpr0 264
                                         3 dtbuf        1 prcfr     1 ix0:q0 265
Namei     Name-cache   Dir-cache   1163143 desvn    96912 totfr       ix0:q1 266
   Calls    hits   %    hits   %   1073018 numvn          react       ix0:q2 267
     249     244  98                904737 frevn          pdwak     3 ix0:q3 268
                                                      326 pdpgs     1 ix0:q4 269
Disks   da0   da1   da2   da3   da4   da5   da6        44 intrn       ix0:q5 270
KB/t  42.38 24.83 71.28 66.04 52.96 34.37  0.00  42467848 wire      1 ix0:q6 271
tps     451   704   834   502   880   570     0  20895040 act         ix0:q7 272
MB/s  18.68 17.08 58.07 32.38 45.52 19.11  0.00   1828372 inact       ix0:link
%busy    24    94    51    58    56    61     0           cache       xhci0 283
                                                     1832 free        ahci0 284
                                                   169096 buf     957 cpu1:timer
                                                                 1005 cpu19:time
                                                                  959 cpu20:time
                                                                  956 cpu5:timer
                                                                  953 cpu17:time
                                                                  958 cpu11:time
                                                                  970 cpu2:timer
                                                                  979 cpu10:time
                                                                  955 cpu23:time
                                                                  957 cpu8:timer
                                                                 1093 cpu18:time
                                                                 1138 cpu7:timer
                                                                  946 cpu15:time
                                                                  960 cpu6:timer
                                                                  960 cpu21:time
                                                                 1030 cpu9:timer
                                                                  957 cpu13:time
                                                                 1004 cpu4:timer
                                                                  956 cpu16:time
                                                                  952 cpu3:timer
                                                                  940 cpu14:time
                                                                  972 cpu22:time
                                                                  975 cpu12:time
 
Three observations:

1. A very high number of context switches (Csw): 415k on 28k interrupts. Could it be that you over-committed on cores for your VMs (i.e. assigned more cores to the VMs than are actually available)? As a rule of thumb you need 0.5 GHz of CPU core per 1 Gbit of network bandwidth, so for your 10 Gbit you would need ~4 cores available. Add 2 for the OS and ZFS and you start to overcommit when the total of VM cores is more than 12 (for an 18-core system). In this situation I would never hand more than 12 cores to a single VM (see the quick check below).
2. Free memory seems low, but there is very little paging, so probably no real concern there.
3. Your mpr seems not to use MSI interrupts, so all your disk I/O is essentially handled by a single core.

1 and 3 combined might lead to a situation where free cores are scarce and the mpr interrupt handler needs to be relocated very often.
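A quick way to compare what you handed out against what the host actually has (how the vCPU count shows up depends on how you start the VMs, but with plain bhyve it is the -c argument on each command line):
Code:
# Logical CPUs the host sees
sysctl hw.ncpu
# vCPU count per running VM: look at the -c argument of each bhyve process
ps -axww | grep '[b]hyve'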
 
I have a 10-core CPU. So if I follow you, I should in fact not assign more than 4 cores across all of them, keeping 6 cores for the main system. Then I'm in big trouble: I assigned 10 virtual CPUs across my VMs (2 each for three of them and 4 for one).

Memory is no issue for the VMs; I think I even gave them a bit too much and am reviewing it. 4 GB each might be overkill; 3 or even 2 might be enough for some of them.

As for MSI interrupts: they are not disabled in loader.conf, so I guess they are just not used.
 