Performance issues with PERC 730 RAID controller after upgrading to FreeBSD 13.1

We have a Dell PowerEdge R530 with a PERC H730P Mini RAID controller (RAID-5 with three disks, UFS file system). Firmware is up to date.
After upgrading from FreeBSD 12.3 to 13.1-p2, the system sometimes seems to hang for a while: the load goes up and top shows almost 100% system CPU time.
This can be reproduced, for example, by repeatedly running

$ /bin/rm -f -r /tmp/usr/ports/sysutils/webmin/work

which deletes 117086 files and directories. It usually takes several tries to trigger the problem.
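
For anyone who wants to try this without a ports work directory at hand, here is a minimal sketch that builds a synthetic tree of small files and times its removal. The helper names make_tree and timed_rm, the /tmp/rmtest path, and the tree sizes are made up for illustration and are not from the original setup; whether a synthetic tree triggers the same behaviour is untested.

```shell
#!/bin/sh
# Hypothetical helper: create $2 directories under $1, each holding $3 empty files.
make_tree() {
    root=$1; dirs=$2; files=$3
    d=0
    while [ "$d" -lt "$dirs" ]; do
        mkdir -p "$root/d$d" || return 1
        f=0
        while [ "$f" -lt "$files" ]; do
            : > "$root/d$d/f$f" || return 1
            f=$((f + 1))
        done
        d=$((d + 1))
    done
}

# Hypothetical helper: remove the tree $1 and print the elapsed whole seconds.
timed_rm() {
    start=$(date +%s)
    /bin/rm -f -r "$1"
    end=$(date +%s)
    echo $((end - start))
}

# Example rounds (sizes are illustrative only):
# for i in 1 2 3 4 5; do
#     make_tree /tmp/rmtest 1000 100
#     echo "round $i: $(timed_rm /tmp/rmtest)s"
# done
```

Watching top in a second terminal while the rounds run should show whether system time climbs in the same way.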

Code:
$ top -a
last pid: 87619;  load averages:  1.53,  0.96,  0.76                                                                                                     
968 processes: 9 running, 959 sleeping
CPU:  0.5% user,  0.0% nice, 99.4% system,  0.0% interrupt,  0.0% idle
Mem: 2558M Active, 47G Inact, 2583M Laundry, 3341M Wired, 508M Buf, 6881M Free
Swap: 16G Total, 144M Used, 16G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 9822     88       41  20    0  4324M   762M select   0  34:13 246.28% /usr/local/libexec/mysqld --defaults-extra-file=/usr/local/etc/mysql/my.cnf --basedir=/usr/local --datadi
86419 www           1  85    0   190M    65M CPU6     6   0:09  83.21% /usr/local/sbin/httpd -DNOHTTPACCEPT
86438 www           1  40    0   330M   129M select   1   0:03  72.91% /usr/local/sbin/httpd
87619 root          1  52    0    15M  5020K RUN      0   0:07  71.87% /bin/rm -f -r /tmp/usr/ports/sysutils/webmin/work
85062 www           1  39    0   225M    48M range    4   0:02  56.22% /usr/local/sbin/httpd -DNOHTTPACCEPT
86084 www           1  30    0   330M   134M CPU4     4   0:02  41.69% /usr/local/sbin/httpd
87334 www           1  33    0   231M    63M CPU3     3   0:02  39.70% /usr/local/sbin/httpd -DNOHTTPACCEPT

$ grep ^da0 /var/run/dmesg.boot
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730P Mini 4.30> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 001036e80c651ebc1f0084897fa06d86
da0: 150.000MB/s transfers
da0: 102400MB (209715200 512 byte sectors)


$ pciconf -lv | grep -A 4  mrsas
mrsas0@pci0:1:0:0:      class=0x010400 rev=0x02 hdr=0x00 vendor=0x1000 device=0x005d subvendor=0x1028 subdevice=0x1f47
    vendor     = 'Broadcom / LSI'
    device     = 'MegaRAID SAS-3 3108 [Invader]'
    class      = mass storage
    subclass   = RAID

gstat does not show a significant disk load.
hw.mfi.mrsas_enable="1" is set in loader.conf.

I have already searched through several forums but could not find a clue.
 
Some steps you can try to diagnose such a problem:
Reseat all connectors/cables.
Check the hard disk SMART status.
Enable the mrsas driver debug options and look for I/O timeouts or online controller resets.
During high load, check dev.mrsas.X.fw_outstanding.
Check whether there are any changes to the mrsas driver in FreeBSD CURRENT (PR).
Check whether there is new firmware for the DELL PERC H730P Mini 4.30, and compare the changes between your installed firmware and the latest one to see what has been fixed.
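
The fw_outstanding check can be sketched as a small sampling loop, assuming the controller is unit 0 (mrsas0). The function names sample_outstanding and peak are hypothetical, and the one-second interval is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print "HH:MM:SS value" once per second, $1 times,
# for controller unit $2 (defaults to 0, i.e. mrsas0).
sample_outstanding() {
    n=$1; unit=${2:-0}
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s %s\n' "$(date +%H:%M:%S)" \
            "$(sysctl -n "dev.mrsas.$unit.fw_outstanding")"
        sleep 1
        i=$((i + 1))
    done
}

# Hypothetical helper: read "time value" lines on stdin, print the largest value.
peak() {
    awk '$2 > max { max = $2 } END { print max + 0 }'
}

# Example: sample for 60 seconds while the load is high, keep the log,
# and print the peak:
# sample_outstanding 60 | tee /tmp/fw_outstanding.log | peak
```

A peak that stays near zero during a hang would suggest the controller is not where requests are piling up.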
 
Keep in mind that 12 used an imported and modified ZFS, while 13.x switched to OpenZFS. I'm not sure it makes much difference performance-wise, but have you upgraded the pool yet? Make sure you update the bootloader(s) before upgrading your pool, though. Pre-13.0 bootloaders are not able to boot from OpenZFS pools. The upgrade process (freebsd-update(8) or build(7)) won't do this; you have to do it yourself.
 
He uses UFS, not ZFS.
 
I've got a few Dells (R430s and R640s) with H730s, upgraded to 13.1, UFS, RAID5, and have not seen anything like this (so far). A few years ago, on some machines, I could get processes into the "suspfs" state, which seemed to be caused by the RAID controller falling too far behind on writes; it eventually caught up. It doesn't look like you've got a process in that state.

Have you tried installing MegaCli to see if there are any clues in there? You can use it to check the controller logs as well.
 
Thanks for all the feedback so far.

dev.mrsas.X.fw_outstanding is 0 or very low during high load. Enabling mrsas driver debugging does not work for me:

$ sysctl hw.mrsas.0.debug_level
sysctl: unknown oid 'hw.mrsas.0.debug_level'

After some more tests I also see that some processes remain in the getblk state for as long as the system hangs:

9822 23 88 41 20 0 4629M 1098M select 7 96:04 82.46% /usr/local/libexec/mysqld --defaults-extra-file=/u
69493 13 www 1 52 0 198M 68M RUN 7 0:04 55.43% /usr/local/sbin/httpd -DNOHTTPACCEPT
70260 23 www 1 52 0 192M 65M getblk 7 0:08 53.67% /usr/local/sbin/httpd -DNOHTTPACCEPT
68674 7 www 1 33 0 204M 27M lockf 3 0:02 52.55% /usr/local/sbin/httpd -DNOHTTPACCEPT
70665 20 root 1 52 0 12M 3132K getblk 7 0:10 52.44% /bin/rm -f -r /tmp/usr/ports/sysutils/webmin/work
67076 7 www 1 31 0 204M 28M CPU6 6 0:01 48.36% /usr/local/sbin/httpd -DNOHTTPACCEPT
11927 33 root 1 52 0 12M 744K getblk 7 2:03 41.27% supervise ftp
64758 36 www 1 52 0 505M 189M getblk 7 0:34 37.20% /usr/local/sbin/httpd -DNOHTTPACCEPT
70680 34 root 1 80 0 46M 30M CPU4 4 0:05 36.77% /usr/local/lib/webmin/webmincron/webmincron.pl (perl)
70375 15 www 1 41 0 328M 122M select 7 0:03 35.78% /usr/local/sbin/httpd
69498 13 www 1 52 0 190M 28M getblk 7 0:04 34.11% /usr/local/sbin/httpd -DNOHTTPACCEPT

megacli does not show any problems, but after installing smartmontools I see one disk with a rather high non-medium error count: 18916.
This value does not rise even after several more high-load episodes, so I doubt that's the cause.
Since the problem started immediately after upgrading to 13.1, I don't really suspect a hardware issue anyway.
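
To check whether the counter stays flat across load episodes, one could pull the value out of the smartctl output before and after each run. This is only a sketch: the function name non_medium_errors is made up, and the device path in the usage comment is a placeholder that has to be replaced with whatever device node the physical disk is reachable through on the system.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl -a output on stdin and print the number
# from the "Non-medium error count:" line.
non_medium_errors() {
    awk -F: '/^Non-medium error count/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example (device path is a placeholder):
# before=$(smartctl -a /dev/passN | non_medium_errors)
# ... run the load test ...
# after=$(smartctl -a /dev/passN | non_medium_errors)
# echo "non-medium errors: $before -> $after"
```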
 
I have. But this is a RAID5 with 4TB NLSAS disks, and I would rather not break it because of the risks involved. The machine itself is due to be replaced in 2023. I was just worried that the load problem might be a general issue with FreeBSD 13.1 and that particular controller, but as richardtoohey2 writes, his machines are running fine.
 
Sample machine - I've NOT yet tried deleting hundreds of thousands of files - it's a production machine, but I've got a T330 somewhere that I can have a look at.

Code:
# grep ^da0 /var/run/dmesg.boot
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730 Mini 4.27> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 009e735e1eb80c2628009e29b8a06d86
da0: 150.000MB/s transfers
da0: 7628800MB (15623782400 512 byte sectors)
da0 at mrsas0 bus 0 scbus0 target 0 lun 0
da0: <DELL PERC H730 Mini 4.27> Fixed Direct Access SPC-3 SCSI device
da0: Serial Number 009e735e1eb80c2628009e29b8a06d86
da0: 150.000MB/s transfers
da0: 7628800MB (15623782400 512 byte sectors)

# pciconf -lv | grep -A 4  mrsas
mrsas0@pci0:1:0:0:    class=0x010400 rev=0x02 hdr=0x00 vendor=0x1000 device=0x005d subvendor=0x1028 subdevice=0x1f49
    vendor     = 'Broadcom / LSI'
    device     = 'MegaRAID SAS-3 3108 [Invader]'
    class      = mass storage
    subclass   = RAID

# freebsd-version -ruk
13.1-RELEASE-p2
13.1-RELEASE-p2
13.1-RELEASE-p2
                                    
# df -h
Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/da0p2    7.0T    2.5T    4.0T    38%    /
devfs         1.0K    1.0K      0B   100%    /dev

Sample MegaCli output:

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :System
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 7.275 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 1.818 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 5
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No



Exit Code: 0x00
 
Thanks a lot. There is a similar machine that has to be upgraded. I have decided to upgrade that one to 12.4 once it is released and see what happens.
 
I upgraded a similar machine to 12.4, and the problem did not appear afterwards. Meanwhile, I also replaced the affected server with a new R450 containing a PERC H745. The issue has not occurred since, so I guess there must have been some hardware problem with the old device.
Thank you all very much.
 