Disk performance issue

I'm at my wits' end here, so any thoughts or advice would be greatly appreciated, no matter how far-fetched!

This is under 8.2-STABLE, I can provide whatever output is useful.

I'm running into performance problems with two SATA drives connected via an LSI 1068 controller. There are a total of 7 drives on this HBA; 5 of them perform normally (100-120 MB/s sequential read/write), but two of them perform very poorly.

When running
[cmd=]dd bs=1m of=/dev/null if=/dev/da6[/cmd]
in gstat I see
Code:
dT: 5.505s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
     1    122    122  15626    8.1      0      0    0.0   99.2| da6

High busy, high latency, low throughput. The other drives (da1 for example) behave normally:

Code:
dT: 5.505s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    742    742  94923    1.3      0      0    0.0   96.6| da1
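In case it matters, I was just watching the device in gstat while the dd ran, roughly like this (the -f filter only if your gstat has it):

Code:
gstat -f '^da6$'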

I pulled both drives, connected them to my Windows 7 laptop via eSATA, and benchmarked them; they performed normally, with around 100 MB/s sequential read and write. I have checked the SMART attributes: nothing out of the ordinary. I have tried swapping them around in the hot-swap bays and changing the physical connections to the HBA, to no avail. I have tried everything I can think of; I need new ideas, no matter how random.
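For what it's worth, the SMART check was nothing fancy, roughly this (the -d sat may or may not be needed, depending on how the HBA passes the ATA commands through):

Code:
smartctl -a -d sat /dev/da6
and the same for the other slow drive.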

If the HBA were the problem, whether hardware or software, then the other 5 drives would exhibit the same problems, right?

I have seen only two things that pique my curiosity, but I don't know if they matter.

The first is the GEOM output for those two drives. There is a "Mode" value; they have r0w0e0 where all the other drives have r1w1e1. However, this could simply be because they are not part of a zpool and all the other drives are?
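For reference, those Mode values come from the GEOM provider info; as far as I understand, the r/w/e numbers are just open counts from whatever currently has the provider open, which would fit the zpool explanation:

Code:
geom disk list da6 da1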

The second is the [cmd=]camcontrol devlist[/cmd] output:

Code:
<ATA SAMSUNG HD154UI 1118>         at scbus0 target 0 lun 0 (da0,pass0)
<ATA SAMSUNG HD154UI 1118>         at scbus0 target 1 lun 0 (da1,pass1)
<ATA SAMSUNG HD154UI 1118>         at scbus0 target 3 lun 0 (pass6,da6)
<ATA SAMSUNG HD154UI 1118>         at scbus0 target 4 lun 0 (pass2,da2)
<ATA ST31500541AS CC34>            at scbus0 target 5 lun 0 (da3,pass3)
<ATA Hitachi HDS5C301 A580>        at scbus0 target 6 lun 0 (da4,pass4)
<ATA Hitachi HDS5C301 A580>        at scbus0 target 7 lun 0 (da5,pass5)

The drives that perform normally are (da,pass) but the two behaving oddly are (pass,da). Does that mean anything?
 
Updates

I pulled the drive again and connected it to my laptop via eSATA. After formatting it (NTFS, 4k clusters) it performed just fine. I put it back in my server, mounted the NTFS partition, and did sequential reads with dd: still very slow. I wiped the drive, ran newfs against it, and repeated the tests: still slow. I tried bonnie++ on it: still very slow. I connected the drive to one of the onboard SATA ports: still slow. What in the name of all that is holy is going on here!
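For completeness, the NTFS test on the server was roughly this (slice number and file name from memory, so treat them as illustrative):

Code:
# slice and file name are illustrative
mount -t ntfs /dev/da6s1 /mnt
dd if=/mnt/somefile of=/dev/null bs=1m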
 
It's not guaranteed that all the controller ports are the same. Try swapping one of the "slow" drives onto one of the "fast" ports.
 
I tested disk performance last summer with a Fujitsu server with an LSI controller and 15k SAS disks versus a Supermicro with 7.2k SATA2 disks attached to the onboard ports. Performance with the SATA disks was better. I concluded then that the LSI controller somehow slows performance, even though I did not use a RAID setup but ran it in transparent mode.
 
wblock@ said:
It's not guaranteed that all the controller ports are the same. Try swapping one of the "slow" drives onto one of the "fast" ports.
I have swapped the drive to:

Another port on the same controller (slow)
Another port on the motherboard (slow)
Another port in a different physical machine (fast)

zodias said:
I tested disk performance last summer with a Fujitsu server with an LSI controller and 15k SAS disks versus a Supermicro with 7.2k SATA2 disks attached to the onboard ports. Performance with the SATA disks was better. I concluded then that the LSI controller somehow slows performance, even though I did not use a RAID setup but ran it in transparent mode.
I too am using the controller in transparent mode, flashed with IT firmware. My issue is not that the drives are slower by some small margin but that they are extremely slow. What confounds me is that I have multiple drives of the exact same make and model connected to the same controller; one performs as expected and the other is dog slow. You would think this means there is a problem with the disk, but if I pull that disk and attach it to my laptop via eSATA it performs normally.

I have tried every combination I can think of; the only conclusion I have reached is that somehow this is a problem with my installation of FreeBSD. I think my next step is to boot the server from a live CD and try some benchmarking: essentially, temporarily trying another OS on the same hardware.
 
Humor me:

Code:
diskinfo -ctv /dev/<disk>
on both the slow ones and then maybe on two normal ones. Compare them. Anything?
 
da1 said:
Humor me:

Code:
diskinfo -ctv /dev/<disk>
on both the slow ones and then maybe on two normal ones. Compare them. Anything?

Oh yes, big differences, which I'm not surprised to see.

Code:
/dev/ad10
        512             # sectorsize
        1500301910016   # mediasize in bytes (1.4T)
        2930277168      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        2907021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S1XWJD1ZB02558  # Disk ident.

I/O command overhead:
        time to read 10MB block      1.227767 sec       =    0.060 msec/sector
        time to read 20480 sectors   2.906722 sec       =    0.142 msec/sector
        calculated command overhead                     =    0.082 msec/sector

Seek times:
        Full stroke:      250 iter in  14.213742 sec =   56.855 msec
        Half stroke:      250 iter in  13.308107 sec =   53.232 msec
        Quarter stroke:   500 iter in  25.597328 sec =   51.195 msec
        Short forward:    400 iter in  21.291186 sec =   53.228 msec
        Short backward:   400 iter in  24.214398 sec =   60.536 msec
        Seq outer:       2048 iter in   0.363503 sec =    0.177 msec
        Seq inner:       2048 iter in   0.363030 sec =    0.177 msec
Transfer rates:
        outside:       102400 kbytes in  13.925104 sec =     7354 kbytes/sec
        middle:        102400 kbytes in  14.299191 sec =     7161 kbytes/sec
        inside:        102400 kbytes in  15.473790 sec =     6618 kbytes/sec

Code:
/dev/ad1
        512             # sectorsize
        1500301910016   # mediasize in bytes (1.4T)
        2930277168      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        2907021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S1XWJX0B200064  # Disk ident.

I/O command overhead:
        time to read 10MB block      0.152640 sec       =    0.007 msec/sector
        time to read 20480 sectors   3.118179 sec       =    0.152 msec/sector
        calculated command overhead                     =    0.145 msec/sector

Seek times:
        Full stroke:      250 iter in   7.016183 sec =   28.065 msec
        Half stroke:      250 iter in   6.997274 sec =   27.989 msec
        Quarter stroke:   500 iter in  10.980527 sec =   21.961 msec
        Short forward:    400 iter in   2.576808 sec =    6.442 msec
        Short backward:   400 iter in   2.027736 sec =    5.069 msec
        Seq outer:       2048 iter in   0.374920 sec =    0.183 msec
        Seq inner:       2048 iter in   0.373581 sec =    0.182 msec
Transfer rates:
        outside:       102400 kbytes in   1.012591 sec =   101127 kbytes/sec
        middle:        102400 kbytes in   1.201352 sec =    85237 kbytes/sec
        inside:        102400 kbytes in   2.103531 sec =    48680 kbytes/sec
 
I don't seem to be able to edit my own posts, so for clarification: ad10 is the one behaving abnormally; ad1 is behaving normally. Both are connected to onboard ports, and both are the same make and model. If I were to remove ad10 and connect it to another machine, my laptop with Windows 7 for example, it would perform normally.
 
If you connect it to the laptop, boot with mfsBSD, and diskinfo shows terrible seek times like that again, then it's the drive. The reason for using diskinfo on that system is to eliminate benchmark differences. Running a SMART report on the slow drives could be interesting.

It's not unknown for RAID drives of the same age with the same wear to go bad within a few days of each other.
 
wblock@ said:
If you connect it to the laptop, boot with mfsBSD, and diskinfo shows terrible seek times like that again, then it's the drive. The reason for using diskinfo on that system is to eliminate benchmark differences. Running a SMART report on the slow drives could be interesting.

It's not unknown for RAID drives of the same age with the same wear to go bad within a few days of each other.

I added mfsBSD 8.2-RELEASE amd64 to my PXE server, booted the laptop from it with the hard drive connected, and ran diskinfo -t on it. It performed normally. I was in a hurry and had no easy way to copy the text, but it showed ~90 MB/s reads and normal latency.

My next test will be to boot my server with mfsBSD and test the disk that way.

So far what I have determined is:

It is not the physical drive.
It is not the physical controller.
It is not the SATA cable.

Could it be some installed software and/or service? Is there a way I can monitor the drive at a very low level to see if something is interfering? Something that would not show up in iostat, diskinfo -t, top -m io, etc.? My kernel has DTrace support; might there be a way to further diagnose this via DTrace?
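Something along these lines is what I had in mind, if the io provider even exists on 8.2 (untested, so take it as a sketch): a per-device latency histogram while the dd runs, to compare per-I/O latency between the slow drive and a fast one:

Code:
# untested sketch; requires the DTrace io provider
dtrace -n 'io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ { @[args[1]->dev_name] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'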
 
wblock@ said:
sysutils/ataidle comes to mind as something that could affect certain drives. camcontrol(8), too.

I did not have sysutils/ataidle installed, so that wouldn't be it. Though I did install it, since I noticed it provides a way to enable/disable APM and AAM. Both were disabled, by the way. I enabled both with maximum-performance values; no change to the drive's behavior.
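For the record, the ataidle invocation was along these lines (from memory, and the device node is from when the drive was on an onboard port; 254 should be the maximum-performance value for both APM and AAM):

Code:
# device node and values from memory
ataidle -P 254 -A 254 /dev/ad10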

I'm leaving work soon; the first thing I'm going to do is boot that box with mfsBSD and see if the problem persists. I really appreciate your and everyone else's suggestions, and hopefully I can get this resolved soon.
 
One more question. When thinking of the I/O stack, from the application down to the physical hard drive, are there any layers between the device driver (ata, mpt) and the physical drive itself?
 
wblock@ said:
Sorry, I don't know. Another question: do the fast and slow drives have the same firmware version?

Yes, they do. So, I booted the server into mfsBSD from a USB thumb drive. The drive still performed poorly. Looking at the SMART attributes, I'm starting to see some bad values after these past few days of torturing it. None of this explains what's going on, but I'm going to run an extended offline test and see what happens.
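The extended test is just the usual smartctl long self-test, roughly this (device node depends on where the drive is attached at the time):

Code:
smartctl -t long /dev/da6
# wait for it to finish, then:
smartctl -l selftest /dev/da6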
 
Terry_Kennedy said:
Note that in at least one instance Samsung has changed the firmware without altering the reported version number.

Yeah, I have 10 of those in a NAS and have updated the firmware on the drives one by one with the "utility" they provided, only to have smartd still scream at you because they haven't changed the revision number. *facepalm* :)

/Sebulon
 
Well, I'm going to chalk this up to a bad drive, I guess. I don't feel 100% about it, but the drive is heading downhill, so rather than bang my head against a wall any longer I'll RMA it. The other Samsungs are starting to show menacing SMART values as well. Steer way, way away from Samsung HD154UI drives :(

Thanks for your input, wblock, I really appreciate it.
 