Not all drives detected - M1015 (flashed to 9211-IT) on FreeBSD 10.0-RELEASE

ers

Member

Reaction score: 1
Messages: 38

Please help with a mysterious disk problem on a system with two M1015 SAS/SATA controllers.

Description of the problem
There are two M1015 SAS/SATA controllers (LSI 9240 flashed to 9211-IT) in a SuperMicro board.
Six drives are working in a raidz2 (new_pool) without any problems (4 of them on mps0 and 2 on mps1).
After connecting 8 disks (old_pool1, old_pool2) from another machine, one of the disks was not recognized. It is not shown in dmesg. The old_pool2 was degraded.
Moving this disk to a different location (controller, power, data connector) did not help.

The controllers are recognized and work without a problem. The firmware and BSD driver versions differ. For testing, LSI drivers 16, 17, 18 and 19 were downloaded and installed as modules. No change. :(

Code:
mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xf74c0000-0xf74c3fff,0xf7480000-0xf74bffff irq 16 at device 0.0 on pci1
mps0: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

mps1: <LSI SAS2008> port 0xd000-0xd0ff mem 0xf73c0000-0xf73c3fff,0xf7380000-0xf73bffff irq 17 at device 0.0 on pci2
mps1: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
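
The firmware/driver comparison discussed in this thread can be automated with a minimal sketch like the one below. It assumes the dmesg line format shown above; the function name `version_mismatches` is made up for illustration. It only flags differing major versions. Whether such a difference actually matters is exactly what the thread is debating.

```python
import re

# Matches lines such as: "mps0: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd"
MPS_VERSION = re.compile(
    r"^(mps\d+): Firmware: (\d+)\.[\d.]+, Driver: (\d+)\.[\d.]+"
)


def version_mismatches(dmesg_text):
    """Return {controller: (firmware_major, driver_major)} where the majors differ."""
    result = {}
    for line in dmesg_text.splitlines():
        match = MPS_VERSION.match(line.strip())
        if match:
            controller = match.group(1)
            firmware_major = int(match.group(2))
            driver_major = int(match.group(3))
            if firmware_major != driver_major:
                result[controller] = (firmware_major, driver_major)
    return result


# The two version lines from the dmesg output above.
sample = """\
mps0: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd
mps1: Firmware: 17.00.01.00, Driver: 16.00.00.00-fbsd
"""

print(version_mismatches(sample))
```

On a live system the input would be the saved output of `dmesg` rather than the embedded sample.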

However, after hundreds of tests I managed to test with 2 more drives... and the magic began...
After placing one of the new disks on an M1015 it showed up; the "not seen" one was still not seen.
After replacing the "not seen" disk with a new one, the new one was not seen.
After placing two new disks on an M1015 both showed up; the "not seen" one was still not seen.
After placing them on the motherboard controller the "not seen" one was still not seen.
After removing the 2 new drives and connecting the "not seen" one to the motherboard controller, the disk showed up, but one disk from new_pool disappeared...

It behaves as if the number of disks is reported wrongly to FreeBSD by the mps driver and there is always one missing. I am not sure what to blame: hardware, driver or system? Maybe it is connected with the firmware and BSD driver mismatch? However, I have heard that driver 18 was skipped, but firmware 17 was compatible with driver 16. Is that right? The tests with different LSI drivers (without changing the firmware) did not change much.
Was there an earlier problem with the mps driver enumerating connected drives incorrectly? If so, is the problem fixed for sure?

Is there something I do not know? What should I look for? Who should I talk to about this problem? MAV@?

Known for sure:
- no answer on Google / RTFM / this forum
- new_pool drives (same model_1, size, firmware, SATA-3)
- old_pools drives (same model_2, size, firmware, SATA-2)
- additional drives (same model_3, size, different firmware versions, SATA-3)
- disks are connected directly to the controllers
- all data and power connectors are verified to be okay
- all disks are OK and are seen and work in other machines
- no reallocated sectors, no current pending sectors, SMART okay
- camcontrol devlist always shows the same disks as dmesg
- there is enough power for the disks
- the "not seen" disk is not spinning (not warm after a while and no vibration can be sensed)
 

Terry_Kennedy

Aspiring Daemon

Reaction score: 347
Messages: 979

- "not seen" disk is not spinning (not warm after a while and no vibration can be sensed)
This would seem to indicate that the disk is either defective or is waiting for a command to spin up.

I'm not sure how much diagnostic info is provided by the firmware you flashed to the controller, but I would at least expect it to list the drives that it detected. Normally, a working but not-spun-up drive will report with capacity N/A but will report the model number.

Since this drive is also not seen when connected to the motherboard controller, I'm betting on a drive problem.
 
ers

You did not read carefully. I have checked all the disks and they are OK. That means no errors, no timeouts, writable, readable (connectors and cables as well).
When a disk is seen, it spins up properly and works like a charm.
When it is not seen, it is colder than the rest after some time (not spinning, because it was not initialized by the controller or driver, I think).
In another system all disks work without any problems. I have tested it, believe me.

As I said, the disk is not seen. Just not seen. No S/N, no SMART, nothing... camcontrol does not see that disk. There is no message in dmesg, etc.
Depending on how and how many disks are connected, a different disk is "not seen". However, before, when that disk was seen, it was working properly.
The problem seems to be independent of disk model. Tested on: ST32000542AS, ST4000VN000, ST2000DM001.
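
The reasoning here (is it always the same disk, always the same port, or neither?) can be kept honest with a small bookkeeping sketch. Everything in it is hypothetical: the port labels, the disk labels A/B/C, the trial data and the function name `summarize` are placeholders, not measurements from this system. It cannot prove a cause; it only shows which simple explanations a set of test boots rules out.

```python
def summarize(trials):
    """Classify test boots.

    trials: list of (connected, seen) pairs, where `connected` maps a
    port label to a disk label and `seen` is the set of disk labels the
    OS reported on that boot.
    """
    missing_disks = set()
    missing_ports = set()
    for connected, seen in trials:
        for port, disk in connected.items():
            if disk not in seen:
                missing_disks.add(disk)
                missing_ports.add(port)

    if not missing_disks:
        return "nothing missing"
    # With a single failing boot both sets have one element; the disk is
    # reported first, so more boots are needed to separate the two cases.
    if len(missing_disks) == 1:
        return "same disk every time: suspect the disk"
    if len(missing_ports) == 1:
        return "same port every time: suspect the port or cable"
    return "varies by disk and port: not explained by one disk or one port"


# Hypothetical trial log: three disks moved between ports on two boots.
trials = [
    ({"mps0:0": "A", "mps0:1": "B", "mps1:0": "C"}, {"A", "B"}),
    ({"mps0:0": "C", "mps0:1": "B", "mps1:0": "A"}, {"A", "C"}),
]

print(summarize(trials))
```

A result of "varies by disk and port" is consistent with the enumeration theory, but also with anything else that depends on the total number of attached disks, such as spin-up timing or power.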
 
ers

New information: in FreeNAS they have "not seen" serial numbers, but I am unable to determine whether this is the case here in FreeBSD.
Does anybody have a similar problem here?
 

diizzy

Aspiring Daemon

Reaction score: 190
Messages: 575

I have several M1015 controllers (crossflashed) and they all work fine. In your case you have either incompatible HDDs or defective hardware.
I have no idea what it is, but it could be the card itself, cabling and/or PSU.
This is not a driver issue you're looking at...
//Danne
 
ers

In your case you have either incompatible HDDs or defective hardware.
Why do you say I have incompatible HDDs?
How do you explain that all of them work with my controllers (but not at the same time)?
How do you explain that different HDDs behave the same way? Maybe "incompatibility" is random? :)

As you have read (I hope), I have checked cabling, power and disks. They are all fine.
The PSU is new and overpowered. Half of the disks were working in another system without problems, but on a different motherboard.
Half of them are new. Within each of the two series the disks are identical: the same model, firmware, capacity, etc.
I can make any of them "unseen". I can get any of them to work without a problem.
The only problem is that one of them will be "unseen" when they all work together...

After analyzing the data from the different connection cases, the only logical explanation is that there must be something wrong with the card/firmware/driver.
Please do read the posts above, analyze them and correct me if I am wrong.

On the Internet everyone says that firmware and driver need to match.
I have mismatched firmware 17 -> BSD driver 16.
This is most likely the cause.
Yet the same sources suggest both that firmware 17 with BSD driver 16 should work and that mismatched firmware and driver versions cause problems...
I did not find any "sure" answer.

diizzy, what firmware/BSD driver do you have?
What is your version of FreeBSD?
Are all of your M1015s fully populated?
What HDDs are you using?
 

diizzy


Because HDDs that randomly drop are usually incompatible, and keep in mind that far from all consumer HDDs play nicely with HBA/RAID controllers. There could be some kind of timing/race condition that you only see with fully populated controllers etc. The general recommendation is to stay away from any type of "green" HDD, as these often trigger fault mode because they respond too slowly in sleep mode and usually don't support TLER at all or in a reasonable way.

...and to answer your questions (yes, I know my firmware is old but on this box I can't upgrade the firmware so it's on hold...):

1. Built-in, firmware 17.00.01.00 - driver: mps0: 19.00.00.00-fbsd
2. FreeBSD -HEAD r278472 (9.2+ also works fine)
3. Yes, "true" Hitachi drives, which you can find labeled as Toshiba nowadays. New Hitachi HDDs seem to be relabeled WD HDDs more or less, and they aren't as good as the Toshiba ones (DT01 series).
Here is one controller, but some of these HDDs are getting really old and with that comes reliability...
Code:
mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xe05c0000-0xe05c3fff,0xe0580000-0xe05bffff irq 16 at device 0.0 on pci1
mps0: Firmware: 17.00.01.00, Driver: 19.00.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

da0: <ATA Hitachi HDT72101 A31B> Fixed Direct Access SCSI-6 device
da1: <ATA Hitachi HDT72101 A31B> Fixed Direct Access SCSI-6 device
da2: <ATA Hitachi HDT72101 A31B> Fixed Direct Access SCSI-6 device
da3: <ATA Hitachi HDT72101 A31B> Fixed Direct Access SCSI-6 device
da4: <ATA Hitachi HDP72505 A50E> Fixed Direct Access SCSI-6 device
da5: <ATA Hitachi HDP72505 A50E> Fixed Direct Access SCSI-6 device
da6: <ATA Hitachi HDP72505 A50E> Fixed Direct Access SCSI-6 device
da7: <ATA Hitachi HDP72505 A50E> Fixed Direct Access SCSI-6 device
//Danne
 
ers

Hm... Probably we do not understand each other correctly. My disks do not drop randomly.
They start and work stably until I power off the system and change the connection paths.
During operation one disk is "unseen". When you change the connections and power up the system, another one is "unseen"; the rest are stable.
All drives are stable across restarts and long uptimes.

About "green" disks I can agree (they are from the old machine), but the new ST4000VN000 drives are dedicated to NAS (with all the extras) and they behave the same way...
You have driver 19 from HEAD (16 was from RELEASE), so maybe there was a bug in 16, which is also in my 10.0-RELEASE.
I have not tested BSD driver 19. Maybe it is time to test it? If so, it comes down to the driver again...
 

diizzy


I have no idea about Seagate's NAS HDDs; they seem very new and I can't find much user experience either, unfortunately. In most cases "NAS" seems to refer to 24/7 run-time, not to being HBA/RAID-friendly.
//Danne
 
ers

You do not get it, I think... If the same behavior is observed with different disks, it does not come from the disks (by 99.8%). ;-)
 

diizzy


So you have either broken controller(s), cables, or unreliable PSU(s). If the controller doesn't detect the HDD it's not a driver issue.
//Danne
 
ers

Did you read anything I have posted above? All cables and the PSU were checked and they are OK!
You said it is the disk. Logic shows that it is not a disk. Look above: the disks are OK when running in a different system. They work stably even in the same system. If it were a compatibility issue, then all of them should be unseen. They are all the same model, firmware, etc... This suggests that the enumeration of disks in the controller is probably broken. After a bad enumeration one disk on the controller is disabled. This is most probably due to the controller or driver. From what I read on the Internet, there were serious issues due to firmware/driver mismatch. For some configurations it works (like yours). In my case there is more than one M1015 and they all behave the same way. Magic? The same broken series? Only one thing connects them: the firmware/driver mismatch.
On some SuperMicro motherboards there was an issue where 3 cards did not work together, but two of them always worked. That was on different models of SuperMicro motherboards, so it is not the case here.

You did not write anything to prove that it is a disk problem. Only empty statements.
Please convince me you are right. Show me what to look for to find a solution.
 

diizzy


The firmware is a standalone program; if it doesn't see the disks during boot-up you have a hardware/incompatibility issue, not a driver issue. But let's assume everything works fine and as intended. The fact still stands that not all drives are detected...
//Danne
 
ers

How is it possible that any disk has an incompatibility issue with any of my M1015 controllers? Please explain what you have in mind...
How do I recognize which hardware has it? Your answers are only statements. If you understand the problem, please explain it to me.

Firmware and driver are connected and should be analyzed together when talking about the whole system.
Even if the firmware is OK, when the driver does not work properly hardware can be unseen, because the software builds the device tree to work with.
If there is no reference to the hardware, the system does not see it. The system, not the hardware!
So, if the hardware could see any of the HDDs, then the hardware CAN see them. More probably it is the software that does not.

Of course, if the driver is OK and the firmware is broken, then anything could happen.

Your statement can only be explained by practical experience with the same problem.
If so, you can easily show me where the problem is and how to resolve it.
Why are you not doing that? :)
 

diizzy


I honestly give up on this, you've obviously decided that everything works except that it doesn't.
//Danne
 

User23

Well-Known Member

Reaction score: 68
Messages: 496

Well

Code:
Stephen McConnell 
from Avago Tech (formerly LSI Corp.)
2015-02-19 23:33:35 UTC
"I have sent the changes to my mentors for review and hope to get this committed next week."

sounds promising.
 
ers

diizzy
I do not claim that everything is working, especially since it does not work.
However, the reason you gave is not very likely... You have not provided anything to support it.

Sebulon
Thanks, this looks promising, however it requires time to test. I have a lot of data on those disks which I cannot lose... :)
I am not sure how the described case differs from mine. In it the mps driver shows messages in dmesg. In my case there is no abnormal message, but one disk is not recognized.
I need time to check 10.1-RELEASE for FreeBSD driver 19. The mentioned dev.mps.0.spinup_wait_time is not available in 10.0-RELEASE.
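
Whether a given release exposes that knob can be checked from saved `sysctl dev.mps.0` output. The sketch below only relies on the `name: value` line format that sysctl(8) prints; the helper names are invented for this example, and the sample text is illustrative (its version values are copied from the dmesg output earlier in the thread, not from a real sysctl dump).

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl(8), into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def has_oid(text, oid):
    """Return True if the given OID appears in the sysctl output."""
    return oid in parse_sysctl(text)


# Illustrative sample only; a real dump would contain many more OIDs.
sample = """\
dev.mps.0.firmware_version: 17.00.01.00
dev.mps.0.driver_version: 16.00.00.00-fbsd
"""

print(has_oid(sample, "dev.mps.0.spinup_wait_time"))
```

Running the same check against output captured on 10.0-RELEASE and on a newer release would show directly where the OID first appears.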

Sebulon
User23
Did you have similar problems?
Did you check the spinup_wait patch?
Is it available in 10.1-RELEASE?

Anybody with the same or similar problems with lsi2008/m1015/mps, please write.
Any amount of information can be useful.
 

diizzy


So.... What we possibly have is a hardware incompatibility, as WD HDDs do something other than what they should *tada.wav*. That said, do you get any errors at all? Given your previous posts it doesn't seem so.
//Danne
 
ers

diizzy
Anything to support that? What WD HDD? The mps(4) driver does report itself in dmesg, but there are no error messages at all.

gkontos
Thanks! I have to read this thread carefully. This looks very interesting. If the enumeration process is disturbed by ses devices, this could be a lead...
In my case there is a ses device:
Code:
<AHCI SGPIO Enclosure 1.00 0001> at scbus8 target 0 lun 0 (ses0,pass16)
I have no backplane. All HDDs are connected directly to the controller.
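
To see at a glance which buses carry disks and where ses devices sit, saved `camcontrol devlist` output can be tallied with a sketch like this one. It assumes the line format of the ses entry quoted above. The two `da` lines in the sample are hypothetical (their model strings are borrowed from the dmesg listing earlier in the thread and the bus numbers are invented), and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches: "<ident> at scbusN target T lun L (name1,name2)"
DEVLIST_LINE = re.compile(
    r"^<(.*)>\s+at scbus(\d+) target (\d+) lun (\w+) \(([^)]*)\)"
)


def parse_devlist(text):
    """Return a list of {'ident', 'bus', 'name'} dicts, one per device line."""
    entries = []
    for line in text.splitlines():
        match = DEVLIST_LINE.match(line.strip())
        if not match:
            continue
        names = match.group(5).split(",")
        # The peripheral that is not a passN node names the device (da0, ses0, ...).
        primary = next((n for n in names if not n.startswith("pass")), names[0])
        entries.append(
            {"ident": match.group(1), "bus": int(match.group(2)), "name": primary}
        )
    return entries


def disks_per_bus(entries):
    """Count da/ada disks on each scbus."""
    counts = defaultdict(int)
    for entry in entries:
        if re.fullmatch(r"(da|ada)\d+", entry["name"]):
            counts[entry["bus"]] += 1
    return dict(counts)


# First two lines are hypothetical; the last is the ses entry quoted above.
sample = """\
<ATA Hitachi HDT72101 A31B>  at scbus0 target 0 lun 0 (da0,pass0)
<ATA Hitachi HDT72101 A31B>  at scbus0 target 1 lun 0 (pass1,da1)
<AHCI SGPIO Enclosure 1.00 0001> at scbus8 target 0 lun 0 (ses0,pass16)
"""

entries = parse_devlist(sample)
print(disks_per_bus(entries))
print([e for e in entries if e["name"].startswith("ses")])
```

Comparing the per-bus counts between two boots makes it easier to see which controller lost a disk after recabling.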
 

User23


User23
Did you have similar problems?

I don't have similar problems because I don't have similar hardware. But I need to buy some this year for new ZFS systems.
Sadly a lot of bugs, ZFS and driver/hardware related, remain for a long time.
It feels like every patch or release should fix them, but it doesn't.
As a ZFS user I fear every new release.
 

diizzy


I've been running ZFS since the 7.X days and 9+ has worked really well. FreeBSD worked somewhat decently, but it required a lot of tuning and performance wasn't that great. FreeBSD 10.1 works really well in general with ZFS, so I'm very happy with how it has evolved over the years.
//Danne
 