Other iSCSI / CTLD problems after recent upgrade

ApoC · Mar 4, 2015

Hi,

I just fetched new sources for stable/10 branch, rebuild all as usual and get into serious problems with ctld and iSCSI volume mounted on Windows 8.1. before upgrade everything worked fine, fast and as expected. After upgrade I mount volume under Windows 8.1 I can see and access disk but after a while ctld stops responding, mounted volume is inaccessible and ctld daemon can't be restarted.

My configuration is as follows:

Code:

portal-group pg0 {
        discovery-auth-group no-authentication
        listen 192.168.0.1
}

target iqn.net.mail.nymfe:target0 {
        auth-group no-authentication
        portal-group pg0

        lun 0 {
                path /dev/mirror/gm0
                blocksize 4096
        }
}

Exported device is gmirror of two disks:

Code:

mirror/gm0  COMPLETE  ada0p1 (ACTIVE)
                      ada1p1 (ACTIVE)

Code:

=>        34  3907029101  ada0  GPT  (1.8T)
          34        2014        - free -  (1.0M)
        2048  3907027080     1  freebsd-ufs  (1.8T)
  3907029128           7        - free -  (3.5K)

=>        34  3907029101  ada1  GPT  (1.8T)
          34        2014        - free -  (1.0M)
        2048  3907027080     1  freebsd-ufs  (1.8T)
  3907029128           7        - free -  (3.5K)

In Windows event log, several problems are reported:

Code:

The initiator could not send an iSCSI PDU. Error status is given in the dump data.

Binary data:

In Words

0000: 00380000 00000001 00000000 C0000007
0010: 00000000 00000000 00000000 00000000
0020: 00000000 00000000 C0000241 00710069
0030: 002E006E 0065006E 002E0074 0061006D
0040: 006C0069 006E002E 006D0079 00650066
0050: 0074003A 00720061 00650067 00300074

In Bytes

0000: 00 00 38 00 01 00 00 00 ..8.....
0008: 00 00 00 00 07 00 00 C0 .......À
0010: 00 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 41 02 00 C0 69 00 71 00 A..Ài.q.
0030: 6E 00 2E 00 6E 00 65 00 n...n.e.
0038: 74 00 2E 00 6D 00 61 00 t...m.a.
0040: 69 00 6C 00 2E 00 6E 00 i.l...n.
0048: 79 00 6D 00 66 00 65 00 y.m.f.e.
0050: 3A 00 74 00 61 00 72 00 :.t.a.r.
0058: 67 00 65 00 74 00 30 00 g.e.t.0.

Code:

Connection to the target was lost. The initiator will attempt to retry the connection.

Binary data:

In Words

0000: 00340000 00000001 00000000 C0000014
0010: 00000000 00000000 00000000 00000000
0020: 00000000 00000000 00710069 002E006E
0030: 0065006E 002E0074 0061006D 006C0069
0040: 006E002E 006D0079 00650066 0074003A
0050: 00720061 00650067 00300074

In Bytes

0000: 00 00 34 00 01 00 00 00 ..4.....
0008: 00 00 00 00 14 00 00 C0 .......À
0010: 00 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 69 00 71 00 6E 00 2E 00 i.q.n...
0030: 6E 00 65 00 74 00 2E 00 n.e.t...
0038: 6D 00 61 00 69 00 6C 00 m.a.i.l.
0040: 2E 00 6E 00 79 00 6D 00 ..n.y.m.
0048: 66 00 65 00 3A 00 74 00 f.e.:.t.
0050: 61 00 72 00 67 00 65 00 a.r.g.e.
0058: 74 00 30 00 t.0.

Code:

Target did not respond in time for a SCSI request. The CDB is given in the dump data.

Binary data:

In Words

0000: 003E0000 00000001 00000000 C0000009
0010: 00000000 00000000 00000000 00000000
0020: 00000000 00000000 1802002A 00003B89
0030: 00690001 006E0071 006E002E 00740065
0040: 006D002E 00690061 002E006C 0079006E
0050: 0066006D 003A0065 00610074 00670072
0060: 00740065 0030

In Bytes

0000: 00 00 3E 00 01 00 00 00 ..>.....
0008: 00 00 00 00 09 00 00 C0 .......À
0010: 00 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 2A 00 02 18 89 3B 00 00 *...;..
0030: 01 00 69 00 71 00 6E 00 ..i.q.n.
0038: 2E 00 6E 00 65 00 74 00 ..n.e.t.
0040: 2E 00 6D 00 61 00 69 00 ..m.a.i.
0048: 6C 00 2E 00 6E 00 79 00 l...n.y.
0050: 6D 00 66 00 65 00 3A 00 m.f.e.:.
0058: 74 00 61 00 72 00 67 00 t.a.r.g.
0060: 65 00 74 00 30 00 e.t.0.

Code:

Initiator sent a task management command to reset the target. The target name is given in the dump data.

Binary data:

In Words

0000: 00340000 00000001 00000000 C0000027
0010: 00000000 00000000 00000000 00000000
0020: 00000000 00000000 00710069 002E006E
0030: 0065006E 002E0074 0061006D 006C0069
0040: 006E002E 006D0079 00650066 0074003A
0050: 00720061 00650067 00300074

In Bytes

0000: 00 00 34 00 01 00 00 00 ..4.....
0008: 00 00 00 00 27 00 00 C0 ....'..À
0010: 00 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 69 00 71 00 6E 00 2E 00 i.q.n...
0030: 6E 00 65 00 74 00 2E 00 n.e.t...
0038: 6D 00 61 00 69 00 6C 00 m.a.i.l.
0040: 2E 00 6E 00 79 00 6D 00 ..n.y.m.
0048: 66 00 65 00 3A 00 74 00 f.e.:.t.
0050: 61 00 72 00 67 00 65 00 a.r.g.e.
0058: 74 00 30 00 t.0.

Code:

The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\RaidPort1

the message resource is present but the message is not found in the string/message table

Binary data:

In Words

0000: 0018000F 00000001 00000000 80040081
0010: 00000004 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 80040081 00000000 00000000

In Bytes

0000: 0F 00 18 00 01 00 00 00 ........
0008: 00 00 00 00 81 00 04 80 ......
0010: 04 00 00 00 00 00 00 00 ........
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 00 00 00 00 00 00 00 00 ........
0030: 00 00 00 00 81 00 04 80 ......
0038: 00 00 00 00 00 00 00 00 ........

Code:

The IO operation at logical block address 0xc6ccf for Disk 2 (PDO name: \Device\00000088) was retried.

Any advice is welcome.

ApoC · Mar 5, 2015

Just for update. I reverted sources to Jan 23 (git hash 227bb2b2ea01be5c73785d6cd7fd96443e7c0bb0) and problem is gone, so definitely some kind of deadlock in recent iscsi changes.

mav@ · Mar 5, 2015

From your description I don't have much ideas. If the problem is so easily reproducible for you, could you make a full packet dump (hopefully it won't be too big) with tcpdump -pi <interface> -s 0 -w <file> port 3260.

ApoC · Mar 6, 2015

tcpdump is not a problem. In mean while I identified commit what cause this issue by git bisect:

Code:

commit 8667ca9eb4f66121c921be56e31d2a67ab73513b
Author: mav <mav@FreeBSD.org>
Date:   Thu Feb 19 14:40:50 2015 +0000

    MFC r278672: Teach CTL to ask GEOM devices about BIO_DELETE support.

mav@ · Mar 6, 2015

Hmm. Interesting. What are your ada0/ada1 devices? Are they SSDs? If so, then probably all this commit did was automatically enable UNMAP support for the LUN. I suppose that problem may be not in this commit itselt, but somewhere else, only triggered now. Actually right now there is some discussion about some UNMAP issue of gmirror on freebsd-geom mailing list.

As quick test and workaround you may try to manually disable UNMAP by adding to your LUN configuration line `option unmap off`.

ApoC · Mar 6, 2015

They are regular spinning drives, only SSD is for root and it is not exported via iSCSI. I'll try to add that option.

mav@ · Mar 6, 2015

As I can see, geom_mirror always report UNMAP support, even if built over HDDs. But I have feeling that it may not work correctly in that case. So yes, try to disable UNAMP on CTL level.

ApoC · Mar 6, 2015

Another observation, when I build debug kernel and induce the problem and try to restart ctld I am getting kernel panic with stack trace:

Code:

...
kassert_panic()
icl_conn_close()
cfiscsi_maintenance_thread()
fork_exit()
...

ApoC · Mar 6, 2015

With your suggested config option I am not able to induce the problem.

mav@ · Mar 12, 2015

I reproduced the UNMAP problem with gmirror(8). Fixed it in FreeBSD head r279913. Will merge it to stable in two weeks.

ApoC · Mar 26, 2015

Thanks for the fix.