Some thoughts after migration from istgt to ctld on 10.1

Hello,

I decided to share some thoughts with you about our migration from istgt to ctld on 10.1. We used to have five istgt daemons configured to work on different ports and servicing different users. The box itself has 130+ disks and 192G of RAM, and we have SSDs for both cache and log:

  1. It seems that ctld does not support having multiple daemons running with different configurations. After running more than one daemon, /var/log/messages gets filled with errors.
  2. Having more than one portal group in the configuration file also looks like an issue. The service starts fine, but if you try to reload the configuration, you get an error message in /var/log/messages saying that the address is already in use, and it fails to read/reload the configuration. If you restart ctld instead, it starts just fine and works.
  3. The performance is excellent, much better than what we previously had with istgt.
  4. If you have ZVOLs that were used for MS SQL databases and were created with a blocksize different from 512 bytes, moving those volumes to ctld will cause you problems, and the database will fail to start with the error:
    Code:
    Cannot use file 'Data.mdf' because it was originally formatted with sector size 512 and is now on a volume with sector size 8192. Move the file to a volume with a sector size that is the same as or smaller than the original sector size. 
    Msg 5178, Level 16, State 1, Line 1
    This is because istgt by default presents the volume as one with a blocksize of 512 bytes; ctld does not do that, it shows the actual volblocksize of the ZVOL.
  5. Our configuration has 2 x 10Gb interfaces configured in LACP. I don't know if this is the reason, but we are getting this error message every minute in /var/log/messages:
    Code:
    ctld[4704]: child process 20521 terminated with exit status 1 
    ctld[20659]: 192.168.1.1: read: connection lost
    The service works fine so far despite the errors.
  6. For volumes presented via iSCSI to ESXi hosts, in order to use them with ctld one needs to re-signature the VMFS volumes so they can be mounted and used. We are on ESXi 5.5U1; I'm pretty sure this also applies to older versions of ESXi.
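For point 4, the block size that ctld will report can be checked up front on the FreeBSD side; a quick sketch (the pool name is a placeholder for your own):

```shell
# Show the volblocksize of every ZVOL in the pool; anything other than
# 512 will be exposed to initiators as-is by ctld (unlike istgt).
zfs get -r -t volume volblocksize zroot
```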

gstat does not work in FreeBSD 10.1 like it worked in 10.0. In 10.0 it used to show statistics for the ZVOLs; now they are not showing. I have no idea what the reason for this is, and it's not related to the topic, I just thought I would mention it. Hopefully this is useful to someone. I will update as we go along with ctld.
 
Thank you for your comments. Here are some responses:
  1. That is a consequence of all requests being processed by a single CTL instance in the kernel. There is no support for multiple instances at the moment.
  2. That is odd. I cannot reproduce it. Could you show an example of a configuration with this problem?
  3. Thanks. And it will improve even more. ;)
  4. CTL reports volblocksize as physical sector size. Logical sector size is configurable.
  5. I am not sure it is related.
  6. LUNs in CTL and istgt have different identifiers, which makes the transition not completely transparent. But AFAIK the migration can be done without resignaturing: shut down istgt, initiate a full device and VMFS rescan on vSphere to lose all storages, start ctld, then initiate another full rescan.
gstat may not work if you switched your ZVOLs into the new dev mode. In that mode ZVOLs are not exposed to GEOM, and so are invisible to gstat. But you can still see block backend LUNs via iostat -xzw1 as fake cbbX devices.
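The configurable logical sector size mentioned in point 4 is a per-LUN `blocksize` statement in ctl.conf(5); a minimal sketch (the path is a placeholder):

```
lun 0 {
	path /dev/zvol/zroot/mssql
	backend block
	blocksize 512
}
```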
 
Hello,

Glad to see developers are reading the forums :):

2. I was able to reproduce it in a VM at home, when I get home I will paste the configuration file.

4. Yes, I know it's configurable, and I tried setting it to 512, but the SQL server was still detecting the volblocksize. When I created a new volume with a volblocksize of 512 and copied the database files to it, the database was able to open and run just fine. The way I see it, instead of relying on the information about the volume that it gets from the OS, MS SQL Server may be issuing SCSI commands to the block device its database files sit on, taking some value from there, and assuming that is the blocksize.

5. I am not sure either. One other thing: the server has 2 IPs, one admin IP for administrative purposes and the other purely for iSCSI traffic; the latter is not routable and is in the same VLAN as all of the servers that use iSCSI. The portal is running on the iSCSI IP, not on 0.0.0.0. I did try setting it to 0.0.0.0, but that did not change anything.

6. I did that; vSphere could see the LUNs and all, but when you try to mount the VMFS it doesn't do it. In the logs of the ESXi host you can see that it detects the devices as snapshots: the on-disk data about the device differs from the information returned by SCSI query commands. Doing a re-signature does the trick; the data is still there and all is fine, it just updates the on-disk data with what it gets from the SCSI commands.

Yes, we did switch to dev mode, because we only use ZVOLs as LUNs in ctld. Is there a way to relate those devices to volumes?
 
Here is the ctld configuration that will cause the error in /var/log/messages:
Code:
debug 10

auth-group default {
	auth-type none
}

portal-group test1 {
	listen 0.0.0.0:3260
}

portal-group test2 {
	listen 0.0.0.0:3270
}

target iqn.test1 {
	auth-group default
	portal-group test1
	lun 0 {
		path /dev/zvol/zroot/test1
		backend block
	}
}

target iqn.test2 {
	auth-group default
	portal-group test2
	lun 0 {
		path /dev/zvol/zroot/test2
		backend block
	}
}
This configuration runs fine when you start ctld, but errors with:
Code:
Dec  2 19:10:29 freebsd ctld[598]: bind(2) failed for 0.0.0.0: Address already in use
Dec  2 19:10:29 freebsd ctld[598]: failed to reload configuration
when you reload the daemon.
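For reference, the reload and restart in question are the standard rc(8) ones (assuming the stock /etc/rc.d/ctld script):

```shell
service ctld reload    # re-reads ctl.conf in place; this is the step that fails
service ctld restart   # closes the listening sockets first, so it works
```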

EDIT: One more thing, about the issue with the message about the session being dropped every minute, here is what I found in /var/log/debug.log:
Code:
Dec  2 19:53:54 sm1 ctld[648]: incoming connection; forking child process #0
Dec  2 19:53:54 sm1 ctld[3489]: accepted connection from 192.168.1.1; portal group "default"
Dec  2 19:53:54 sm1 ctld[3489]: 192.168.1.1: setting session timeout to 60 seconds
Dec  2 19:53:54 sm1 ctld[3489]: 192.168.1.1: Capsicum capability mode enabled
Dec  2 19:53:54 sm1 ctld[3489]: 192.168.1.1: beginning Login Phase; waiting for Login PDU
This message appears every minute; for some reason it tries to do an iSCSI login back on itself, if I read it correctly.
 
2. That appeared to be a bug. I've just fixed it in the HEAD branch at SVN revision 275452. Thank you for the report.

4. A volblocksize of 512 may be a bad idea unless you are using a mirrored pool of 512n disks. If you are using 4K disks or some RAIDZ, such a small block size will cause very poor space efficiency, and even in the ideal case it will create huge CPU overhead on large operations. You should probably dump/restore your database, moving it to a volume with some reasonable block size and without that hack. MSSQL may deliberately address physical blocks rather than logical ones. That makes the transition painful, but should improve performance in the end.
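A sketch of that migration path (the pool/dataset names and size are placeholders, and the copy should go through MSSQL's own backup/restore rather than a block-level copy):

```shell
# Create a replacement ZVOL with a more reasonable block size for
# 4K-sector or RAIDZ pools.
zfs create -V 200G -o volblocksize=4k tank/mssql-data-new

# Export it as a new LUN in ctl.conf, present it to the Windows host,
# then move the database with BACKUP DATABASE / RESTORE DATABASE.
```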

The X in cbbX devices comes from the absolute LUN number in CTL. You can see both the number and the volume path with ctladm devlist -v.
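So mapping a cbbX device back to its ZVOL is just:

```shell
# Lists every CTL LUN with its absolute number and backend options,
# including the /dev/zvol/... path for block-backed LUNs.
ctladm devlist -v
```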
 
Hello,

  4. We are doing just that: migrating the databases to ZVOLs with a volblocksize of 4k (since this is the biggest supported by MSSQL), and for the Oracle databases we are using 64k. The pool is the equivalent of RAID50.
5. We still can't figure out what is causing this.
 
Tonight we had some issues; /var/log/messages had lots of:
Code:
Dec  4 00:36:48 kernel: (0:13:0:0): MODE SENSE(6). CDB: 1a 08 1c 00 ff 00
Dec  4 00:36:48 kernel: (0:13:0:0): Tag: 0x3c3c36, Type: 1
Dec  4 00:36:48 kernel: (0:13:0:0): CTL Status: SCSI Error
Dec  4 00:36:48 kernel: (0:13:0:0): SCSI Status: Check Condition
Dec  4 00:36:48 kernel: (0:13:0:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
Dec  4 00:36:48 kernel: (0:13:0:0): Command byte 2 bit 5 is invalid

Dec  4 09:18:29 kernel: (1:46:0:0): MODE SENSE(6). CDB: 1a 00 1c 00 c0 00
Dec  4 09:18:29 kernel: (1:46:0:0): Tag: 0x14000000, Type: 1
Dec  4 09:18:29 kernel: (1:46:0:0): CTL Status: SCSI Error
Dec  4 09:18:29 kernel: (1:46:0:0): SCSI Status: Check Condition
Dec  4 09:18:29 kernel: (1:46:0:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
Dec  4 09:18:29 kernel: (1:46:0:0): Command byte 2 bit 5 is invalid

And at some point a lot of iSCSI sessions got dropped. This morning we also had lots of iSCSI sessions dropped, and the following messages:
Code:
Dec  5 09:14:48 (0:7:0:0): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 05 5d e8 00 00 00 01 00 00
Dec  5 09:14:48 kernel: (0:7:0:0): Tag: 0x35f132, Type: 1
Dec  5 09:14:48 kernel: (0:7:0:0): CTL Status: SCSI Error
Dec  5 09:14:48 kernel: (0:7:0:0): SCSI Status: Check Condition
Dec  5 09:14:48 kernel: (0:7:0:0): SCSI sense: MISCOMPARE asc:1d,0 (Miscompare during verify operation)

EDIT: ctld just died on us.
 
Tonight we had some issues; /var/log/messages had lots of:
Code:
Dec  4 00:36:48 kernel: (0:13:0:0): MODE SENSE(6). CDB: 1a 08 1c 00 ff 00
Dec  4 00:36:48 kernel: (0:13:0:0): Tag: 0x3c3c36, Type: 1
Dec  4 00:36:48 kernel: (0:13:0:0): CTL Status: SCSI Error
Dec  4 00:36:48 kernel: (0:13:0:0): SCSI Status: Check Condition
Dec  4 00:36:48 kernel: (0:13:0:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
Dec  4 00:36:48 kernel: (0:13:0:0): Command byte 2 bit 5 is invalid

Dec  4 09:18:29 kernel: (1:46:0:0): MODE SENSE(6). CDB: 1a 00 1c 00 c0 00
Dec  4 09:18:29 kernel: (1:46:0:0): Tag: 0x14000000, Type: 1
Dec  4 09:18:29 kernel: (1:46:0:0): CTL Status: SCSI Error
Dec  4 09:18:29 kernel: (1:46:0:0): SCSI Status: Check Condition
Dec  4 09:18:29 kernel: (1:46:0:0): SCSI sense: ILLEGAL REQUEST asc:24,0 (Invalid field in CDB)
Dec  4 09:18:29 kernel: (1:46:0:0): Command byte 2 bit 5 is invalid

It seems that your initiator tried to read the SMART configuration, which is not supported in 10.1. Some commits were made after the release to suppress those errors.

And at some point a lot of iSCSI sessions got dropped. This morning we also had lots of iSCSI sessions dropped, and the following messages:
Code:
Dec  5 09:14:48 (0:7:0:0): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 05 5d e8 00 00 00 01 00 00
Dec  5 09:14:48 kernel: (0:7:0:0): Tag: 0x35f132, Type: 1
Dec  5 09:14:48 kernel: (0:7:0:0): CTL Status: SCSI Error
Dec  5 09:14:48 kernel: (0:7:0:0): SCSI Status: Check Condition
Dec  5 09:14:48 kernel: (0:7:0:0): SCSI sense: MISCOMPARE asc:1d,0 (Miscompare during verify operation)

That may be a normal situation if several initiators are fighting over some data object. That is not necessarily an error.

EDIT: ctld just died on us.

Please tell me more about this.
 
Hi,

I will be checking the logs, but at first glance I didn't see anything strange. A few moments after a ctld reload with a new configuration, the reload took a bit more time than usual, and then ctld just vanished from the process table. I can't give you more info right now, as we are in a race to get the istgt configuration back up.

EDIT: This is the /var/log/messages content from around the moment this happened:
Code:
Dec  5 09:14:38  kernel: (0:12:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:38  kernel: (0:12:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:38  kernel: (0:50:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:38  kernel: (0:5:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:38  kernel: (0:5:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:38  kernel: WARNING:IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:38  kernel: WARNING: IP HERE (REAL  HERE iqn): no ping reply (NOP-Out) after 5 seconds; dropping connection
Dec  5 09:14:40  kernel: WARNING:IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1(0:33:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: remaining
Dec  5 09:14:40  kernel: (0:33:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: WARNING: IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remaiWARNING: IP HERE253 (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remaining
Dec  5 09:14:40  kernel: WARNING: IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remaining
Dec  5 09:14:40  kernel: (0:30:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: (0:10:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: WARNING: IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remai(0:36:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: WARNING: IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remaining
Dec  5 09:14:40  kernel: WARNING: IP HERE (REAL  HERE iqn): waiting for CTL to terminate tasks, 1 remaining
Dec  5 09:14:40  kernel: (0:7:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: (0:9:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: ning
Dec  5 09:14:40  kernel: (0:30:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: (0:36:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: (0:6:0:0): Task Action: I_T Nexus Reset
Dec  5 09:14:40  kernel: ning
Dec  5 09:14:40  kernel: (0:6:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: (0:10:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: (0:9:0:0): CTL Status: Command Completed Successfully
Dec  5 09:14:40  kernel: (0:7:0:0): CTL Status: Command Completed Successfully

One more thing: during the issue tonight we also saw a huge number of 9k jumbo cluster allocations denied. This correlated with heavy Ethernet traffic on the iSCSI LAGG interface, which is set up correctly for 9k jumbo frames.
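The denial counters we were watching come from netstat(1); this is the check (the grep pattern is an assumption about the exact output wording):

```shell
# Show mbuf usage, including the 9k jumbo cluster counters and any
# allocation requests that were denied.
netstat -m | grep -i denied
```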
 
I would say the error log looks consistent with the mentioned cluster allocation failures. The inability to send packets caused the connection terminations that were logged.
 
One more thing to add: we have been using istgt for a long time with the same number of initiators (~50) in the same network configuration, and we never had issues like that. Right now we have moved 90% of the hosts back to istgt, and so far it has worked fine. We will monitor the situation; this might be a way to isolate the issue, because we changed the FreeBSD version and the iSCSI stack at the same time. One thing I also saw in the logs is that sometimes we get up to 10 forked processes handling the iSCSI logins, and there is some default number of workers that ctld works with, 7 or 8 if I recall correctly. ctld is configured with:
Code:
maxproc 256
 
The iSCSI frontend of CTL is quite aggressive in its mbuf cluster usage. It sets socket buffer sizes to 1MB for both read and write. Make sure that you have enough mbuf clusters to cover 2MB * the number of connections.
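That sizing rule is easy to work out; a small sketch, with our own initiator count (~50) plugged in as an assumption:

```shell
# CTL's iSCSI frontend sets 1 MB socket buffers for both send and
# receive, so budget roughly 2 MB of cluster space per connection.
connections=50                       # our initiator count (assumption)
bytes_per_conn=$((2 * 1024 * 1024))  # 1 MB send + 1 MB receive
cluster_size=2048                    # default mbuf cluster size
echo $((connections * bytes_per_conn / cluster_size))
# Compare the result against the limit shown by: sysctl kern.ipc.nmbclusters
```

For 50 connections that works out to 51200 standard clusters for iSCSI alone, well under the 655356 maximum shown in the netstat -m output below.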
 
We have no issues with insufficient mbufs; here is the info from netstat -m:
Code:
36866/30949/67815 mbufs in use (current/cache/total)
4092/17168/21260/655356 mbuf clusters in use (current/cache/total/max)
This is a lot of mbufs, and this is now that we have moved most of the traffic back to the old istgt configuration.
Also from vmstat -z
Code:
128 Bucket:            1024,      0,    1930,   16634, 4642823, 408,   0
256 Bucket:            2048,      0,   15556,    5864,36894997,206729,   0
vmem btag:               56,      0, 3210763,  165145, 6267448,23776,   0
I have no idea why those 256 Bucket failure counts are nonzero. The rest of the output looks fine, and the other rows have "0" in the "FAIL" column.
 
Also from vmstat -z
Code:
128 Bucket:            1024,      0,    1930,   16634, 4642823, 408,   0
256 Bucket:            2048,      0,   15556,    5864,36894997,206729,   0
vmem btag:               56,      0, 3210763,  165145, 6267448,23776,   0

I have no idea why those 256 Bucket failure counts are nonzero. The rest of the output looks fine, and the other rows have "0" in the "FAIL" column.

Those memory types are used in a way that is expected to fail a lot. They are used for performance optimization, so those failures are invisible to applications. That is not a problem.

Have you looked at what is going on with memory consumption when the problem appears? In my tests, ZFS sometimes consumes so much RAM during active writes that the system starts killing random processes. Can you try limiting the ARC size, at least as a test?
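For reference, capping the ARC is a loader tunable; the value here is only an illustration for a 192 GB box, not a recommendation:

```
# /boot/loader.conf
vfs.zfs.arc_max="128G"
```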
 
We have also noticed extremely aggressive memory usage. We had swap configured and the system never swapped; under heavy load it would go down to as little as 1300MB free out of 190GB, and then page reclamation would kick in and free 6 or 7GB in one run. When ctld died on us, I didn't see any message from FreeBSD about not being able to fork a process, which is usually what happens when a system runs out of memory. We didn't limit the ARC on purpose, since on FreeBSD 10.0 there were always 40GB of free memory that were never used, besides the tuning that was done when the system was running 9.1. Right now we are running fine with istgt with no issues, although with much lower performance.
 
4. I've just committed a change to the FreeBSD HEAD branch that allows overriding the automatically detected sizes/offsets of physical and UNMAP blocks. I'll merge it to 10-STABLE in a week.
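If that override works like the logical blocksize one, it would presumably surface as extra LUN options in ctl.conf; the option name below is an assumption based on the commit description, so verify against ctl.conf(5) on a build that includes the merge:

```
lun 0 {
	path /dev/zvol/zroot/mssql
	backend block
	# Assumed option name: report 512-byte physical sectors
	# regardless of the ZVOL's volblocksize.
	option pblocksize 512
}
```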
 