Winbind AD dropping every 10 hours

We are experiencing the same symptoms as https://lists.samba.org/archive/samba/2017-November/212066.html with the latest samba package (4.8.11) on FreeBSD 12 (also tested on 11.2). Basically, winbind stops accepting connections (SMB logins, SSH logins, anything that runs through Active Directory authentication) every 10 hours. That matches the Kerberos ticket lifetime, so it seems likely that the system's Kerberos tickets aren't being refreshed. Things automatically start working again 2 hours later, though.

FreeBSD 12
Samba 4.8.11
AD Member Server contacting Windows AD 2012 Domain Controllers

I'm trying to figure out if we have something odd in our config and I'd appreciate any input. This may be better placed in a Samba forum but since we're using FreeBSD...

smb4.conf
Code:
[global]
    realm = DOMAIN.EDU
    security = ADS
    workgroup = DOMAIN
    netbios name = METRO-SEQUOIAFB
    #server string = FreeBSD Test Server

    server role = member server
    #DV#encrypt passwords = yes

    dedicated keytab file = /etc/krb5.keytab
    kerberos method = secrets and keytab

    log level = 99
    max log size = 0
    logging = file

    # FreeNAS only
    #ads dns update = yes

    ###acl allow execute always = true
    #DV#allow trusted domains = yes
    ###client ldap sasl wrapping = plain
    client ldap sasl wrapping = seal
    #DV#client use spnego = yes
    #DV#client ntlmv2 auth = yes
    #deadtime = 15
    deadtime = 0
    directory name cache size = 0
    disable spoolss = yes
    dns proxy = no
    #DV#domain logons = no
    domain master = no
    ea support = yes
    #DV#getwd cache = yes
    #DV#guest account = nobody
    ###hostname lookups = yes

    #idmap config *: backend = autorid
    ##DV#idmap config *: ignore builtin = no
    #idmap config *: range = 10000-90000000
    #idmap config *: rangesize = 1000000
    ##DV#idmap config *: read only = no

    idmap config AD3 : backend     = ad
    idmap config AD3 : range       = 5000-3999999999
    idmap config AD3 : schema_mode = rfc2307
    idmap config AD3 : unix_primary_group = yes

    idmap config OU : backend     = autorid
    idmap config OU : range       = 4000000000 - 4099999999
    idmap config OU : schema_mode = rfc2307
    idmap config OU : unix_primary_group = yes

    idmap config * : backend = tdb
    idmap config * : range = 4100000000-4200000000

    kernel change notify = no
    ###lm announce = yes
    load printers = no
    local master = no
    ###map to guest = Bad User
    max open files = 234731
    #DV#multicast dns register = yes
    nsupdate command = /usr/local/bin/samba-nsupdate -g
    ntlm auth = no
    ###obey pam restrictions = yes
    #DV#oplocks = yes
    panic action = /usr/local/libexec/samba/samba-backtrace
    preferred master = no
    printcap name = /dev/null
    printing = bsd

    server max protocol = SMB3
    server min protocol = SMB2_02

    #strict locking = no
    strict locking = auto
    #time server = yes

    template homedir = /sequoia-NEW/sequoia/users/%U
    template shell = /usr/local/bin/bash

    winbind cache time = 7200
    winbind enum groups = yes
    winbind enum users = yes
    winbind nested groups = yes
    #winbind nss info = rfc2307
    winbind nss info = template
    ###winbind offline logon = yes
    winbind refresh tickets = yes
    #DV#winbind use default domain = no
    winbind:ignore domains  = lots of domains

        # https://lists.samba.org/archive/samba/2014-March/179632.html
        kdc:service ticket lifetime = 24
        kdc:user ticket lifetime = 24
        kdc:renewal lifetime = 120


        # Defaults for all shares
    #create mask = 0666
    create mask = 0000
    #directory mask = 0777
    directory mask = 0000
    dos charset = CP437
    dos filemode = yes
    store dos attributes = yes
    unix charset = UTF-8

    #DV#access based share enum = no
    #DV#browseable = yes
    #DV#guest ok = no
    #DV#hide dot files = yes
    inherit owner = yes
    #DV#printable = no
    read only = no
    veto files = /.snapshot/.windows/.mac/.zfs/
    vfs objects = zfsacl streams_xattr

    #DV#nfs4:mode = simple
    #DV#nfs4:acedup = dontcare
    #DV#nfs4:chown = no
    zfsacl:acesort = yes

    # https://www.samba.org/samba/docs/current/man-html/vfs_zfsacl.8.html
    # https://www.samba.org/samba/docs/current/man-html/smb.conf.5.html#ACLMAPFULLCONTROL
    acl map full control = no

/etc/krb5.conf
Code:
[logging]
   Default = FILE:/var/log/krb5.log

[libdefaults]
   default_realm = domain.edu

   dns_lookup_realm = true
   dns_lookup_kdc   = true
   ticket_lifetime  = 24h
   renew_lifetime   = 7d
   forwardable      = true
   rdns             = false

[realms]
   domain.edu = {
      kdc            = tcp/domain.edu
      admin_server   = domain.edu
      kpasswd_server = domain.edu
      auth_to_local = RULE:[1:$0\$1](^OU\.AD3\.domain\.EDU\\.*)s/^OU\.AD3\.domain\.EDU/OU/
      auth_to_local = RULE:[1:$0\$1](^AD3\.domain\.EDU\\.*)s/^AD3\.domain\.EDU/AD3/
      auth_to_local = DEFAULT
   }


[domain_realm]
   .domain.edu = domain.edu
    domain.edu = domain.edu

   .ad3.domain.edu    = AD3.domain.EDU
    ad3.domain.edu    = AD3.domain.EDU

[login]
   krb4_convert = true
   krb4_get_tickets = false


[appdefaults]
   pam = {
# mappings = regex1 regex2 [...]

#  specifies that pam_krb5 should derive the user's principal name from
#  the Unix user name by first checking if the user name matches regex1,
#  and formulating a principal name using regex2. For example, "mappings
#  = EXAMPLE\(.*) $1@EXAMPLE.COM" would map any user with a name of the
#  form "EXAMPLE\whatever" to a principal name of
#  "whatever@EXAMPLE.COM". This is primarily targeted at allowing
#  pam_krb5 to be used to authenticate users whose user information is
#  provided by winbindd(8). This will frequently require the reverse to
#  be configured by setting up an auth_to_local rule elsewhere in
#  krb5.conf(5).

      mappings = OU\\(.*) $1@domain.edu
      reverse_mappings = (.*)@OU\.AD3\.domain\.EDU OU\$1
      mappings = AD3\\(.*) $1@AD3.domain.EDU
      reverse_mappings = (.*)@AD3\.domain\.EDU AD3\$1

      forwardable = true
      validate = true
   }
 
You indicate that winbind stops accepting connections. How do you know specifically it is not accepting connections? Or is it just an observation based only upon the symptom?

I ask because, absent any data from a log, a packet capture via tcpdump might provide better insight into what winbind is complaining about. If winbind is putting RST packets back on the wire, that is a clue. If inbound traffic is never seen on the wire, that is a clue. If winbind is pestering the domain controllers and not getting a response, that is a clue.

I know you are already off in the weeds on this, but it feels like it would be helpful to check what is actually traversing the wire.
 
We have exactly the same problem (although on FreeBSD 11.3 with Samba 4.10.11).

Our findings so far:
It is indeed the expiration of the Kerberos ticket (10h, the default for Windows DCs) that causes the problem, but *not* the Kerberos ticket of the process that binds to the domain ("winbindd: domain child [AD3] (winbindd)" in your case). That ticket seems to be refreshed just fine. Look for "Current tickets expire ..." messages in log.wb-AD3 (log level 7). They count down from 36000 secs and can eventually go negative (i.e. the ticket has expired):

Current tickets expire in 2187 seconds (at 1577548806, time is now 1577546619)
Current tickets expire in 2178 seconds (at 1577548806, time is now 1577546628)
Current tickets expire in 243 seconds (at 1577548806, time is now 1577548563)
Current tickets expire in 243 seconds (at 1577548806, time is now 1577548563)
Current tickets expire in 231 seconds (at 1577548806, time is now 1577548575)
Current tickets expire in -1617 seconds (at 1577548806, time is now 1577550423)
Current tickets expire in 36000 seconds (at 1577586423, time is now 1577550423)
Current tickets expire in 35991 seconds (at 1577586423, time is now 1577550432)

After that, *this* ticket gets refreshed as expected.
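To spot these countdown messages without reading the log by eye, something along these lines might help. It is only a sketch: it assumes the message format is exactly what is quoted above, and the names EXPIRY_RE, find_expired and SAMPLE are made up for this example. It returns only the entries where the remaining time is negative.

```python
import re

# Matches the "Current tickets expire in ..." message quoted above.
EXPIRY_RE = re.compile(
    r"Current tickets expire in (-?\d+) seconds "
    r"\(at (\d+), time is now (\d+)\)"
)


def find_expired(lines):
    """Return (remaining, expires_at, now) for entries whose ticket had already expired."""
    expired = []
    for line in lines:
        m = EXPIRY_RE.search(line)
        if m is None:
            continue  # not a ticket countdown message
        remaining, expires_at, now = (int(g) for g in m.groups())
        if remaining < 0:
            expired.append((remaining, expires_at, now))
    return expired


# Three of the log lines quoted above.
SAMPLE = [
    "Current tickets expire in 231 seconds (at 1577548806, time is now 1577548575)",
    "Current tickets expire in -1617 seconds (at 1577548806, time is now 1577550423)",
    "Current tickets expire in 36000 seconds (at 1577586423, time is now 1577550423)",
]

if __name__ == "__main__":
    for remaining, expires_at, now in find_expired(SAMPLE):
        print(f"ticket had been expired for {-remaining}s at {now}")
```

To run it against a real log, pass the open file object instead of SAMPLE, since find_expired only needs an iterable of lines.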

But the refreshing of the GSSAPI ticket for the openldap-sasl-client (with GSSAPI=on) that is used for the idmapper (process name: "winbindd: idmap child (winbindd)") seems to be the problem: when this ticket is expired, a connection to the DC (LDAP port) is established and stays open for 2 hours (i.e. 7200000 msecs, which is exactly the value of net.inet.tcp.keepidle).
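To make the timing concrete, here is a small sketch of when the stalls would be expected. It assumes (and this is only my reading of the logs) that each stall starts when the 10h ticket expires, lasts for net.inet.tcp.keepidle, and that a fresh 10h ticket is obtained right after the socket times out. The function name outage_windows is made up for this example.

```python
TICKET_LIFETIME_S = 36000   # 10 h, the starting value of the countdown in the log
KEEPIDLE_MS = 7_200_000     # net.inet.tcp.keepidle as quoted above


def outage_windows(count, lifetime_s=TICKET_LIFETIME_S, keepidle_ms=KEEPIDLE_MS):
    """Return (start, end) offsets in seconds since winbindd start.

    Assumption: a stall begins at ticket expiry and ends when the idle
    TCP connection times out, after which a new ticket is acquired.
    """
    stall_s = keepidle_ms // 1000
    windows = []
    start = lifetime_s
    for _ in range(count):
        end = start + stall_s
        windows.append((start, end))
        start = end + lifetime_s  # next expiry, counted from the fresh ticket
    return windows


if __name__ == "__main__":
    for start, end in outage_windows(3):
        print(f"stall from {start / 3600:.0f}h to {end / 3600:.0f}h after start")
```

With the values above the first stall would run from hour 10 to hour 12, which fits the "every 10 hours, back after 2 hours" pattern from the first post.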

Looking at the samba code (source3/winbindd/winbindd_dual.c), the child queue is deliberately blocked "until we get the response from the child". Consequently we find the log entries "keep orphaned subreq" and "cleanup orphaned subreq" in the log.winbindd when the lockup happens. Actually, the refresh of the krb ticket seems not to be the problem either, since after the TCP socket times out the ticket is refreshed as expected (grep for "Starting GENSEC submechanism gssapi_krb5" in log.winbindd-idmap, loglevel 5).

BTW, setting "gensec_gssapi:requested_life_time = <int> # seconds" in smb4.conf lets you set the ticket lifetime. That helps tremendously with debugging (it is no fun to wait 10h for a 2h debug window ...).

Why do I assume that we are dealing with two different Kerberos tickets? Because the GSSAPI ticket for idmap seems to be acquired right after samba/winbind is started, while the "domain join" ticket seems to be fetched quite a bit later (~30 minutes after winbind start).

What we still don't understand is why the connection to the DC's LDAP port stays open although the krb ticket is expired. An interactive ldapsearch handles an expired (file based) ticket as one would expect and does not block:

# ldapsearch -h dc_fqdn -N -Y GSSAPI ...
SASL/GSSAPI authentication started
ldap_sasl_interactive_bind_s: Local error (-2)
additional info: SASL(-1): generic failure: GSSAPI Error: The context has expired (Success)

For now, we "pkill -9 -f 'winbindd: idmap child'" via cron every 8 hours. A new idmap child is spawned automatically with a fresh 10h ticket. Ugly, but the only workaround we found so far.
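If you would rather script the workaround than call pkill directly (for instance to log which PIDs were killed), a rough Python equivalent could look like the following. This is a sketch, not something we run in production: it assumes the process title shows up in "ps -axo pid,command" output the same way pkill -f sees it, and the SAMPLE text is made-up ps output for illustration only.

```python
import os
import signal
import subprocess

TITLE = "winbindd: idmap child"


def idmap_child_pids(ps_output):
    """Extract PIDs from `ps -axo pid,command` style output whose command contains TITLE."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 1)
        # Skip the header line and anything that does not start with a PID.
        if len(fields) == 2 and fields[0].isdigit() and TITLE in fields[1]:
            pids.append(int(fields[0]))
    return pids


def kill_idmap_children():
    """Send SIGKILL to every idmap child, like the pkill -9 call above."""
    out = subprocess.run(
        ["ps", "-axo", "pid,command"],
        capture_output=True, text=True, check=True,
    ).stdout
    killed = idmap_child_pids(out)
    for pid in killed:
        os.kill(pid, signal.SIGKILL)
    return killed


# Hypothetical ps output, for illustration only.
SAMPLE = """\
  PID COMMAND
 1201 winbindd: domain child [AD3] (winbindd)
 1202 winbindd: idmap child (winbindd)
 1300 /usr/sbin/sshd
"""

if __name__ == "__main__":
    print(idmap_child_pids(SAMPLE))
```

Only idmap_child_pids is exercised in the example; kill_idmap_children needs root and should be called from whatever cron wrapper you use.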
 

Thank you twerschlein for these insights. It helped us a lot, since we experience the very same problems.

Our servers run 12.1-RELEASE-p1 with samba410-4.10.11. These servers also act as NFS servers. We noticed that gssd also dies once in a while when the Winbind problem occurs. We also noticed that "pkill -9 -f 'winbindd: idmap child'" via cron every 8 hours does not always prevent the problem from occurring.

I was wondering, did you look further into this problem, and if so, what are your findings? And did you report this issue to the Samba team?

Thanks!
Remy
 