tcsh starts slowly

cramey · Jun 2, 2019

Installed FreeBSD 12 on a fresh server for testing, noticed something really odd - intermittently tcsh takes a couple of seconds to start up, far more than it should. The system is not cpu or io bound. I even tried running tcsh with the "-f" switch, which makes tcsh "not load any resource or startup files, or perform any command hashing" assuming maybe there was something off in the various startup files, but it still does it. As a test I ran bash and zsh on the system in question, and they start up quickly, no delay. Oddly tcsh even does it when simply asking for the version:

Code:

[cramey@quill ~]# time tcsh --version
tcsh 6.20.00 (Astron) 2016-11-24 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec

real    0m2.039s
user    0m0.008s
sys     0m0.001s

Thinking maybe I somehow had a corrupt version of tcsh, I set about getting the source and compiling a clean copy. Got the newest source, compiled and ran my fresh executable, same problem:

Code:

[cramey@quill ~/tcsh/tcsh-6.21.00]$ time ./tcsh --version                      
tcsh 6.21.00 (Astron) 2019-05-08 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,rh,color,filec

real    0m2.097s
user    0m0.003s
sys     0m0.015s

Finally I tried tracing tcsh with truss ( truss -D -o csh.truss tcsh --version) just to see if something jumped out at me, and I noticed this:

Code:

0.000036734 open("/etc/group",O_RDONLY|O_CLOEXEC,0666) = 3 (0x3)
0.000026210 fstat(3,{ mode=-rw-r--r-- ,inode=15760,size=507,blksize=4096 }) = 0 
(0x0)
0.000039960 read(3,"# $FreeBSD: releng/12.0/etc/grou"...,4096) = 507 (0x1fb)
0.000025042 close(3)                             = 0 (0x0)
0.000025917 __sysctl(0x7fffffffe530,0x2,0x7fffffffe640,0x7fffffffe528,0x0,0x0) =
 0 (0x0)
0.000023129 sigaction(SIGCHLD,{ SIG_DFL 0x0 ss_t },0x0) = 0 (0x0)
0.000032588 pipe2(0x7fffffffe508,0)              = 0 (0x0)
0.000020420 dup(3)                               = 5 (0x5)
0.000018606 dup(5)                               = 6 (0x6)
0.000019262 close(5)                             = 0 (0x0)
0.000018661 close(3)                             = 0 (0x0)
0.000019202 fcntl(6,F_SETFD,FD_CLOEXEC)          = 0 (0x0)
0.000018800 dup(4)                               = 3 (0x3)
0.000022162 dup(3)                               = 5 (0x5)
0.000019721 dup(5)                               = 7 (0x7)
0.000021148 close(5)                             = 0 (0x0)
0.000019755 close(3)                             = 0 (0x0)
0.000019503 close(4)                             = 0 (0x0)
0.000019636 fcntl(7,F_SETFD,FD_CLOEXEC)          = 0 (0x0)
0.000366862 fork()                               = 48750 (0xbe6e)
0.000020046 close(7)                             = 0 (0x0)
2.042416266 read(6,0x7fffffffc500,8192)          = 0 (0x0)

Anyone have any idea what might be causing this?

D-FENS · Jun 3, 2019

It seems that the executable is spending a lot of time reading files.
Check the manual page and try to remove all configuration files and init scripts from the picture. Also check the rest files for sanity (are the files extraordinarily big, is the content sensible?):

Code:

       /bin/sh         
       /etc/csh.cshrc 
       /etc/csh.login 
       /etc/csh.logout
       /etc/passwd     
       /tmp/sh*       
       ~/.cshdirs (+) 
       ~/.cshrc       
       ~/.history     
       ~/.login       
       ~/.logout       
       ~/.tcshrc (+)

If the runs are still slow, try to monitor the disk activity. What exactly is being read and how much data is transferred?

Code:

iostat -w3
systat -iostat
gstat

There was a thread in the forum explaining how to log which files are being read and written in real time. I'll try to dig it out.

Another option would be to edit the startup scripts and add " echo XXX: $(date)" everywhere and see what exactly consumes so much time.

SirDice · Jun 3, 2019

What's in /etc/nsswitch.conf?

12-STABLE:

Code:

dice@maelcum:~ % /usr/bin/time tcsh --version
tcsh 6.20.00 (Astron) 2016-11-24 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
        0.00 real         0.00 user         0.00 sys

12-STABLE:

Code:

dice@molly:~ % /usr/bin/time tcsh --version
tcsh 6.20.00 (Astron) 2016-11-24 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
        0.00 real         0.00 user         0.00 sys

12.0-RELEASE:

Code:

dice@case:~ % /usr/bin/time tcsh --version
tcsh 6.20.00 (Astron) 2016-11-24 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
        0.00 real         0.00 user         0.00 sys

11.2-RELEASE:

Code:

dice@wintermute:~ % /usr/bin/time tcsh --version
tcsh 6.20.00 (Astron) 2016-11-24 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec
        0.00 real         0.00 user         0.00 sys

cramey · Jun 5, 2019

Although I did go through all the startup scripts, I couldn't find anything. Even if there was something in there, would it manifest when you're just asking for the version? nsswitch.conf was never modified, so whatever came with the original install of 12.0-RELEASE. As I said in the original post, the system wasn't I/O bound.

Ultimately I fixed this by doing new install. As an experiment I copied over the original /etc and /home onto the new install, but the problem never came back. My best guess is silent on-disk corruption of some system library. In retrospect, I should have tested the integrity of the system with freebsd-update IDS before resorting to a new install.

_martin · Jun 6, 2019

It would have helped if you used -f (follow fork) too. I don't use csh but when I tested it I noticed different behavior when launched from terminal and when from ssh session. The later was attempting to resolve the IP of itself. That's why SirDice question about /etc/nsswitch.conf makes sense. IPv6 can cause lots of headache too.

Crivens · Apr 5, 2021

Exhumed this thread as this is annoying me enough.

Truss showed a speedy start up to a sendto() operation when the shell tries to send some binary stuff to your dns (192.168.0.1:53 in my case). Then the call times out after 2 seconds.

As my machine is running ZFS, I would rule out corruption of files.

Any ideas?

Edit; switch off wlan-> Fast start.
Switch on wlan-> slow start.

And I thought the days of the turbo switches were gone...

Crivens · Apr 5, 2021

Looks like there is garbage sent to the dns. Also what is done with files looks fishy. Out of curiosity, I went to /usr/bin/csh and did a "make clean; scan-build10 make".
Looks like this is one place to have a really good look into. If the messages scrolling by are anything to go by, there is need for some null pointer exorcism.

I will investigate this closer.

_martin · Apr 5, 2021

The only time I was able to replicate the behavior was when I set the bogus DNS server in resolv.conf. But that makes sense as it can't resolve anything. From the initial trace OP did it seems problem is elsewhere (close of fd 7 which seems to be a result of some fd trampoline pointing to pipe2()).

Did you try to test this with no DNS server in resolv.conf ? What is being sent to the DNS server (tcpdump) ?

Crivens · Apr 6, 2021

Ok, I'm not going after the scan-build results. Chains > 1200 are not my cup of tea right now.
But whar I found out: csh on tty[1..8] - no delay. Csh on xterm - delay. Same users same config.

This is a bit confusing.

covacat · Apr 6, 2021

looks like it tries to resolve the host you are logged from
the from column in w(1) which is - for ttyv*
lots of dn servers timeout on non routable ips ptr queries

Crivens · Apr 6, 2021

Same system, same user, same configs. Only once on tty2 and once in an xterm.
That should come out identical, network wise, right?

covacat · Apr 6, 2021

192.168.1.1 is unreachable and times out

tcpdump -w zzzz -vvv dst port 53 >/dev/null 2>&1 &

# cat /etc/resolv.conf && time tcsh --version

# Generated by resolvconf
nameserver 192.168.1.1

tcsh 6.21.00 (Astron) 2019-05-08 (x86_64-amd-FreeBSD) options wide,nls,dl,al,kan,sm,rh,color,filec

real 0m2.012s
user 0m0.003s
sys 0m0.006s

# tcpdump -r zzzz -nl
reading from file zzzz, link-type EN10MB (Ethernet)

20:13:07.295916 IP 89.40.XXX.YYY.21364 > 192.168.1.1.53: 54560+ A? myloginhost.tld. (27)

just check for someething similar and see what you get

_martin · Apr 6, 2021

Crivens said:
Same system, same user, same configs. Only once on tty2 and once in an xterm.
That should come out identical, network wise, right?

Can you share the tcpdump and truss output from both cases?

Crivens · Apr 6, 2021

That might take some time.
What I see in truss is that the first fd used in the slow case is 4. In the fast case it is 3. The is the first diff. Later there are huuuge diffs.
Also, looping back from the xterm by ssh 127.0.0.1 speeds up things also.

_martin · Apr 6, 2021

It's interesting why is this happening. With proper DNS settings I can't replicate this at all. Neither from PuTTY session nor when I access ttys directly (in VM).
Tried even the X session, different users..
To rule out possible environment settings can you test this with newly created test user ?

Crivens · Apr 6, 2021

Why should a shell do a DNS query anyway? Someone please explain. Like I'm 6 years old.

_martin · Apr 6, 2021

Maybe it wants to set the PS1 which has host name in it.

Crivens · Apr 6, 2021

... and that is known by the DNS but not the local machine... ?

covacat · Apr 6, 2021

from src/contrib/tcsh/FAQ

21. I start tcsh and it takes a couple of minutes to get the prompt.
You have defined REMOTEHOST and your DNS is not responding. Either
undefine REMOTEHOST and recompile or fix your DNS.

and in tc.func.c

Code:

#ifdef REMOTEHOST

static void
palarm(int snum)
{
    USE(snum);
    _exit(1);
}

......
void
remotehost(void)
{
....
pid = fork();
    if (pid == 0) {
        sigset_t set;
        xclose(fds[0]);
        /* Don't get stuck if the resolver does not work! */
        signal(SIGALRM, palarm);
        sigemptyset(&set);
        sigaddset(&set, SIGALRM);
        (void)sigprocmask(SIG_UNBLOCK, &set, NULL);
        (void)alarm(2);
......

Crivens · Apr 6, 2021

covacat Good find. But does it explain why tty is fast and xterm is slow? I'll check more on that tomorrow.

covacat · Apr 6, 2021

i dont have X/xterm now and cant remember what is shown in w(1) in the From: column. for console is a dash which may be ignored

_martin · Apr 6, 2021

Take my statement above with a grain of salt. We did suspect DNS it at fault but without further information it's just a guessing game.
Those tcpdumps would help others to see what was sent to your DNS server.

Crivens · Apr 6, 2021

I got a workaround.
The delay is from asking the dns for unix.home, no idea where that name comes from...
So a quick vi of etc/hosts can fix that.

tcsh starts slowly

cramey

D-FENS

SirDice

Administrator

cramey

_martin

Crivens

Administrator

Crivens

Administrator

_martin

Crivens

Administrator

covacat

Crivens

Administrator

covacat

_martin

Crivens

Administrator

_martin

Crivens

Administrator

_martin

Crivens

Administrator

covacat

Crivens

Administrator

covacat

_martin

Crivens

Administrator