Crash & reboot after pkg add...

jb82 · Sep 16, 2024

I was looking for vscode and found it in /var/cache/pkg. So I typed:

doas pkg add ./vscode-1.90.2_1.pkg

About 10 seconds later, after a line about installing a krbXX dependency appeared, the machine rebooted.

I do not consider this normal. Machine HW is absolutely stable.

Questions:
- What might have happened?
- How to tackle/debug such problems?

Cath O'Deray · Sep 17, 2024

Which version of FreeBSD, exactly?

Port packages from quarterly, or latest?

freebsd-version -kru ; uname -aKU

pkg -vv | grep -B 1 -e url -e priority

Is there kernel panic information?

/var/crash

jb82 · Sep 18, 2024

Cath O'Deray said:
Which version of FreeBSD, exactly?

Port packages from quarterly, or latest?

freebsd-version -kru ; uname -aKU

pkg -vv | grep -B 1 -e url -e priority

Is there kernel panic information?

/var/crash

It's a fully up-to-date 14.1-RELEASE with the GENERIC kernel. At that time it was on quarterly, fully synced via pkg upgrade -f. The vscode package, however, was left over from a previous switch to latest. I just tried to pkg add it and the machine crashed. Now I'm back on latest. There is nothing in /var/crash, though. I'm on the NVIDIA drm kmod, which has sometimes caused trouble, but otherwise it's pretty stable.

Info:

Code:

14.1-RELEASE-p4
14.1-RELEASE-p4
14.1-RELEASE-p4

FreeBSD p1.home 14.1-RELEASE-p4 FreeBSD 14.1-RELEASE-p4 GENERIC amd64 1401000 1401000

FreeBSD: {
    url             : "pkg+https://pkg.FreeBSD.org/FreeBSD:14:amd64/latest",
    enabled         : yes,
    priority        : 0,

SirDice · Sep 18, 2024

jb82 said:
The vscode however was there from previous switch to latest. And I just tried to pkg add it and it crashed.

"Installing" a package is really nothing more than extraction of an archive and adding the necessary data to the package database. The editors/vscode package might be large though, so lots of files to extract and decompress. That puts quite a bit of load on I/O, memory and some CPU.

I wonder if you can trigger a crash just by extracting a large, compressed, archive.
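One way to try that is sketched below. It is only a sketch: the function name stress_extract and its two arguments are made up for illustration, and it pushes one large file through tar and gzip, so it exercises bulk I/O rather than the many small files a real package contains.

```shell
#!/bin/sh
# Hypothetical helper (not part of pkg or FreeBSD): build one compressed
# archive of SIZE_MB random megabytes, then extract and verify it ROUNDS times.
# Usage: stress_extract SIZE_MB ROUNDS
stress_extract() {
    size_mb=${1:-1024}
    rounds=${2:-10}
    work=$(mktemp -d) || return 1

    # Random data does not compress, so both the decompressor and the
    # disk have to handle the full payload size.
    mkdir "$work/src" || return 1
    dd if=/dev/urandom of="$work/src/payload.bin" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    tar -czf "$work/test.tar.gz" -C "$work" src || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        mkdir "$work/out" || return 1
        tar -xzf "$work/test.tar.gz" -C "$work/out" || return 1
        # Compare with the original to catch silent corruption.
        if ! cmp -s "$work/src/payload.bin" "$work/out/src/payload.bin"; then
            echo "round $i: extracted data differs" >&2
            return 1
        fi
        rm -rf "$work/out"
        echo "round $i ok"
        i=$((i + 1))
    done
    rm -rf "$work"
}
```

For example, stress_extract 4096 20 would cycle a 4 GiB payload twenty times. The scratch directory comes from mktemp, so set TMPDIR to a path on the pool under test if /tmp lives elsewhere.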

dark.initr0 · Sep 18, 2024

Hello!
What about free space?
So has VSCode been installed?
Can you repeat your command with "-d" (debug mode)? - "pkg -d add ..."
and maybe redirect the output to a file to analyze later?
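A small wrapper for capturing that output could look like the sketch below. The name run_logged and the log file name are hypothetical. It writes stdout and stderr to the log, then calls sync; if the kernel panics mid-run, the last few seconds of output may still be lost.

```shell
#!/bin/sh
# Hypothetical helper: run a command with stdout and stderr captured in a
# log file, flush filesystem buffers, and hand back the command's exit status.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    status=$?
    sync    # push the log to disk as soon as the command returns
    return "$status"
}
```

With the command from the first post this would be: run_logged pkg-add-debug.log doas pkg -d add ./vscode-1.90.2_1.pkg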

jb82 · Sep 18, 2024

SirDice said:
"Installing" a package is really nothing more than extraction of an archive and adding the necessary data to the package database. The editors/vscode package might be large though, so lots of files to extract and decompress. That puts quite a bit of load on I/O, memory and some CPU.

I wonder if you can trigger a crash just by extracting a large, compressed, archive.

That's very strange. This is an older Lenovo P1 Gen 2 with a Xeon and a dedicated NVIDIA GPU, otherwise pretty stable. It has the stock Samsung 1 TB NVMe disk with a zpool on it and >500 GB of free space. I routinely compile big stuff like world/kernel, or even Firefox, with no problems detected. But I'm really interested in what might've happened. There was no power outage; the machine just froze for a while, I saw some line about a krb dependency and... it rebooted. From the system's perspective I'd be interested in how such failures might be debugged.

Of course, it might be a HW problem, a firmware bug, whatever. I'll try what you suggest and simulate some huge extractions, but I guess that compiling those huge projects poses a similar load, perhaps with a tad less pressure on I/O.

If it was Windows, I wouldn't even bother

But in FreeBSD, we're perfectionists, correct?

jb82 · Sep 20, 2024

SirDice said:
I wonder if you can trigger a crash just by extracting a large, compressed, archive.

I guess I found the cause. Crashes from time to time.

[Attachment 20394: photograph of the kernel panic screen]

fmc000 · Sep 20, 2024

Looks like a ZFS bug

cracauer@ · Sep 20, 2024

Time to run hardware testing.

jb82 · Sep 20, 2024

cracauer@ said:
Time to run hardware testing.

E.g. an extended smartctl test shows absolutely no issues; the NVMe disk seems absolutely fine... weird...

cracauer@ · Sep 20, 2024

I'm more thinking memtest, mprime/prime95 and superpi.

jb82 · Sep 21, 2024

cracauer@ said:
I'm more thinking memtest, mprime/prime95 and superpi.

I see. You know, this particular machine has a Xeon + ECC memory, so it's less likely, but still. I'm gonna run the built-in Lenovo UEFI Diagnostics through the night. We can rule out the disks though.

jb82 · Sep 21, 2024

...so neither the NVMe nor the extended memory (ECC) test (run through the UEFI firmware) exhibits any HW issues. Starting to think about a ZFS issue...

fmc000 · Sep 21, 2024

jb82 said:
...so neither the NVMe nor the extended memory (ECC) test (run through the UEFI firmware) exhibits any HW issues. Starting to think about a ZFS issue...

That was my idea from the start. Maybe try scrubbing the pool.

jb82 · Sep 21, 2024

fmc000 said:
That was my idea from the start. Maybe try scrubbing the pool.

I have a 10-day automatic scrub period. It's a very simple single-vdev (single-disk) zpool. All looks good there:

Code:

        NAME          STATE     READ WRITE CKSUM
        zroot         ONLINE       0     0     0
          nda1p3.eli  ONLINE       0     0     0

errors: No known data errors

But I already have one OS core dump, I'm gonna look into that soon.

fmc000 · Sep 21, 2024

Following cracauer@'s suggestion, I'd let memtest run overnight, not just for a few minutes or hours, to check for RAM issues. The stack trace is definitely ZFS-related, but hardware errors should always be considered, and this morning you posted two messages less than four hours apart, definitely not enough time for a thorough memtest session in my experience.

jb82 · Sep 21, 2024

fmc000 said:
Following cracauer@'s suggestion, I'd let memtest run overnight, not just for a few minutes or hours, to check for RAM issues. The stack trace is definitely ZFS-related, but hardware errors should always be considered, and this morning you posted two messages less than four hours apart, definitely not enough time for a thorough memtest session in my experience.

I already compiled a custom kernel with a single watch for this particular null ptr fault. A 12-hour memory test (64 GB ECC RAM) is scheduled as well.

Barney · Sep 21, 2024

jb82 said:
I guess I found the cause. Crashes from time to time.

View attachment 20394

Text dump is your friend?

You're sure the package wasn't cached from some older version of FreeBSD?

jb82 · Sep 21, 2024

Barney said:
You're sure the package wasn't cached from some older version of FreeBSD?

Yes, I am.

For completeness:

Code:

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b32767 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff80b32c3e in vpanic (fmt=0xffffffff8115fdb8 "%s", ap=ap@entry=0xfffffe0226645750) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff80b32a93 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff8100091b in trap_fatal (frame=0xfffffe0226645830, eva=0) at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff81000966 in trap_pfault (frame=<unavailable>, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  zfs_ace_fuid_size (acep=0x0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:258
#9  0xffffffff8219157d in zfs_acl_next_ace (aclp=aclp@entry=0xfffff8052bfab980, start=start@entry=0x0, who=who@entry=0xfffffe0226645988, access_mask=access_mask@entry=0xfffffe0226645998,
    iflags=iflags@entry=0xfffffe02266459b6, type=type@entry=0xfffffe02266459b4) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:613
#10 0xffffffff82194651 in zfs_zaccess_aces_check (zp=0xfffff80025342570, working_mode=0xfffffe0226645a00, anyaccess=anyaccess@entry=0, cr=0xfffff80049ea2600)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:2130
#11 0xffffffff82194a93 in zfs_zaccess_common (zp=0x0, zp@entry=0xfffff80025342570, v4_mode=<optimized out>, working_mode=0x0, working_mode@entry=0xfffffe0226645a00,
    check_privs=check_privs@entry=0xfffffe02266459fc, skipaclchk=644110774, skipaclchk@entry=0, cr=0xfffffe02266459b4, cr@entry=0xfffff80049ea2600)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:2288
#12 0xffffffff82193e0c in zfs_zaccess (zp=zp@entry=0xfffff80025342570, mode=<optimized out>, flags=flags@entry=0, skipaclchk=skipaclchk@entry=0, cr=cr@entry=0xfffff80049ea2600, mnt_ns=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:2388
#13 0xffffffff82194b49 in zfs_zaccess_rwx (zp=0x0, zp@entry=0xfffff80025342570, mode=<optimized out>, flags=644110728, flags@entry=0, cr=0xfffffe02266459b6, cr@entry=0xfffff80049ea2600, mnt_ns=mnt_ns@entry=0x0)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_acl.c:2474
#14 0xffffffff82335234 in zfs_access (zp=zp@entry=0xfffff80025342570, mode=256, flag=flag@entry=0, cr=0xfffff80049ea2600) at /usr/src/sys/contrib/openzfs/module/zfs/zfs_vnops.c:205
#15 0xffffffff821a2989 in zfs_freebsd_access (ap=0xfffffe0226645ab0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4513
#16 0xffffffff80c2d7a4 in VOP_ACCESS (vp=0xfffff80025346380, accmode=<optimized out>, cred=0xfffff80049ea2600, td=<optimized out>) at ./vnode_if.h:341
#17 vn_open_vnode (vp=0xfffff80025346380, fmode=fmode@entry=1048577, cred=cred@entry=0xfffff80049ea2600, td=td@entry=0xfffff80538064740, fp=0xfffff804add49550) at /usr/src/sys/kern/vfs_vnops.c:437
#18 0xffffffff80c2d380 in vn_open_cred (ndp=ndp@entry=0xfffffe0226645c90, flagp=flagp@entry=0xfffffe0226645da4, cmode=cmode@entry=0, vn_open_flags=vn_open_flags@entry=16, cred=0xfffff80049ea2600,
    fp=0xfffff804add49550) at /usr/src/sys/kern/vfs_vnops.c:337
#19 0xffffffff80c239a8 in openatfp (td=0xfffff80538064740, dirfd=-100, path=0x1cfdaa000158 <error: Cannot access memory at address 0x1cfdaa000158>, pathseg=pathseg@entry=UIO_USERSPACE, flags=1048577,
    mode=<optimized out>, fpp=0x0) at /usr/src/sys/kern/vfs_syscalls.c:1173
#20 0xffffffff80c2371d in kern_openat (dirfd=-2109805088, path=0xfffffe0226645988 "\020]d&\002\376\377\377\200&4%", pathseg=UIO_USERSPACE, flags=644110774, mode=644110772, td=<optimized out>)
    at /usr/src/sys/kern/vfs_syscalls.c:1278
#21 sys_openat (td=<optimized out>, uap=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:1111
#22 0xffffffff810011c0 in syscallenter (td=0xfffff80538064740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:188
#23 amd64_syscall (td=0xfffff80538064740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1194
#24 <signal handler called>
#25 0x000000082661a4aa in ?? ()
Backtrace stopped: Cannot access memory at address 0x83985aad8

fmc000 · Sep 21, 2024

Do you use NFS along with ZFS?

jb82 · Sep 21, 2024

fmc000 said:
Do you use NFS along with ZFS?

No.

Cath O'Deray · Sep 21, 2024

jb82 said:
#6 0xffffffff81000966 in trap_pfault (frame=<unavailable>, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760

Barney · Sep 22, 2024

It could be some bad memory. Vscode may just be using a lot of memory to cache, so you just happen to see it now. More likely it's a bug in ZFS. What's the hardware? Is it a dual-CPU system?

You've updated all of your packages, and then added a package with a million dependencies built against older libs. Why not try rebuilding it from ports to see if you have the same problem?

Cath O'Deray · Sep 22, 2024

jb82 said:
… Nothing in /var/crash …

Also, in the photograph <https://forums.freebsd.org/attachments/kernel-panic-1_40perc-jpg.20394/>, no dump.

sysrc dumpdev

If the result is "AUTO", then you might want to change it to "/dev/nda1p2":

- proceed with caution
- this assumes that you have GELI-encrypted freebsd-swap at nda1p2 (alongside GELI-encrypted freebsd-zfs for your nda1p3.eli).

Compare with my setup:

Code:

% lsblk /dev/ada1
DEVICE         MAJ:MIN SIZE TYPE                                    LABEL MOUNT
ada1             0:130 932G GPT                                         - -
  ada1p1         0:131 260M efi                              gpt/efiboot0 /boot/efi
  <FREE>         -:-   1.0M -                                           - -
  ada1p2         0:132  16G freebsd-swap                        gpt/swap0 SWAP
  ada1p2.eli     0:221  16G freebsd-swap                                - SWAP
  ada1p3         0:133 915G freebsd-zfs                          gpt/zfs0 <ZFS>
  ada1p3.eli     0:139 915G -                                           - -
  <FREE>         -:-   708K -                                           - -
% sysrc dumpdev
dumpdev: /dev/ada1p2
%
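Written as an /etc/rc.conf fragment, the suggestion above might look like the sketch below. The device name is an assumption: nda1 comes from the zpool status posted earlier, and p2 being the swap partition is only a guess by analogy with the ada1 layout shown here. Check it with gpart show or lsblk before using it.

```shell
# /etc/rc.conf fragment (sketch). The device name is assumed, not confirmed.
dumpdev="/dev/nda1p2"   # swap partition the kernel writes the dump to on panic
dumpdir="/var/crash"    # where savecore(8) stores vmcore.N at the next boot
```

The dumpdir line only spells out the documented default; sysrc dumpdev=/dev/nda1p2 would set the first line without editing the file by hand.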

jb82 · Sep 22, 2024

Findings:
a) Bad memory seems unlikely, since an overnight UEFI-based memory test on the ECC modules showed PASSED.
b) I have already collected other crash dumps, so the screenshot above is not the only clue.
c) It has nothing to do with pkg, vscode or whatever. I won't change the title though. It seems to be a generic issue, either HW or ZFS related.
d) At the time of taking that screenshot, I had dumpdev = "NO", which is why I took the picture. Now I have proper crash dumps.
e) The HW is a Lenovo P1 Gen 2, Intel(R) Xeon(R) E-2276M, 64 GB RAM ECC modules, NVMe SAMSUNG MZVLB1T0HBLR-000L7 5M2QEXF7 1 TB.

It's hard to reproduce but I've already had ~5 such events in the last two weeks.
