System has started to hang after a few minutes

This morning my system, a ThinkPad W520 running FreeBSD 15.0-RELEASE, hung and could only be shut down by powering it off. Since then it seems to hang after a few minutes.

What can I do to identify the cause? Presumably dmesg will show no indication that something has triggered the hang.

sshd stops responding too, so remote access does not work.
 
Anything in /var/log/messages?
If you have another computer that can accept remote syslog, redirecting syslog to that computer might help.
 
Does the freeze happen while Xorg is running, or also on the system console / virtual terminal without Xorg running?

If it happens while Xorg is running, is a drm-kmod video driver loaded? If so, which one?
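
One quick way to answer the second question, as a sketch: filter the kldstat output for the usual graphics module names. The list of names below (i915kms, amdgpu and radeonkms from drm-kmod, plus the nvidia modules) is an assumption about what might be present on this machine, and drm_modules is just a helper name made up for illustration.

```sh
# Hypothetical helper: read kldstat-style output on stdin and print
# any graphics-related kernel modules found in it, one per line.
# The module list is an assumption; extend it as needed.
drm_modules() {
    grep -Eo '(i915kms|amdgpu|radeonkms|nvidia-drm|nvidia-modeset|nvidia)\.ko' | sort -u
}
```

Run it as kldstat | drm_modules. Empty output means none of the listed modules is loaded.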
 
The hang seemed to occur when doing something in Chrome, so I thought I would try to re-install it.

In the process of downloading it I get this message:-

ada1: <SSD 4TB VE0R5305> s/n 0022921 detached
Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended.W

What to do?
 
This points to a disk (SSD) failure, but the cause is not always the drive itself. Cables (or riser cards), connectors, and power supply lines can also be at fault.

The first things to check would be:
  • If you have sysutils/smartmontools installed, what does smartctl -a /dev/ada1 | fgrep Written show? Has the drive exceeded its warranted TBW?
  • Is the failing SSD (ada1) physically connected properly?
  • If the SSD and its physical connections look OK, is your main memory sound?
These are what I would check first if it happened to me.
If all of these physical things are OK, we'll need ZFS experts.
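
To see how often the disk has dropped off the bus, the kernel messages can be filtered for detach events like the one quoted earlier in the thread. A minimal sketch; detach_events is a hypothetical helper name, and the pattern only covers ada/da/nda device names.

```sh
# Hypothetical helper: print kernel log lines on stdin that report a
# disk (adaN, daN, ndaN) being detached.
detach_events() {
    grep -E '(ada|da|nda)[0-9]+: .*detached'
}
```

Run it as detach_events < /var/log/messages or dmesg | detach_events. If the log file lives on the pool that was suspended, the last events before the hang may be missing. After a reboot, zpool status -v zroot shows the pool state and any files with known errors.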
 
The SSD (4TB) is brand new, though it could be a dodgy brand; it had been working fine up until today.

Does ZFS need some special parameters with SSD?

The command you gave me shows:-

241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 189

That doesn't mean anything to me, maybe to someone else...

BTW I was able to reboot and install smartmontools.
 
In my case (NVMe WD Black SN850X 4000GB, which I started using this year):
Code:
# smartctl -a /dev/nvme0 | fgrep Written
Data Units Written:                 34,336,971 [17.5 TB]

As mine is NVMe, here is the result from nvmecontrol(8):
Code:
# nvmecontrol logpage -p 2 nvme0 | fgrep unit
Data units (512,000 byte) read: 5598740
Data units written:             34338652

Both were run as root. The latter needs a manual calculation, which comes to the same result.
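
The manual calculation, as a small sketch: nvmecontrol reports the unit size as 512,000 bytes, so the total is units × 512,000. units_to_tb is a hypothetical helper name; it uses decimal terabytes (10^12 bytes) and truncates to one decimal place, which is an assumption chosen to match the 17.5 TB that smartctl prints.

```sh
# Hypothetical helper: convert NVMe "data units" (512,000 bytes each,
# as printed by nvmecontrol) to decimal terabytes, truncated to one
# decimal place.
units_to_tb() {
    tenths=$(( $1 * 512000 / 100000000000 ))   # tenths of a TB
    printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

units_to_tb 34338652   # the "Data units written" value; prints 17.5
```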

Your result looks quite weird.
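
One possible reason it looks odd: despite its name, attribute 241 is not reported in 512-byte LBAs by every drive. Some controllers count in larger units such as 32 MiB or 1 GiB, and for an unknown brand the unit is a guess. The sketch below only shows what a raw value of 189 would mean under each of those three assumptions; lbas_written_guess is a made-up helper name, and none of the units is confirmed for this SSD.

```sh
# Hypothetical helper: interpret a Total_LBAs_Written raw value under
# three ASSUMED unit sizes. Which one (if any) applies to a given
# drive is vendor-specific.
lbas_written_guess() {
    raw=$1
    printf 'if 512-byte LBAs: %d KiB\n' $(( raw * 512 / 1024 ))
    printf 'if 32 MiB units:  %d MiB\n' $(( raw * 32 ))
    printf 'if 1 GiB units:   %d GiB\n' "$raw"
}

lbas_written_guess 189
```

For a disk that holds a running system, the first reading (under 100 KiB written in total) is implausible, so one of the larger units seems more likely.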
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 050 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 298
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 31
161 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 0
162 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 4936
163 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 3000
164 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 23
166 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 60
167 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 100
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 18
175 Program_Fail_Count_Chip 0x0032 100 100 000 Old_age Always - 0
181 Program_Fail_Cnt_Total 0x0022 100 100 000 Old_age Always - 28871
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 40
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
206 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 1
207 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 59
232 Available_Reservd_Space 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 189
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 36
249 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 2404
250 Read_Error_Retry_Rate 0x0032 100 100 000 Old_age Always - 192

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
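
To pick out the attributes that relate to the checks suggested earlier in the thread (drive faults versus link or cable faults), the table can be filtered with awk. This is a sketch: key_attrs is a hypothetical helper, the choice of attributes is an assumption, and raw values on a drive that smartctl does not recognize are vendor-specific, so they are hints at best.

```sh
# Hypothetical helper: from `smartctl -A` style output on stdin, print
# the name and raw value (10th column) of a few attributes commonly
# looked at when separating drive faults from cabling faults.
key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|UDMA_CRC_Error_Count|Program_Fail_Cnt_Total)$/ { print $2, $10 }'
}
```

Run it as smartctl -A /dev/ada1 | key_attrs. For the listing above it gives 0 for Reallocated_Sector_Ct, Reported_Uncorrect and UDMA_CRC_Error_Count, and 28871 for Program_Fail_Cnt_Total. A zero UDMA_CRC_Error_Count is usually read as no CRC errors seen on the SATA link, though it does not rule out a power or connector problem.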
 