It's a bit more complicated than that.
Code:
[root@backup 10.Nov 3:04pm ~]# gpart show ada0
=>       34  500118125  ada0  GPT  (238G)
         34       1024     1  freebsd-boot  (512K)
       1058    2096222     2  freebsd-swap  (1.0G)
    2097280  165150720     3  freebsd-zfs  (79G)
  167248000   67108864     4  freebsd-zfs  (32G)
  234356864   67108864     5  freebsd-zfs  (32G)
  301465728   16777216     6  freebsd-swap  (8.0G)
  318242944  181875215        - free -  (87G)
[root@backup 10.Nov 3:04pm ~]#
ada0p3 is the OS.
ada0p4 is the zfs logs for another zfs array.
ada0p5 is the cache also for the other zfs array.
ada0p2 was the original swap, which became too small. Since I had lots of free space after the fifth partition, I simply created a second swap partition (ada0p6) and used this one instead of ada.
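As a sanity check on the layout above, here is a small sketch that confirms each partition starts exactly where the previous one ends. The function name check_layout is my own, and the start/size pairs are copied by hand from the gpart output, so treat it as an illustration rather than a tool.

```sh
#!/bin/sh
# Check that a partition layout is contiguous.
# Input lines are "start size name", one per partition, in on-disk order.
check_layout() {
    awk '
        NR > 1 && $1 != expected {
            printf "gap or overlap before %s: expected %d, got %d\n", $3, expected, $1
            bad = 1
        }
        { expected = $1 + $2 }   # first sector after this partition
        END {
            if (!bad) printf "contiguous, next free sector: %d\n", expected
            exit (bad ? 1 : 0)
        }
    '
}

# Values copied from "gpart show ada0" above.
check_layout <<'EOF'
34 1024 ada0p1
1058 2096222 ada0p2
2097280 165150720 ada0p3
167248000 67108864 ada0p4
234356864 67108864 ada0p5
301465728 16777216 ada0p6
EOF
```

For the six partitions above it reports the layout as contiguous with the next free sector at 318242944, which matches the start of the free region in the gpart output.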
Below is more information regarding the two zpools on my system:
Code:
[root@backup 10.Nov 3:10pm ~]# zpool status
pool: zdata
state: ONLINE
scan: scrub repaired 0 in 0 days 10:40:04 with 0 errors on Wed Nov 4 18:20:04 2020
config:
        NAME                   STATE     READ WRITE CKSUM
        zdata                  ONLINE       0     0     0
          raidz3-0             ONLINE       0     0     0
            gpt/data_disk10    ONLINE       0     0     0
            gpt/data_disk11    ONLINE       0     0     0
            gpt/data_disk12    ONLINE       0     0     0
            gpt/data_disk13    ONLINE       0     0     0
            gpt/data_disk14    ONLINE       0     0     0
            gpt/data_disk15    ONLINE       0     0     0
            gpt/data_disk16    ONLINE       0     0     0
            gpt/data_disk17    ONLINE       0     0     0
            gpt/data_disk18    ONLINE       0     0     0
            gpt/data_disk19    ONLINE       0     0     0
        logs
          mirror-1             ONLINE       0     0     0
            gpt/log0           ONLINE       0     0     0
            gpt/log1           ONLINE       0     0     0
        cache
          gpt/cache0           ONLINE       0     0     0
          gpt/cache1           ONLINE       0     0     0
errors: No known data errors
pool: zroot
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 384K in 0 days 00:04:33 with 0 errors on Tue Nov 10 13:40:04 2020
config:
        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     9
            gpt/disk1  ONLINE       0     0     1
errors: No known data errors
[root@backup 10.Nov 3:10pm ~]#
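To pull just the devices with non-zero error counters out of output like the above, something along these lines might help. It is only a sketch: nonzero_errors is a name I made up, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown here.

```sh
#!/bin/sh
# List vdevs with a non-zero READ, WRITE or CKSUM counter.
# Reads "zpool status" text on stdin, so it also works on saved output.
nonzero_errors() {
    awk '
        # Device rows have exactly five fields: NAME STATE READ WRITE CKSUM.
        # The header row is skipped because its counters are not numeric.
        NF == 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Sample rows copied from the zroot section above.
# On the live system this would be: zpool status zroot | nonzero_errors
nonzero_errors <<'EOF'
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/disk0 ONLINE 0 0 9
gpt/disk1 ONLINE 0 0 1
EOF
```

On the sample rows it prints gpt/disk0 and gpt/disk1 with their checksum counts (9 and 1) and nothing for the pool or the mirror.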
I've scrubbed and re-scrubbed the zroot pool and I keep getting the same warning. Running smartctl on the two mirrored disks gives the following:
Code:
[root@backup 10.Nov 3:17pm ~]# smartctl -a /dev/ada0
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Marvell based SanDisk SSDs
Device Model: SanDisk SD6SB1M256G1022I
Serial Number: 140433400574
LU WWN Device Id: 5 001b44 bbe109efe
Firmware Version: X231600
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: Unknown (0x000a)
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Nov 10 15:17:17 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 73
9 Power_On_Hours 0x0032 253 100 --- Old_age Always - 44100
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 56
166 Min_W/E_Cycle 0x0032 100 100 --- Old_age Always - 1
167 Min_Bad_Block/Die 0x0032 100 100 --- Old_age Always - 36
168 Maximum_Erase_Cycle 0x0032 100 100 --- Old_age Always - 5838
169 Total_Bad_Block 0x0032 100 100 --- Old_age Always - 346
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 69
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Avg_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 5604
174 Unexpect_Power_Loss_Ct 0x0032 100 100 --- Old_age Always - 27
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 56
194 Temperature_Celsius 0x0022 065 045 --- Old_age Always - 35 (Min/Max 24/45)
212 SATA_PHY_Error 0x0032 100 100 --- Old_age Always - 0
230 Perc_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 0 0 47696
232 Perc_Avail_Resrvd_Space 0x0033 100 100 004 Pre-fail Always - 99
233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 1462359
241 Total_Writes_GiB 0x0030 253 253 --- Old_age Offline - 467347
242 Total_Reads_GiB 0x0030 253 253 --- Old_age Offline - 15118
243 Unknown_Marvell_Attr 0x0032 100 100 --- Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 53 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 53 occurred at disk power-on lifetime: 44099 hours (1837 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 40 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 18 50 74 80 40 08 00:00:00.000 READ FPDMA QUEUED
Error 52 occurred at disk power-on lifetime: 44099 hours (1837 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 40 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 18 50 74 80 40 08 00:00:00.000 READ FPDMA QUEUED
Error 51 occurred at disk power-on lifetime: 44099 hours (1837 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 40 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 50 75 80 40 08 00:00:00.000 READ FPDMA QUEUED
Error 50 occurred at disk power-on lifetime: 44099 hours (1837 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 40 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
2f 00 01 10 00 00 00 08 00:00:00.000 READ LOG EXT
Error 49 occurred at disk power-on lifetime: 44099 hours (1837 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 40 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 78 71 80 40 08 00:00:00.000 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 34559 -
# 2 Short offline Completed without error 00% 34535 -
# 3 Short offline Completed without error 00% 34511 -
# 4 Short offline Completed without error 00% 34487 -
# 5 Short offline Completed without error 00% 34463 -
# 6 Short offline Completed without error 00% 34439 -
# 7 Short offline Completed without error 00% 34415 -
# 8 Short offline Completed without error 00% 34391 -
# 9 Short offline Completed without error 00% 34367 -
#10 Short offline Completed without error 00% 34343 -
#11 Short offline Completed without error 00% 29964 -
#12 Short offline Completed without error 00% 29940 -
#13 Short offline Completed without error 00% 29916 -
#14 Short offline Completed without error 00% 29892 -
#15 Short offline Completed without error 00% 29868 -
#16 Short offline Completed without error 00% 29844 -
#17 Short offline Completed without error 00% 29820 -
#18 Short offline Completed without error 00% 29796 -
#19 Short offline Completed without error 00% 29772 -
#20 Short offline Completed without error 00% 29748 -
#21 Short offline Completed without error 00% 29724 -
Selective Self-tests/Logging not supported
[root@backup 10.Nov 3:17pm ~]#
Code:
[root@backup 10.Nov 3:18pm ~]# smartctl -a /dev/ada1
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Marvell based SanDisk SSDs
Device Model: SanDisk SD6SB1M256G1022I
Serial Number: 140433401574
LU WWN Device Id: 5 001b44 bbe10a2e6
Firmware Version: X231600
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: Unknown (0x000a)
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Nov 10 15:19:07 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 81
9 Power_On_Hours 0x0032 253 100 --- Old_age Always - 44101
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 56
166 Min_W/E_Cycle 0x0032 100 100 --- Old_age Always - 1
167 Min_Bad_Block/Die 0x0032 100 100 --- Old_age Always - 30
168 Maximum_Erase_Cycle 0x0032 100 100 --- Old_age Always - 5831
169 Total_Bad_Block 0x0032 100 100 --- Old_age Always - 384
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 81
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Avg_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 5592
174 Unexpect_Power_Loss_Ct 0x0032 100 100 --- Old_age Always - 27
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 065 044 --- Old_age Always - 35 (Min/Max 25/44)
212 SATA_PHY_Error 0x0032 100 100 --- Old_age Always - 0
230 Perc_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 0 0 47656
232 Perc_Avail_Resrvd_Space 0x0033 100 100 004 Pre-fail Always - 99
233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 1457799
241 Total_Writes_GiB 0x0030 253 253 --- Old_age Offline - 467334
242 Total_Reads_GiB 0x0030 253 253 --- Old_age Offline - 15856
243 Unknown_Marvell_Attr 0x0032 100 100 --- Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 34559 -
# 2 Short offline Completed without error 00% 34535 -
# 3 Short offline Completed without error 00% 34511 -
# 4 Short offline Completed without error 00% 34487 -
# 5 Short offline Completed without error 00% 34463 -
# 6 Short offline Completed without error 00% 34439 -
# 7 Short offline Completed without error 00% 34415 -
# 8 Short offline Completed without error 00% 34391 -
# 9 Short offline Completed without error 00% 34367 -
#10 Short offline Completed without error 00% 34343 -
#11 Short offline Completed without error 00% 29964 -
#12 Short offline Completed without error 00% 29940 -
#13 Short offline Completed without error 00% 29916 -
#14 Short offline Completed without error 00% 29892 -
#15 Short offline Completed without error 00% 29868 -
#16 Short offline Completed without error 00% 29844 -
#17 Short offline Completed without error 00% 29820 -
#18 Short offline Completed without error 00% 29796 -
#19 Short offline Completed without error 00% 29772 -
#20 Short offline Completed without error 00% 29748 -
#21 Short offline Completed without error 00% 29724 -
Selective Self-tests/Logging not supported
[root@backup 10.Nov 3:19pm ~]#
The output above leads me to believe there is a major issue with /dev/ada0.
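The difference between the two drives is easier to see when a single attribute is pulled out of each report. A minimal sketch, assuming the ten-column attribute table smartctl prints above (smart_raw is a helper name of my own):

```sh
#!/bin/sh
# Print the RAW_VALUE (10th column) of one SMART attribute.
# Reads smartctl attribute rows on stdin.
smart_raw() {
    # $1: attribute name exactly as smartctl prints it
    awk -v name="$1" '$2 == name { print $10 }'
}

# Rows copied from the two reports above.
# Live equivalent: smartctl -A /dev/ada0 | smart_raw Reported_Uncorrect
ada0_row='187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 56'
ada1_row='187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0'

printf 'ada0 Reported_Uncorrect: %s\n' "$(printf '%s\n' "$ada0_row" | smart_raw Reported_Uncorrect)"
printf 'ada1 Reported_Uncorrect: %s\n' "$(printf '%s\n' "$ada1_row" | smart_raw Reported_Uncorrect)"
```

ada0 reports 56 uncorrectable errors against 0 on ada1, which lines up with ada0 being the only drive with entries in its ATA error log.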
Assuming that the /dev/ada0 drive needs to be replaced, I am unsure as to how to recreate the logs mirror and the cache.
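My best guess so far for the log and cache part is below, written as a dry run that only prints the commands. It is unverified and rests on an assumption I still need to confirm with gpart show -l: that gpt/log0 and gpt/cache0 are the partitions on ada0, and gpt/log1 and gpt/cache1 are on the other SSD. The helper names run, release_aux and restore_aux are my own.

```sh
#!/bin/sh
# Dry run: prints the zpool commands instead of executing them.
run() { echo "+ $*"; }

# Before pulling the drive: take ada0's log and cache partitions out of zdata.
# ASSUMPTION: gpt/log0 and gpt/cache0 are the labels that live on ada0.
release_aux() {
    run zpool detach zdata gpt/log0     # log mirror drops to gpt/log1 alone
    run zpool remove zdata gpt/cache0   # cache devices are removed, not detached
}

# After the new drive is partitioned and labelled the same way:
restore_aux() {
    run zpool attach zdata gpt/log1 gpt/log0   # re-form the log mirror
    run zpool add zdata cache gpt/cache0       # add the cache partition back
}

release_aux
restore_aux
```

Replacing the body of run with "$@" would execute the commands for real, which I would only do after checking which labels are actually on the failing drive.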
In the back of the 2U server chassis are two slots for the SSD drives. I aim to take the failing drive out and put in a new replacement drive. I wish to avoid sliding the server out of the rack but I will if there are no other more effective ways of dealing with this.
Assuming we are not sliding the server out, below is a list of steps I believe are needed in order to replace /dev/ada0:
Code:
# zpool detach zroot gpt/disk0
<take out failing drive and insert replacement drive>
# camcontrol devlist -v (verify device name and /dev designation)
# gpart create -s GPT ada0
# gpart add -b 34 -s 512k -t freebsd-boot -i 1 -l zfsboot0 ada0
# gpart add -b 1058 -s 2096222 -t freebsd-swap -i 2 -l swap0 ada0
# gpart add -b 2097280 -s 165150720 -t freebsd-zfs -i 3 -l disk0 ada0
# gpart add -b 167248000 -s 67108864 -t freebsd-zfs -i 4 -l log0 ada0
# gpart add -b 234356864 -s 67108864 -t freebsd-zfs -i 5 -l cache0 ada0
# gpart add -b 301465728 -s 16777216 -t freebsd-swap -i 6 -l swap20 ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
I am unsure what the next step should be. I've seen some postings that say to use zpool attach zroot /dev/ada0p3, but that doesn't address re-creating the log0 mirror and the cache0 device. Should I use zpool replace zroot ada0 instead? Section 19.3.5 of the FreeBSD documentation on ZFS (https://www.freebsd.org/doc/handbook/zfs-zpool.html) seems to indicate the use of the replace subcommand. The ZFS mirror has not yet entered a degraded state, so this should be considered a functioning mirror, yes?
Will the above method work? Or would it be easier to simply add the replacement drive to the mirror, install the bootcode, let it resilver, detach the failing drive, and then shut down the server and move the replacement drive into the correct SSD slot at the back of the server?
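To make the second option concrete, this is roughly the order I have in mind, again as a dry run that only prints the commands. The names ada2 and gpt/disk2 are hypothetical placeholders for the replacement drive, and I am assuming it has already been partitioned and that gpt/disk0 is the label on the failing drive.

```sh
#!/bin/sh
# Dry run of the "attach first, detach later" idea: commands are only printed.
# NEW_DISK and NEW_LABEL are hypothetical names for the replacement drive.
NEW_DISK=ada2
NEW_LABEL=gpt/disk2

run() { echo "+ $*"; }

attach_then_detach() {
    # Make the new drive bootable, then add it as a third side of the mirror.
    run gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 "$NEW_DISK"
    run zpool attach zroot gpt/disk1 "$NEW_LABEL"
    # Check that the resilver has finished before going any further.
    run zpool status zroot
    # Only then drop the failing side (ASSUMPTION: gpt/disk0 is on ada0).
    run zpool detach zroot gpt/disk0
}

attach_then_detach
```

The point of this ordering is that zroot keeps two working sides for the whole time the new drive is resilvering.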
Thanks in advance for any advice you may offer.
~Doug