Hi, something interesting from today:
When I started my machine, I could see a bunch of these errors:
Code:
kernel: (ada0:ahcich4:0:0:0): READ_FPDMA_QUEUED ACB: 60 08 48 e4 b3 40 06 00 00 00 00 00
kernel: (ada0:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
kernel: (ada0:ahcich4:0:0:0): Retrying command, 3 more tries remain
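For anyone who wants to keep an eye on such messages, here is a minimal sketch that counts the "CAM status" lines per device. It assumes the log lines look exactly like the three above; the function name `count_cam_errors` and the regular expression are my own illustration, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: (ada0:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error
# Assumption: the (device:channel:numbers) prefix always has this shape.
CAM_STATUS = re.compile(r"\((\w+):(\w+):[\d:]+\): CAM status: (.+)$")


def count_cam_errors(lines):
    """Count CAM status messages, keyed by (device, channel, status text)."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line.rstrip())
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: (ada0:ahcich4:0:0:0): READ_FPDMA_QUEUED ACB: 60 08 48 e4 b3 40 06 00 00 00 00 00",
        "kernel: (ada0:ahcich4:0:0:0): CAM status: Uncorrectable parity/CRC error",
        "kernel: (ada0:ahcich4:0:0:0): Retrying command, 3 more tries remain",
    ]
    for (device, channel, status), n in count_cam_errors(sample).items():
        print(f"{device} on {channel}: {n} x {status}")
```

Feeding it the contents of the system log instead of the sample list would show whether the CRC errors cluster on one channel, which is what one would expect from a bad cable contact.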
Nothing bad happened; the machine appeared to work correctly. But looking closer, I found that the read throughput from the affected SSD was ~100MBps (instead of ~450). So something was wrong.
Unplugging and replugging the SATA cable solved that issue (and it's a good cable, with a clip). So it now seems almost certain that these cables have a temperature problem. It has only been a few weeks since I put the server mainboard into the desktop case and plugged in the cables, but that board does get very hot: I hit coretemp.throttle when using more than 15 cores. (Normal desktop airflow is certainly not ideal for a 10/20 core CPU at this time of the year.)
Then I tried again to read the SSD, as a raw device with dd. While the maximum throughput was okay now, I observed something else (using gstat -po): repeatedly, the throughput would drop to only 2MBps and the ms/r value would rise to almost exactly 100ms. This was happening on reads (and we do not expect an SSD to be slow on reads). Something was going on internally, and the drive appeared to have difficulty retrieving its data (no clue to be found in the SMART data). Overall, this reduced the throughput to some 330MBps.
Finally I managed to do a factory reset (aka "secure erase"). This was not so easy, because the server board does not support S3 suspend, but does put all SATA drives into security freeze, unconditionally, so hot-plugging the power is required (I wonder how that would be managed remotely; but then, it's only a cheap server board).
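The read stalls described above (throughput dropping while ms/r jumps to about 100ms) can also be hunted for without watching gstat. The following is only a rough sketch under my own assumptions: it times sequential chunk reads and reports the slow ones. The function name `find_slow_reads`, the 1 MiB chunk size and the 50 ms threshold are hypothetical choices, and the latency seen here is per chunk at application level, not the per-request value that gstat shows.

```python
import sys
import time


def find_slow_reads(path, chunk_size=1 << 20, threshold_ms=50.0, max_chunks=None):
    """Read `path` sequentially; return (offset, milliseconds) for slow chunks.

    chunk_size should be a multiple of the sector size when `path` is a
    raw disk device.
    """
    slow = []
    offset = 0
    chunks = 0
    # buffering=0 avoids Python's own read-ahead buffer, so each read()
    # call corresponds to one request to the operating system.
    with open(path, "rb", buffering=0) as dev:
        while max_chunks is None or chunks < max_chunks:
            start = time.monotonic()
            data = dev.read(chunk_size)
            elapsed_ms = (time.monotonic() - start) * 1000.0
            if not data:
                break  # end of device or file
            if elapsed_ms >= threshold_ms:
                slow.append((offset, elapsed_ms))
            offset += len(data)
            chunks += 1
    return slow


if __name__ == "__main__":
    if len(sys.argv) == 2:
        for offset, ms in find_slow_reads(sys.argv[1]):
            print(f"offset {offset}: {ms:.1f} ms")
    else:
        print("usage: pass the path of a device or file to read")
```

Run against the raw device (as root), a healthy SSD should report few or no slow chunks. On a regular file the filesystem cache will hide most of the effect, so the raw device is the more meaningful target.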
Result, after restoring all data: the spookiness is gone, and there is a sustained read throughput of 500MBps again (minus powerd, minus cx_lowest C3, minus ibrs_disable, minus cpu allocation, which gives 430).
Some hard data on the effect of this: I have my basic OS filesystems on plain UFS, and everything else in mirrored ZFS. The OS filesystems get copied to the second disk during shutdown. (So that there is always something that can be booted to a working OS, in the simplest way.)
Here are recent timings of this copy:
Code:
Jul 7 DUMP: Dumping snapshot of /dev/ada0p3 (/) to standard output
Jul 7 DUMP: finished in 18 seconds, throughput 26678 KBytes/sec
Jul 7 DUMP: Dumping snapshot of /dev/ada0p4 (/usr) to standard output
Jul 7 DUMP: finished in 29 seconds, throughput 22468 KBytes/sec
Jul 13 DUMP: Dumping snapshot of /dev/ada0p3 (/) to standard output
Jul 13 DUMP: finished in 14 seconds, throughput 34390 KBytes/sec
Jul 13 DUMP: Dumping snapshot of /dev/ada0p4 (/usr) to standard output
Jul 13 DUMP: finished in 28 seconds, throughput 23270 KBytes/sec
Jul 23 DUMP: Dumping snapshot of /dev/ada0p3 (/) to standard output
Jul 23 DUMP: finished in 18 seconds, throughput 26748 KBytes/sec
Jul 23 DUMP: Dumping snapshot of /dev/ada0p4 (/usr) to standard output
Jul 23 DUMP: finished in 28 seconds, throughput 23270 KBytes/sec
And this is how it looks now, after the factory reset of the device being read(!):
Code:
Jul 24 DUMP: Dumping snapshot of /dev/ada0p3 (/) to standard output
Jul 24 DUMP: finished in 14 seconds, throughput 34390 KBytes/sec
Jul 24 DUMP: Dumping snapshot of /dev/ada0p4 (/usr) to standard output
Jul 24 DUMP: finished in 18 seconds, throughput 36198 KBytes/sec
Jul 24 DUMP: Dumping snapshot of /dev/ada0p3 (/) to standard output
Jul 24 DUMP: finished in 14 seconds, throughput 34390 KBytes/sec
Jul 24 DUMP: Dumping snapshot of /dev/ada0p4 (/usr) to standard output
Jul 24 DUMP: finished in 22 seconds, throughput 29617 KBytes/sec
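To put a number on the difference, the two sets of timings can be averaged per filesystem. This is a small sketch that parses the DUMP lines quoted above. The function name `mean_throughput` and the two regular expressions are my own, and the date prefixes are left out because they are not needed for the arithmetic.

```python
import re
from collections import defaultdict
from statistics import mean

START = re.compile(r"Dumping snapshot of \S+ \((\S+)\)")
DONE = re.compile(r"finished in (\d+) seconds, throughput (\d+) KBytes/sec")


def mean_throughput(lines):
    """Return {mount point: mean KBytes/sec} from dump log lines."""
    rates = defaultdict(list)
    current = None
    for line in lines:
        started = START.search(line)
        if started:
            current = started.group(1)  # remember which filesystem is running
            continue
        done = DONE.search(line)
        if done and current is not None:
            rates[current].append(int(done.group(2)))
            current = None
    return {fs: mean(values) for fs, values in rates.items()}


def dump_lines(results):
    """Rebuild log lines from (partition, mount point, seconds, rate) tuples."""
    lines = []
    for part, fs, secs, rate in results:
        lines.append(f"DUMP: Dumping snapshot of /dev/{part} ({fs}) to standard output")
        lines.append(f"DUMP: finished in {secs} seconds, throughput {rate} KBytes/sec")
    return lines


# Values copied from the logs above (Jul 7, 13, 23 before; Jul 24 after).
BEFORE = dump_lines([
    ("ada0p3", "/", 18, 26678), ("ada0p4", "/usr", 29, 22468),
    ("ada0p3", "/", 14, 34390), ("ada0p4", "/usr", 28, 23270),
    ("ada0p3", "/", 18, 26748), ("ada0p4", "/usr", 28, 23270),
])
AFTER = dump_lines([
    ("ada0p3", "/", 14, 34390), ("ada0p4", "/usr", 18, 36198),
    ("ada0p3", "/", 14, 34390), ("ada0p4", "/usr", 22, 29617),
])

if __name__ == "__main__":
    before, after = mean_throughput(BEFORE), mean_throughput(AFTER)
    for fs in before:
        gain = (after[fs] / before[fs] - 1) * 100
        print(f"{fs}: {before[fs]:.0f} -> {after[fs]:.0f} KBytes/sec ({gain:+.0f}%)")
```

With these figures, /usr goes from roughly 23,000 to roughly 32,900 KBytes/sec on average, a gain of about 43%. For / the picture is less clear-cut, since the Jul 13 run already reached the same 34390 KBytes/sec as both runs after the reset.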
Even with all TRIM features active (ZFS and UFS), there is a significant improvement here.
The current SMART data, for those interested:
Code:
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37240G
Firmware Version: S1Z40102
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     -O--CK   100   100   000    -    100
  9 Power_On_Hours          -O--CK   100   100   000    -    13509
 12 Power_Cycle_Count       -O--CK   100   100   000    -    694
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    1
169 Bad_Block_Rate          ------   100   100   000    -    0
170 Bad_Blk_Ct_Erl/Lat      ------   100   100   010    -    0/0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    0
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    298
194 Temperature_Celsius     -O---K   040   064   000    -    40 (Min/Max 23/64)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
218 CRC_Error_Count         -O--CK   100   100   000    -    1
231 SSD_Life_Left           ------   075   075   000    -    75
233 Flash_Writes_GiB        -O--CK   100   100   000    -    34191
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    33290
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    23140
244 Average_Erase_Count     ------   100   100   000    -    259
245 Max_Erase_Count         ------   100   100   000    -    306
246 Total_Erase_Count       ------   100   100   000    -    110915