How can hardware be killed by a test that does not write, if there is not a latent hardware fault prior to reading?
Because it's not hardware that got killed.
The hardware is perfectly fine, but it's bricked.
Step 1:
I created a script that runs an extended selftest every few weeks, and put the script into
/etc/periodic/daily
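A minimal sketch of such a periodic job might look like this. The device list, state-file path, and round-robin scheme are my assumptions for illustration, not the author's exact script:

```shell
#!/bin/sh
# Hypothetical periodic(8) job: start a SMART extended selftest on one
# disk per run, round-robin, so each disk gets tested every few weeks.

SMARTCTL=${SMARTCTL:-smartctl}               # from sysutils/smartmontools
DISKS=${DISKS:-"ada0 ada1"}                  # disks to cycle through (example names)
STATE=${STATE:-/var/db/smart_selftest.idx}   # remembers whose turn it is

run_one_selftest() {
    idx=$(cat "$STATE" 2>/dev/null)
    idx=${idx:-0}
    set -- $DISKS
    n=$#
    shift $((idx % n))
    disk=$1
    echo $(( (idx + 1) % n )) > "$STATE"
    # -t long starts the extended selftest in the drive's own background
    "$SMARTCTL" -t long "/dev/$disk"
}

# Dry run: substitute echo for smartctl so no real disk is touched.
SMARTCTL=echo
STATE=$(mktemp)
run_one_selftest
```

Dropped into /etc/periodic/daily, this fires once per daily run (around 3AM by default), which matches the timing described below.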
Step 2:
At 3AM, as expected, the script grabbed one or two disks and started the selftest.
Step 3:
The first SSD was dead.
Step 4:
I replaced it with a different brand.
Step 5:
The moment the selftest started on the new SSD, it died as well (replaced under warranty).
That was enough for me to start researching. The first victim was a Kingston A400.
smartctl says it is Phison-based. It is not: at that time it was a rebranded Silicon Motion SM2258XT. The second was an HP S700, which resources on the web also identify as an SM2258XT. That explains things.
What happened: that machine runs about ten nodes. At 3AM, hell breaks loose: there are lots of
find(1) jobs running, lots of database VACUUMs running, etc. The devices get hammered with I/O.
Whatever the extended selftest is supposed to do on an SSD, apparently nobody had tested it under such conditions.
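In hindsight, the kick-off script could refuse to start a selftest while the disk is under load. A hedged sketch, assuming FreeBSD's gstat for the busy metric; the column parsing and the 10% threshold are my guesses, not anything the author ran:

```shell
#!/bin/sh
# Guard sketch: only start the extended selftest when the target disk
# looks idle, to avoid the selftest-under-load collision described above.

SMARTCTL=${SMARTCTL:-smartctl}
THRESHOLD=${THRESHOLD:-10}    # max %busy at which we still start the test

# %busy for one device from a single gstat batch sample
# (assumption: device name is the last field, %busy the one before it)
get_busy() {
    gstat -b -I 1s | awk -v d="$1" '$NF == d { print int($(NF-1)) }'
}

maybe_selftest() {
    disk=$1
    busy=$(get_busy "$disk")
    busy=${busy:-0}
    if [ "$busy" -le "$THRESHOLD" ]; then
        echo "starting extended selftest on /dev/$disk"
        "$SMARTCTL" -t long "/dev/$disk"
    else
        echo "skipping /dev/$disk, ${busy}% busy"
    fi
}

# Dry run with stubbed-out pieces so nothing real is touched:
get_busy() { echo 3; }    # pretend the disk is 3% busy
SMARTCTL=echo
maybe_selftest ada0
```

Of course, this only papers over the real problem: firmware that corrupts its own flash-resident state when a selftest races foreground I/O.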
These SSDs have only one kind of persistent memory: the flash cells. The configuration data (what kind of flash memory is installed, how it is to be treated, how it is mapped) is stored inside the flash cells. The running organizational data, the so-called flash translation layer (FTL), is also inside the flash cells. All of this together is one big messy mesh.
I opened the device, shorted the factory-mode-enable pins, and indeed it then reports itself as some cryptic device with 64 MB of inaccessible storage space. At that point one could download a new configuration into it.
Further details are at usbdev.ru. That one is the real freakshow, because there is big money to be made with this: there are people who suffer such failures and do not have a backup. They want their data back, and they are willing to pay.
I for my part do have backups, but I would love to get that piece back into working order, if only for the sport of it in my zoo of machines.
But then, reaching a configuration that allows one to read out the stored payload data and try to reconstruct it is one thing; reaching a fresh configuration that would work reliably for continued operation is quite another.