NVMe Autonomous Power State Transition (APST)

So I'm playing around with APST (I had a slow day, thankfully). I noticed that when I'm using Debian, the power states shift around depending on what the drive (a 1TB EX900 Plus M.2) needs to do. The NVMe has 5 power states (0-4), and in Debian it typically sits in power state 4 (0.0090W). It appears that APST is well supported in Debian.

Now for FreeBSD 13.0: it appears that APST is not functional here, at least by default. The NVMe starts off in power state 0 (3.000W) and remains there all the time.

I am able to use the command 'nvmecontrol power -p 4 nvme0' to force the NVMe into power state 4, and it will remain there until the drive is asked to do some real work; then it is back in power state 0 and stays there, never dropping to a lower power state automatically.
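
(In case it helps anyone reading along, this is roughly how I was setting and checking the state; per nvmecontrol(8) the bare power subcommand should print the current state and -l should list the supported ones, though double-check that on your version.)
Code:
# nvmecontrol power -p 4 nvme0
# nvmecontrol power nvme0
# nvmecontrol power -l nvme0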

After searching the internet for half a day I have been unable to find any real mention of APST with FreeBSD. The few things I could find were very vague and suggested the AHCI driver could be the issue, but that was mostly a discussion about AHCI that happened to mention APST.

Here comes the question, and I'm sorry I'm even asking it, but I've exhausted my brain and my internet searching:
Is there any way to get FreeBSD 13.0 to make NVMe APST work properly and automatically change the power state?

Thanks for any advice.
 
Same problem here: I've got a WD SN570 and it's acting like a mini heater in the case.

It idles at almost 70C. I discovered it can be manually changed to state 2, but state 2 still has high idle power; I think it just throttles power under load.

My drive won't change to state 4 at all; if I run the command and then check its state right after, it's back to 0.
 
The problem is that as soon as you run the command to place your NVMe into a lower power state, it has probably been asked to do something, which brings it back up. I have a script called Multi-Report for TrueNAS (it could be used outside TrueNAS as well) that has an option to set a lower power level; sometimes it works fine, sometimes the power level jumps right back up.
 
Yeah, the problem is that there is no APST support in FreeBSD yet. I was bothered by it too: half the idle battery life compared to Linux, almost 6W of power draw instead of 3W, and about 50 degrees Celsius on the controller. Recently I gave up waiting and finally wrote a patch; you can find it here: PR 281643. I've tested it for a couple of days on my laptop, and it's been a relief :) There isn't a lot of code; it basically comes down to filling a table with certain values (the idle time to wait before switching and the state to switch to), so I believe it should work fine.

The only thing you should know is that your problem might not be with APST alone, but also with some of the non-operational power states. In my case it was the highest two (4-5), which strikingly consume much more power than even state 0! I found this with APST turned off (nvmecontrol admin-passthru -o 0x9 --cdw10 0xC --cdw11 0x0 nvme0), and you can check it too without recompiling the kernel.

In the patch I've added a sysctl (apst_max_latency) to limit the controller to the normally functioning states 0-3. As I said, this saves me 3W of the total 6W idle power consumption, nearly doubling the battery life and reducing the temperature by 20C. Of course, it's still worth fixing the actual problem (i.e. the highest NOPS), but it shouldn't be as much of an issue anymore.
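
If anyone wants to repeat that check on their own drive, a rough recipe (nvme0 and the grep are just examples) is to turn APST off with the command above, force each non-operational state in turn, and watch the health-log temperature or a wall-power meter between steps:
Code:
# nvmecontrol power -p 3 nvme0
# nvmecontrol logpage -p 2 nvme0 | grep -i temperature
# nvmecontrol power -p 4 nvme0
# nvmecontrol logpage -p 2 nvme0 | grep -i temperature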
 
eseipi Thank you for the information. I will give it a try and see how it works on two different systems. The only reason I like NVMe is the silent operation; now if I can also force the power levels, that would be great. It would be nice to have a breakdown of each item in the command, if you have it; otherwise I will search the internet for it.
 
I think you cannot force the controller to stay in a NOPS while using the device, regardless of APST. I wasn't clear enough, sorry. All I'm saying is that some of the non-operational power states may also work incorrectly, as in my case. You can check this without recompiling the kernel (by booting from a live USB, for example), but the command I provided obviously won't help in your case because, as you said, your main problem is APST being disabled by default. The patch should still do exactly what you need, though.

But of course I can describe the command; maybe it will help someone. So, according to nvmecontrol(8) and the NVMe specification (which is freely available):
- admin-passthru sends commands to the Admin Submission Queue of the controller;
- -o 9 is the opcode for the Set Features command (10 is Get Features);
- --cdw10 12 is the Feature Identifier corresponding to APST, passed in the Dword 10 field of the command;
- --cdw11 0 is the feature-specific value passed in Dword 11, meaning we want to disable APST (1 enables it *back*, which seems to work only because the controller already has the data it needs for state transitions).
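
Putting that together (nvme0 is just a stand-in device name), the disable/enable pair looks like this, and per the spec, bit 0 of the DWORD0 value returned by the Get Features variant (opcode 10) should tell you whether APST is currently enabled:
Code:
# nvmecontrol admin-passthru -o 0x9 --cdw10 0xC --cdw11 0x0 nvme0
# nvmecontrol admin-passthru -o 0x9 --cdw10 0xC --cdw11 0x1 nvme0
# nvmecontrol admin-passthru -o 0xA --cdw10 0xC --data-len 256 --read nvme0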
 
Thanks for the breakdown; it did save me some time, but I still need to look at it all. I of course have the NVMe specifications, however I have only used the few commands I needed. I'm a firm believer in not messing with something you don't fully understand. I do have one NVMe I can test on in my FreeBSD/TrueNAS server. I will measure my power consumption to see what values actually save me power. It is crazy that your system pulled more power while the NVMe was in a lower-power mode; bet you thought you were reading things wrong at first. Hopefully I will be able to see the differences in current draw. I have a system with 6 NVMe drives; I can toss FreeBSD on that and test, which would show the power consumption better.
 
Thank you, that is interesting work indeed. But it seems NVMe devices can differ individually. Mine says this:
Code:
Power States Supported: 5

 #   Max pwr  Enter Lat  Exit Lat RT RL WT WL Idle Pwr  Act Pwr Workload
--  --------  --------- --------- -- -- -- -- -------- -------- --
 0:  9.0000W    0.000ms   0.000ms  0  0  0  0  0.0000W  0.0000W 0
 1:  4.6000W    0.000ms   0.000ms  1  1  1  1  0.0000W  0.0000W 0
 2:  3.8000W    0.000ms   0.000ms  2  2  2  2  0.0000W  0.0000W 0
 3:  0.0450W*   2.000ms   2.000ms  3  3  3  3  0.0000W  0.0000W 0
 4:  0.0040W*  15.000ms  15.000ms  4  4  4  4  0.0000W  0.0000W 0

But switching it from 1 to 3 does not make a measurable difference in idle temperature or power intake.
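
For reference, that table is the power-state section of the controller identify data; something like the following should print it (nvme0 being the example device here):
Code:
# nvmecontrol identify nvme0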
 
IIRC the problem with APST is that it is yet another one of those "non-standards" - i.e. it's just a collection of suggestions, which vendors CAN implement that way, but don't HAVE to implement all of it and MAY even do it completely differently.
I.e. it is a huge mess and requires either proper definitions/drivers from the vendor or *a lot* of guesswork and trial and error from developers (who usually have better things to do...)

I usually avoid NVMe drives from vendors that are known for high power demand and being essentially space heaters. Apart from that, I decide per use case whether low power/heat or higher performance/power draw is more important and choose the drive model accordingly. E.g. I'm running a lot of WD Blue M.2 NVMe drives because they really only sip power and run at perfectly moderate temperatures even in passively cooled systems. On the other hand, I'm running Micron 7400/7450 drives in servers for VM/jail pools where cooling is plentiful and performance is more critical. (At home I'm running WD Red M.2 NVMe drives as a poudriere build pool, which run *really* hot under load - wouldn't buy those again...)
 
My Gen 5 NVMe drives rise to 47C under full load. I know that is not very high, but I set my limit to 50C to raise the alarm. These each have their own heatsink, and if I weren't so anal about it, I would not have installed a fan (a 12 VDC fan running at 7 VDC) slowly moving air across them. Without the airflow I do hit 50C. More airflow has minimal impact, so slow and steady wins the race.

But switching it from 1 to 3 does not make a measurable difference in idle temperature or power intake.
Keep in mind that those watts are at a 5 VDC level, so it is a very small measurement if you are looking at input power to your system for current draw. This is one reason I hope using four or six drives will make a noticeable difference.

You could use thermal readings to gauge whether it is making any difference. However, I'm not sure that an NVMe drive set at 9W but not moving a lot of data would be very different from a lower power state (speaking from a power consumption standpoint).
 
Thank you, that is interesting work indeed. But it seems NVMe devices can differ individually. Mine says this:
Code:
Power States Supported: 5

 #   Max pwr  Enter Lat  Exit Lat RT RL WT WL Idle Pwr  Act Pwr Workload
--  --------  --------- --------- -- -- -- -- -------- -------- --
 0:  9.0000W    0.000ms   0.000ms  0  0  0  0  0.0000W  0.0000W 0
 1:  4.6000W    0.000ms   0.000ms  1  1  1  1  0.0000W  0.0000W 0
 2:  3.8000W    0.000ms   0.000ms  2  2  2  2  0.0000W  0.0000W 0
 3:  0.0450W*   2.000ms   2.000ms  3  3  3  3  0.0000W  0.0000W 0
 4:  0.0040W*  15.000ms  15.000ms  4  4  4  4  0.0000W  0.0000W 0

But switching it from 1 to 3 does not make a measurable difference in idle temperature or power intake.
Hmm. But what about the 4th state? And are you sure the 3rd one actually persists after you set it? If not, please also check the note in my other post.
 
IIRC the problem with APST is that it is yet another one of those "non-standards" - i.e. it's just a collection of suggestions, which vendors CAN implement that way, but don't HAVE to implement all of it and MAY even do it completely differently.
I.e. it is a huge mess and requires either proper definitions/drivers from the vendor or *a lot* of guesswork and trial and error from developers (who usually have better things to do...)

I usually avoid NVMe drives from vendors that are known for high power demand and being essentially space heaters. Apart from that, I decide per use case whether low power/heat or higher performance/power draw is more important and choose the drive model accordingly. E.g. I'm running a lot of WD Blue M.2 NVMe drives because they really only sip power and run at perfectly moderate temperatures even in passively cooled systems. On the other hand, I'm running Micron 7400/7450 drives in servers for VM/jail pools where cooling is plentiful and performance is more critical. (At home I'm running WD Red M.2 NVMe drives as a poudriere build pool, which run *really* hot under load - wouldn't buy those again...)
The specification says the following about the "optional" keyword:
"A keyword that describes features that are not required by this specification. However, if any optional feature defined by the specification is implemented, the feature shall be implemented in the way defined by the specification."

APST is an optional feature indeed, but are you sure about the huge mess? Can you provide more info about this? It seems pretty standard to me, as it does to Linux developers as far as I know. Of course, there are different numbers of states (but up to 32 in total), different latencies and power consumption, but none of this matters as long as it stays within the specification. There is almost zero guesswork, and the Linux implementation basically uses only a maximum latency (which is tunable) to build the transition table (which, as mentioned above, must be standard if the feature is implemented at all). I added two sysctls: the maximum latency and another related coefficient, which is simply hardcoded in Linux for some reason. This can certainly be done in other ways, but I found the whole thing pretty straightforward overall.
 
My Gen 5 NVMe drives rise to 47C under full load. I know that is not very high, but I set my limit to 50C to raise the alarm. These each have their own heatsink, and if I weren't so anal about it, I would not have installed a fan (a 12 VDC fan running at 7 VDC) slowly moving air across them. Without the airflow I do hit 50C. More airflow has minimal impact, so slow and steady wins the race.


Keep in mind that those watts are at a 5 VDC level, so it is a very small measurement if you are looking at input power to your system for current draw. This is one reason I hope using four or six drives will make a noticeable difference.

You could use thermal readings to gauge whether it is making any difference. However, I'm not sure that an NVMe drive set at 9W but not moving a lot of data would be very different from a lower power state (speaking from a power consumption standpoint).
You are very lucky compared to me (idle at about 50C), chrcol (idle at almost 70C) or lostgeek (Thread audio-glitches-hot-nvme-drive-looking-for-advice-14-0-release.92816, he says his controller operates at 90C) :)
 
APST is an optional feature indeed, but are you sure about the huge mess? Can you provide more info about this?
I took a "medium deep dive" into APST a few years ago when I also tried to get it working in an automated manner (I gave up after running into frequent freezes/IO timeouts/crashes when changing the power levels around via script). IIRC the description of the 'standard' as a mess came from a developer on a mailing list. I'll try to find that source again, but don't hold your breath; I'm really not even sure if or on what mailing list that thread was posted, or if it was on some (this?) forum.


The thing when comparing FreeBSD to Linux compatibility, especially for such (probably) half-arsed standards, is: Linux is THE prime hobbyist OS, so you have a lot of people who are willing to waste (er, generously devote) their time fiddling about with hardware from various (exotic) vendors and in different variants to get something 'working'. The problem with that: it's not a particularly professional approach - it *might* work, but there's no guarantee. I wouldn't want to rely on something that is developed mainly in this way in a production environment. (That's why I use Free- and OpenBSD.)
The proper way is always to involve the vendor - either to get an already working driver, or at least some specifications and/or confirmation that the implementation is correct. If that's not possible, just stick with the baseline that is specified and hence works 100%.
 
The proper way is always to involve the vendor - either to get an already working driver, or at least some specifications and/or confirmation that the implementation is correct. If that's not possible, just stick with the baseline that is specified and hence works 100%.
I understand what you're saying, but fortunately it has nothing to do with the nvme(4) driver, because NVMe as a whole is just an open specification, which in my opinion is much better than, like you said, fiddling with hardware from different vendors.
 
NVMe is, but the APST part is not. At least not in a way that it *MUST* be implemented 100% identically by every vendor or model, and that's the point where the guessing and fiddling begins if you don't get proper specifications from the vendor...

I'd really like to see this working on FreeBSD, and given that it has been ~3 or 4 years since I looked into APST, chances are that there has been some progress, but in the state it was back then I wouldn't want to touch power states on any production system. (even if it might result in some considerable power savings - e.g. in my home server/buildhost I have 8 M.2 NVMe drives running...)
 
You are very lucky compared to me (idle at about 50C)
I am very keen on airflow and cooling. I have no issues making case modifications (that is how I installed a 120mm fan to blow on the NVMe drives). I designed the entire system with heat removal as a factor while keeping it a virtually silent system. And I had no idea what kind of heat my NVMe drives would generate, being Gen 5, so I got lucky as well.
 
NVMe is, but the APST part is not. At least not in a way that it *MUST* be implemented 100% identically by every vendor or model, and that's the point where the guessing and fiddling begins if you don't get proper specifications from the vendor...
But everything is exactly the opposite, actually. Again, an optional NVMe feature must either be implemented 100% according to the specification or not be implemented at all (which is checked by the driver, of course). This is guaranteed; otherwise the device violates the interface for no good reason and shouldn't be called NVMe in the first place. Can you show examples where this has actually happened? What are we arguing about?

I'd really like to see this working on FreeBSD, and given that it has been ~3 or 4 years since I looked into APST, chances are that there has been some progress, but in the state it was back then I wouldn't want to touch power states on any production system. (even if it might result in some considerable power savings - e.g. in my home server/buildhost I have 8 M.2 NVMe drives running...)
There is currently zero support for APST in nvme(4), and this has always been the case. Maybe you are talking about Power Management, but that has almost nothing to do with the driver (except for sending and receiving generic Set/Get commands when invoking nvmecontrol power ... from userspace, and one small data structure). It's hard to say from this point what your problem was, but it's totally understandable that you found the issue not worth solving in your particular use case.

What does seem strange to me is that it looks like we are arguing about who should do what with *their* system and who can or cannot propose changes. The FreeBSD Project has very well-established guidelines and procedures for code to be accepted, so I believe everyone's production environment is safe, regardless of whether you consider the person who has done the actual work capable of solving system problems.
 
Hmm. But what about the 4th state? And are you sure the 3rd one actually persists after you set it? If not, please also check the note in my other post.

Keep in mind that those watts are at a 5 VDC level, so it is a very small measurement if you are looking at input power to your system for current draw. This is one reason I hope using four or six drives will make a noticeable difference.
I get readings in 100mW increments, and indeed they fluctuate very much (because the machine is usually working), but there is a minimum that can be perceived.

You could use thermal readings to gauge whether it is making any difference. However, I'm not sure that an NVMe drive set at 9W but not moving a lot of data would be very different from a lower power state (speaking from a power consumption standpoint).
That is what I think - the integrated circuits may already reduce power when idle. What I can clearly see is the difference between states 0-2 when the device is busy; in state 0 the temperature runs away (>60C) and some cooling or ventilation would be needed (state 1 suits my current needs, because the device is mirrored against a SATA SSD which isn't as fast).

So the bottom line probably is that one has to look at these devices individually. Just as with disks, where each model behaves somewhat differently when it comes to power saving.

And are you sure the 3rd one actually persists after you set it?

Certainly - I did read it back and it went back to state 1 only when actually accessing the device.
 
So the bottom line probably is that one has to look at these devices individually. Just as with disks, where each model behaves somewhat differently when it comes to power saving.
Absolutely. It might even be a good idea to leave all these features disabled by default, but the lack of the mechanism itself only frustrates people who have very real issues without it.
 
JoeSchmuck I've made a new discovery, if the issue is still relevant for you. Last time I somehow missed that nvmecontrol(8) can also send a payload with the admin-passthru command, so in fact you have all the tools to configure APST without any patching:

1. Check if your controller has default transition data (i.e. some values that are not zero):
Code:
# nvmecontrol admin-passthru --opcode 10 --cdw10 12 --data-len 256 --read nvme0 | head -n2
DWORD0 status= 0
000: 00003c18 00000000 00003c18 00000000 00003c18 00000000 00000000 00000000

2a. If yes (I tested two different devices and both did), you can add the --raw-binary flag to the previous command and redirect stdout to a file. Something like this:
Code:
# nvmecontrol admin-passthru --opcode 10 --cdw10 12 --data-len 256 --read --raw-binary nvme0 > payload.bin

$ hexdump payload.bin
0000000 3c18 0000 0000 0000 3c18 0000 0000 0000
0000010 3c18 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0000100

2b. If not (unlikely, but who knows), or if you want different behavior, you can still generate the data structure yourself, which only needs to be done once. Something like this will do (change the power state latencies to whatever you have):
Python:
import struct
import sys

# All the operational states are mandatory, however only the required NOPS
# (i.e. the ones you want the controller to be able to switch to) should be
# specified.
pstates = [
    {'nops': False },
    {'nops': False },
    {'nops': False },
    {'nops': True,  'enlat': 1500,  'exlat': 1500 },
    {'nops': True,  'enlat': 6000,  'exlat': 14000 },
    {'nops': True,  'enlat': 50000, 'exlat': 80000 },
]

# The lower this value, the sooner the controller will switch the state up.
itpt_factor = 50 # (should be >0, 50 is the hardcoded Linux default)

table = [0] * 32
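# Each 64-bit table entry packs the target Idle Transition Power State (ITPS)
# into bits 07:03 and the Idle Time Prior to Transition (ITPT, in ms) into
# bits 31:08, per the APST data structure in the NVMe specification.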
for itps in reversed(range(1, len(pstates))):
    if not pstates[itps]['nops']:
        table[itps - 1] = table[itps]
        continue

    total_latency = pstates[itps]['enlat'] + pstates[itps]['exlat']
    itpt = min(total_latency * itpt_factor // 1000, (1 << 24) - 1)
    table[itps - 1] = itpt << 8 | itps << 3

sys.stdout.buffer.write(struct.pack('<'+'Q'*len(table), *table))

Code:
$ python apst_gendata.py > payload.bin && hexdump payload.bin
0000000 9618 0000 0000 0000 9618 0000 0000 0000
0000010 9618 0000 0000 0000 e820 0003 0000 0000
0000020 6428 0019 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
0000100

3. All that remains is to simply transfer this file back to the controller:
Code:
# nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 --input-file payload.bin --data-len 256 --write nvme0
DWORD0 status= 0

That's all; everything should work now. The only downside is that you have to run the last Set Features command every time you restart or wake up the machine, but that's a lot better than nothing, I guess. EDIT: I forgot that it might be possible to hook the command using devd(8), but I'm not very familiar with it.
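
For what it's worth, a devd(8) hook for the resume case might look like the sketch below - treat it as a guess, though: the match strings (kernel/power/resume) are an assumption I haven't verified against devd.conf(5), and the paths are only examples.
Code:
# Hypothetical /usr/local/etc/devd/nvme-apst.conf
notify 10 {
        match "system"          "kernel";
        match "subsystem"       "power";
        match "type"            "resume";
        action "/sbin/nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 --input-file /usr/local/etc/nvme-apst-payload.bin --data-len 256 --write nvme0";
};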
 
Thank you so much for making a patch. I am interested and have a spare NVMe; I will see if I can play around and test it.

A bit of background: I am not a huge fan of NVMe personally, as it is power/heat heavy, but sadly I had to use it because the NUC I am using is basically NVMe only; there is no integrated mSATA like on the old unit, and the internal SATA port is placed somewhere impractical to use. I ended up using active fan cooling, as I was not comfortable with the storage at 70C day to day.

It is the OS drive, so I know it will never sit completely idle, but an NVMe drive, even if it is looping in and out of low power states, should still show a significant drop in temperature and power draw, as seen on Windows, which has masses of I/O on its system drive while idling.

So there is no risk of the drive freezing or anything if I run that nvmecontrol command?
 
eseipi
I lost track of this thread; I'm glad I came back to check on it. As for running the command automatically when the system is booted, you could just place it in /usr/local/etc/rc.d/ as a shell script (all one line of it), or even in a cron tab, but that is a bad way to do it in my opinion.

Thanks for all that information. And I do still have this little problem, well, until one of my hard drives fails. It's going on 6 years, so someday is getting closer. Then I retire that system (sell it) and put my new all-NVMe system online.

As for the method of doing this, I wish it were a simple switch or single value and not a set of values in a file. I think about file corruption and then I load something corrupt into my NVMe drive. While typing this I was thinking that a small shell script could calculate the checksum of the file and ensure it is correct before sending the data. I do that with the files I have on GitHub; that way, when someone downloads a file, they can verify it was transferred without issue. It is not a fancy checksum either, but it doesn't need to be fancy.
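
Something along these lines would do it - only a rough sketch with made-up paths, using FreeBSD's sha256(1), assuming the hash was recorded once with sha256 -q payload.bin > payload.bin.sha256 right after generating the payload:
Code:
#!/bin/sh
# Sketch: refuse to send the APST payload if it no longer matches the hash
# recorded when it was generated. Paths and device name are examples.
payload=/usr/local/etc/nvme-apst-payload.bin

if [ "$(sha256 -q "$payload")" = "$(cat "${payload}.sha256")" ]; then
    nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 \
        --input-file "$payload" --data-len 256 --write nvme0
else
    echo "checksum mismatch, not sending $payload" >&2
    exit 1
fi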
 
Thank you so much for making a patch. I am interested and have a spare NVMe; I will see if I can play around and test it.
I wouldn't have bothered with patching if I had known that nvmecontrol is capable of sending payloads :) For now the proposed change essentially adds two kenvs controlling the feature state at device initialization, so tbh there isn't a lot of functionality to test. Moreover, the patch won't compile on CURRENT after the recent src: f08746a7. I use RELEASE myself, but I'll fix it if anyone actually needs it.
So there is no risk of the drive freezing or anything if I run that nvmecontrol command?
I must emphasize that you always need to clearly understand what you're doing. The APST-related sections of the NVMe spec are good references in this case. That said, why would there be any risk? Despite the length, my previous post essentially boils down to these two commands:
Code:
# nvmecontrol admin-passthru --opcode 10 --cdw10 12 --read --data-len 256 --raw-binary nvme0 > payload.bin
# nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 --write --data-len 256 --input-file payload.bin nvme0
DWORD0 status= 0
As was mentioned, the first one writes the default transition data from the controller to a file, and the second enables the feature by sending the same data back. All of this should be perfectly safe to do on a running system. The only obstacle I can think of is a lack of data provided by the vendor, in which case you'll have to obtain it yourself in some other way.
 
As for running the command automatically when the system is booted, you could just place it in /usr/local/etc/rc.d/ as a shell script (all one line of it), or even in a cron tab, but that is a bad way to do it in my opinion.
The main issue with relying on userspace tools for this task is that you need to hook into controller initialization, which happens not only when the system boots, but also on wakeup, upon user request, and IIRC after some errors. There may be other cases - I'm not sure. If you're aware of these limitations, using simple rc scripts on boot/resume may indeed be totally sufficient.
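
For the boot part, a minimal rc.d sketch could look like this (the script name, payload path, and device are placeholders; it would be enabled with nvme_apst_enable="YES" in rc.conf):
Code:
#!/bin/sh
#
# Hypothetical /usr/local/etc/rc.d/nvme_apst
#
# PROVIDE: nvme_apst
# REQUIRE: FILESYSTEMS

. /etc/rc.subr

name=nvme_apst
rcvar=nvme_apst_enable
start_cmd="${name}_start"
stop_cmd=":"

nvme_apst_start()
{
    /sbin/nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 \
        --input-file /usr/local/etc/nvme-apst-payload.bin --data-len 256 --write nvme0
}

load_rc_config $name
: ${nvme_apst_enable:=NO}
run_rc_command "$1"
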
As for the method of doing this, I wish it were a simple switch or single value and not a set of values in a file. I think about file corruption and then I load something corrupt into my NVMe drive.
Some additional error checking wouldn't hurt, but it might not be strictly necessary either. During testing I sent multiple invalid payloads, and they were simply rejected by the controller with an error code. I'm not saying these checks are foolproof or that every device performs them, though. There's also no real need to store the data in the first place; instead you can simply retrieve or generate it each time (Linux does exactly this).
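
For instance, the read-back and the send from my earlier post could simply be chained on every boot/resume with a throwaway file (the path is arbitrary):
Code:
# nvmecontrol admin-passthru --opcode 10 --cdw10 12 --data-len 256 --read --raw-binary nvme0 > /tmp/apst.bin && \
    nvmecontrol admin-passthru --opcode 9 --cdw10 12 --cdw11 1 --input-file /tmp/apst.bin --data-len 256 --write nvme0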
 