[Solved] Ryzen 7 Frequency

Hello all,
I have a FreeBSD 12.0 BETA4 machine running a KDE 5 desktop: Ryzen 7 1700 (overclocked to 3700 MHz, if that matters), 32 GB of RAM, on an MSI X370 Gaming Plus motherboard with the latest BIOS.

When I run sysctl dev.cpu at idle, towards the bottom of the list I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 1550
dev.cpu.0.temperature: 28.6C

When I am running HandBrake I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 1550
dev.cpu.0.temperature: 44.6C
Does this mean I am never reaching 3700 MHz, or am I reading it wrong?
If it is in fact running that slowly, how can I make it run at (or close to) the desired 3700 MHz?
 
Are you running powerd(8)? If so, what are your settings for it?

On my system it works as expected. It is a Ryzen 7 2700 (not overclocked), 32 GB DDR4-3400, ASUS X470 mainboard, running stable/11. When the system is idle, I have this:
Code:
dev.cpu.0.freq_levels: 3200/3200 2800/2450 1550/1181
dev.cpu.0.freq: 1550
And when it's under load, I get this:
Code:
dev.cpu.0.freq_levels: 3200/3200 2800/2450 1550/1181
dev.cpu.0.freq: 3200
The only thing that's wrong is the temperature reported by amdtemp(4). The value is obviously wrong (much too high), so I can't use it to monitor the system temperature, unfortunately.
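For what it's worth, on Ryzen the driver reports the Tctl value, which on some models sits a fixed number of degrees above the real temperature, and amdtemp(4) provides a sysctl to compensate. A sketch, assuming the first amdtemp device; the -10 offset is only an example value, check what applies to your particular CPU:

```shell
# Tctl on some Ryzen parts is offset above the true die temperature.
# amdtemp(4) exposes a per-device correction (example offset, verify for your CPU):
sysctl dev.amdtemp.0.sensor_offset=-10

# Make the correction persistent across reboots
echo 'dev.amdtemp.0.sensor_offset=-10' >> /etc/sysctl.conf
```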
 


Thanks, I was unaware of powerd.

I added this to my rc.conf:
Code:
powerd_enable="YES"
powerd_flags="-a hiadaptive"
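For anyone following along, the same change can be made from the command line with sysrc(8) instead of editing rc.conf by hand; a minimal sketch:

```shell
# Enable powerd(8) at boot; -a selects the mode used while on AC power
sysrc powerd_enable="YES"
sysrc powerd_flags="-a hiadaptive"

# Start it now and check that the frequency reacts to load
service powerd start
sysctl dev.cpu.0.freq
```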



I now show at idle:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 3700
dev.cpu.0.temperature: 27.5C

Running HandBrake I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 3700
dev.cpu.0.temperature: 51.6C

Oddly, my fps in HandBrake stayed about the same.

Do you have any other performance-related tips?
 
I can't say it will provide an answer in this case, but there is a "superior" power daemon by the name of powerd++ (sysutils/powerdxx in ports/pkg).
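For anyone who wants to try it, installing and switching over might look roughly like this (powerd and powerd++ should not run at the same time; the rc variable name comes from the port's rc script, so verify it after installing):

```shell
# Install powerd++ from packages (the port is sysutils/powerdxx)
pkg install powerdxx

# Disable powerd(8) and enable powerdxx in its place
sysrc powerd_enable="NO"
service powerd stop
sysrc powerdxx_enable="YES"
service powerdxx start
```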
 
Do you have any other performance-related tips?
I'm sorry, I'm not familiar with HandBrake. Maybe someone else can comment on that.
However, if HandBrake has a configuration setting for multithreading, you should make sure it is enabled.
 
The #1 question of all performance optimization: Do you know where your bottleneck is?

I have no idea what HandBrake does in detail, other than being a video transcoder. Are you sure it is really CPU limited? Look at the amount of IO (how many bytes are read/written), and compare that to the infrastructure you are using for IO (disks, filesystem, etc.). If HandBrake were IO limited, then your CPU utilization would be very likely below 100%.

If your CPU utilization is 100%, then IO is probably not the bottleneck, and the next candidate is CPU or memory (I'm assuming that HandBrake runs on a single computer and doesn't use the network). Check that all cores (and all hyperthreaded pseudo-cores) are fully utilized. That's relatively easy, since top(1) can show CPU utilization per core. If the cores are not loaded uniformly, then the bottleneck is one core, and you need to change things. Unfortunately, getting code to run uniformly on all cores can be very hard, and is likely to require source code changes.
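On FreeBSD, the per-core view is top(1)'s -P flag:

```shell
# Show utilization separately for each CPU (can also be toggled with 'P' inside top)
top -P

# Non-interactive batch-mode output, e.g. for logging
top -P -b
```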

However, I suspect that the real problem is memory bandwidth. If the bottleneck is memory, then CPU utilization will look like 100% (because the CPU is busy waiting for data to come from or go to memory), but it is not actually processing data. In that case, increasing the clock frequency won't help, because memory bandwidth does not depend on clock frequency. I would focus on two aspects here.

First, if your system is NUMA, make sure you have configured it correctly (see the numactl command). This may require changing the source code to allocate memory in the correct threads, and giving threads core affinity (perhaps during allocation, perhaps afterwards). Without a full understanding of the memory layout of the program, this is likely difficult.

Second, measure the memory throughput of the system, and see whether you are up against the limits there. This requires OS- and CPU-specific tools (I have only done this on Linux on Xeon and Power9). Warning: optimizing code to use memory efficiently can be very hardware-specific; when working on this years ago, we found that we had to make significant changes between successive Intel generations.
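A side note: numactl is a Linux tool; on FreeBSD the closest equivalent for giving a process core affinity is cpuset(1). A sketch, where the core ranges and the PID are just examples:

```shell
# Start a command pinned to the first eight hardware threads
cpuset -l 0-7 HandBrakeCLI -i input.mkv -o output.mkv

# Inspect or change the affinity of an already-running process by PID
cpuset -g -p 4951        # show which CPUs PID 4951 may run on
cpuset -l 0-15 -p 4951   # allow it to use all 16 threads
```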
 
Just for the record: AMD Ryzen (1st and 2nd generation) are not NUMA, so you don't have to worry about that. AMD Threadripper processors are NUMA.

By the way, I'm using ffmpeg for video/audio transcoding. For my use cases, it is CPU-bound. Neither I/O nor memory bandwidth is the bottleneck. I think HandBrake is based on ffmpeg (but I'm not sure; somebody correct me please if I'm wrong), so it will probably be similar for you. Just be sure to enable multithreading. For the stock ffmpeg this is enabled by default.
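As a concrete example of making the threading explicit with stock ffmpeg (file names and encoder settings here are just placeholders):

```shell
# ffmpeg multithreads by default; -threads 0 states it explicitly
# ("pick a thread count appropriate for the codec and core count")
ffmpeg -i input.mkv -c:v libx264 -preset slow -crf 20 -threads 0 output.mkv
```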
 


Thanks for the help. I am reading the file over the local network from a physical machine (1,900,000 B/s on the NIC), and writing it over the local network to a VM (1,327,000 B/s on the NIC, 1,000,000 B/s to its disk) running in bhyve on the same box as HandBrake.

Do these numbers seem normal?

Code:
last pid:  4984;  load averages: 18.29, 13.88, 7.37    up 0+13:58:21  08:38:09
124 processes: 1 running, 123 sleeping
CPU 0:  76.8% user, 0.0% nice, 4.1% system, 0.0% interrupt, 19.1% idle
CPU 1:  78.0% user, 0.0% nice, 2.8% system, 0.0% interrupt, 19.1% idle
CPU 2:  86.6% user, 0.0% nice, 4.1% system, 0.8% interrupt,  8.5% idle
CPU 3:  74.4% user, 0.0% nice, 3.3% system, 0.0% interrupt, 22.4% idle
CPU 4:  78.5% user, 0.0% nice, 3.3% system, 0.0% interrupt, 18.3% idle
CPU 5:  80.1% user, 0.0% nice, 0.8% system, 0.0% interrupt, 19.1% idle
CPU 6:  79.7% user, 0.0% nice, 2.8% system, 0.0% interrupt, 17.5% idle
CPU 7:  80.9% user, 0.0% nice, 2.0% system, 0.0% interrupt, 17.1% idle
CPU 8:  81.7% user, 0.0% nice, 3.7% system, 0.0% interrupt, 14.6% idle
CPU 9:  79.7% user, 0.0% nice, 3.3% system, 0.0% interrupt, 17.1% idle
CPU 10: 77.2% user, 0.0% nice, 4.5% system, 0.0% interrupt, 18.3% idle
CPU 11: 81.7% user, 0.0% nice, 5.3% system, 0.0% interrupt, 13.0% idle
CPU 12: 75.2% user, 0.0% nice, 5.7% system, 0.0% interrupt, 19.1% idle
CPU 13: 82.9% user, 0.0% nice, 2.8% system, 0.0% interrupt, 14.2% idle
CPU 14: 77.6% user, 0.0% nice, 2.0% system, 0.0% interrupt, 20.3% idle
CPU 15: 80.5% user, 0.0% nice, 3.3% system, 0.0% interrupt, 16.3% idle
Mem: 3618M Active, 16G Inact, 87M Laundry, 11G Wired, 675M Buf, 751M Free
ARC: 3842M Total, 1815M MFU, 1218M MRU, 32K Anon, 44M Header, 765M Other
     1545M Compressed, 3463M Uncompressed, 2.24:1 Ratio
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME     WCPU COMMAND
 4951 pete        156  20  -19  1167M   762M select  3  94:40 1217.27% ghb
 1504 root         55  20    0  6271M  5400M kqread  4  94:07   24.96% bhyve
 4943 pete         33  32    0  1555M   298M select 14   0:21   21.11% firefox
 4851 pete         22  29    0   387M   187M select  4   2:22   10.33% kwin_x11
 2322 pete         22  22    0   381M   172M select  5  49:18    7.05% kwin_x11
 1675 root          9  23    0   215M   134M select 15   7:47    4.56% Xorg
 2585 pete         23  20    0   330M   134M select  0  28:53    4.18% kscreenlocker_greet
 4933 pete         71  20    0  2011M   514M select  2   0:23    3.21% firefox
 2260 pete         17  21    0   224M   117M select  7  15:39    2.26% Xorg
 2189 root         21  20    0  4139M  3914M kqread 10  49:56    1.00% bhyve
 4871 pete         19  20    0   247M   100M select  3   0:03    0.77% konsole
 1907 root         21  20    0  4139M  3203M kqread  4   7:18    0.74% bhyve
 2326 pete         24  20    0   471M   186M select  0   5:30    0.72% plasmashell
 4968 pete         23  20    0   309M   119M select  7   0:04    0.52% ksysguard
 4855 pete         24  20    0   479M   203M select  4   0:13    0.47% plasmashell
  844 root          1  20    0    11M  1776K select  4   0:01    0.22% moused
 4938 pete         29  20    0  1568M   333M select  7   0:08    0.13% firefox
 4984 pete          1  20    0    13M  3232K CPU1    1   0:00    0.07% top
 4969 pete          2  20    0    15M  4424K select 10   0:00    0.04% ksysguardd
 1974 root          1  20    0    12M  2520K select 14   0:04    0.04% hald-addon-storage
 1092 root          1  20    0    11M  1660K select 10   0:09    0.03% powerd
 1660 haldaemon     2  20    0    22M  7532K select 13   0:12    0.03% hald
 2339 pete         19  20    0   244M    89M select  8   0:09    0.02% konsole
 2295 pete          5  20    0   178M    52M select 13   0:11    0.02% kdeinit5
 4869 pete         21  20    0   260M   106M select 11   0:01    0.01% dolphin
 4832 pete          5  20    0   180M    63M select  2   0:01    0.01% kdeinit5
 4853 pete         21  20    0   277M    99M select  2   0:01    0.01% krunner
 4954 pete          3  20    0    24M  7056K select 10   0:00    0.01% at-spi2-registryd
 1971 root          1  20    0    12M  2520K select 14   0:04    0.01% hald-addon-storage
 4937 pete         31  20    0  1493M   242M select 11   0:02    0.01% firefox
  857 root          1  20    0    10M  1012K select  7   0:02    0.01% devd
 2337 pete         21  20    0   258M   103M select 15   0:05    0.01% dolphin
 4944 pete         34  20    0  1557M   287M select 15   0:04    0.01% firefox
 4952 pete          1  20    0    12M  2864K select 11   0:00    0.00% dbus-daemon
  930 root          1  20    0    11M  2104K select  7   0:01    0.00% syslogd
 4976 pete         20  20    0   270M   120M select  2   0:01    0.00% kate
 4873 pete          6  20    0   114M    49M select 11   0:00    0.00% org_kde_powerdevil
 4845 pete          4  20    0   126M    52M select 14   0:00    0.00% ksmserver
 
Seems all your reading/writing happens over the network, and you are reading/writing only ~2 MByte/s each way, which is nothing. Given enough network hardware (100 GbE cards, or multiple Mellanox IB cards), a machine of that class can do ~18 GByte/s over the network, so 2 MByte/s doesn't even register.

It seems that your CPUs and memory are utilized reasonably well. The odd thing is that they are not utilized perfectly. If the program were written well, it should have all CPUs at 100% utilization (because the bottleneck should be either CPU or memory access, and memory stalls are counted as CPU utilization). But CPU/memory utilization is only about 80%, except for one CPU, which is roughly 90%. Perhaps that one core has some extra processing to do? Maybe it is running a fine-grained workload manager that distributes tasks over the other CPUs?

The good news: You are running nearly as fast as possible; simple changes (like balancing things) can only get you another 20%.

The bad news: We don't know why you are wasting 20% of your system.
 
The bad news: We don't know why you are wasting 20% of your system.
He is doing video transcoding, probably using the h.264 or h.265 codecs (MPEG AVC/HEVC). With multithreading, these codecs only scale well up to about 8–10 threads; above that, the benefit of additional threads diminishes. Also note that the threads are not completely independent of each other, because the compressed data of a frame can reference data of an earlier frame. Therefore it is not surprising that his 16 virtual cores (8 cores × 2 SMT) are not at 100%. I've got a similar processor (Ryzen 2700) and also do video transcoding with ffmpeg, so I'm pretty much in the same boat.

PS: Having said that – I'm very satisfied with the performance of the CPU. It can transcode Full HD with very high quality settings much faster than my previous machine. And if I really wanted to use the processor at 100%, I could just transcode two videos at the same time. ;-)
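Running two transcodes concurrently saturates the cores better than one, since each encoder only scales to ~8–10 threads. A sketch with placeholder file names:

```shell
# Launch two independent transcodes in the background;
# each gets its own encoder thread pool
ffmpeg -i a.mkv -c:v libx264 -crf 20 out-a.mkv &
ffmpeg -i b.mkv -c:v libx264 -crf 20 out-b.mkv &
wait   # block until both background jobs have finished
```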
 