[Solved] Ryzen 7 Frequency

Hello all,
I have a FreeBSD 12.0 BETA4 machine running a KDE 5 desktop: Ryzen 7 1700 (overclocked to 3700 MHz, if that matters), 32 GB of RAM, on an MSI X370 Gaming Plus motherboard with the latest BIOS.

When I run sysctl dev.cpu at idle, towards the bottom of the list I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 1550
dev.cpu.0.temperature: 28.6C

When I am running HandBrake I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 1550
dev.cpu.0.temperature: 44.6C
Does this mean I am never reaching 3700 MHz, or am I reading it wrong?
If it is in fact running that slowly, how can I make it run at (or close to) the desired 3700 MHz?
 
Are you running powerd(8)? If so, what are your settings for it?

On my system it works as expected. It is a Ryzen 7 2700 (not overclocked), 32 GB DDR4-3400, ASUS X470 mainboard, running stable/11. When the system is idle, I have this:
Code:
dev.cpu.0.freq_levels: 3200/3200 2800/2450 1550/1181
dev.cpu.0.freq: 1550
And when it's under load, I get this:
Code:
dev.cpu.0.freq_levels: 3200/3200 2800/2450 1550/1181
dev.cpu.0.freq: 3200
The only thing that's wrong is the temperature reported by amdtemp(4). The value is obviously wrong (much too high), so I can't use it to monitor the system temperature, unfortunately.
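For what it's worth, on Ryzen the driver reports the Tctl value, which on some models sits a fixed number of degrees above the real temperature, and amdtemp(4) provides a sysctl to compensate. A sketch, assuming the first amdtemp device; the -10 offset is only an example value, check what applies to your particular CPU:

```shell
# Tctl on some Ryzen parts is offset above the true die temperature.
# amdtemp(4) exposes a per-device correction (example offset, verify for your CPU):
sysctl dev.amdtemp.0.sensor_offset=-10

# Make the correction persistent across reboots
echo 'dev.amdtemp.0.sensor_offset=-10' >> /etc/sysctl.conf
```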
 


Thanks, I was unaware of powerd.

I added this to my rc.conf:
Code:
powerd_enable="YES"
powerd_flags="-a hiadaptive"
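For anyone following along, the same change can be made from the command line with sysrc(8) instead of editing rc.conf by hand; a minimal sketch:

```shell
# Enable powerd(8) at boot; -a selects the mode used while on AC power
sysrc powerd_enable="YES"
sysrc powerd_flags="-a hiadaptive"

# Start it now and check that the frequency reacts to load
service powerd start
sysctl dev.cpu.0.freq
```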



I now show at idle:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 3700
dev.cpu.0.temperature: 27.5C

Running HandBrake I get:
Code:
dev.cpu.0.freq_levels: 3700/-1 3700/-1 1550/-1
dev.cpu.0.freq: 3700
dev.cpu.0.temperature: 51.6C

Oddly, my fps in HandBrake stayed about the same.

Do you have any other performance-related tips?
 
I can't say it will provide an answer in this case, but there is a "superior" power daemon by the name of powerd++ (sysutils/powerdxx in ports/pkg).
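For anyone who wants to try it, installing and switching over might look roughly like this (powerd and powerd++ should not run at the same time; the rc variable name comes from the port's rc script, so verify it after installing):

```shell
# Install powerd++ from packages (the port is sysutils/powerdxx)
pkg install powerdxx

# Disable powerd(8) and enable powerdxx in its place
sysrc powerd_enable="NO"
service powerd stop
sysrc powerdxx_enable="YES"
service powerdxx start
```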
 
Do you have any other performance-related tips?
I'm sorry, I'm not familiar with HandBrake. Maybe someone else can comment on that.
However, if HandBrake has a configuration setting for multithreading, you should make sure it is enabled.
 
The #1 question of all performance optimization: Do you know where your bottleneck is?

I have no idea what HandBrake does in detail, other than being a video transcoder. Are you sure it is really CPU limited? Look at the amount of IO (how many bytes are read/written), and compare that to the infrastructure you are using for IO (disks, filesystem, etc.). If HandBrake were IO limited, then your CPU utilization would be very likely below 100%.

If your CPU utilization is 100%, then IO is probably not the bottleneck, and the next candidate is CPU or memory (I'm assuming that HandBrake runs on a single computer and doesn't use the network). Check that all cores (and all hyperthreaded pseudo-cores) are fully utilized. That's relatively easy, since top(1) can show CPU utilization per core. If the cores are not loaded uniformly, then the bottleneck is one core, and you need to change things. Unfortunately, getting code to run uniformly on all cores can be very hard, and is likely to require source code changes.
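On FreeBSD, the per-core view is top(1)'s -P flag:

```shell
# Show utilization separately for each CPU (can also be toggled with 'P' inside top)
top -P

# Non-interactive batch-mode output, e.g. for logging
top -P -b
```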

However, I suspect that the real problem is memory bandwidth. If the bottleneck is memory, then CPU utilization will look like 100% (because the CPU is busy waiting for data to come from or go to memory), but it is not actually processing data. In that case, increasing the clock frequency won't help, because memory bandwidth does not depend on clock frequency. I would focus on two aspects here.

First, if your system is NUMA, make sure you have configured it correctly (see the numactl command). This may require changing the source code to allocate memory in the correct threads, and giving threads core affinity (perhaps during allocation, perhaps afterwards). Without a full understanding of the memory layout of the program, this is likely difficult.

Second, measure the memory throughput of the system, and see whether you are up against the limits there. This requires OS- and CPU-specific tools (I have only done this on Linux on Xeon and Power9). Warning: optimizing code to use memory efficiently can be very hardware-specific; when working on this years ago, we found that we had to make significant changes between successive Intel generations.
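A side note: numactl is a Linux tool; on FreeBSD the closest equivalent for giving a process core affinity is cpuset(1). A sketch, where the core ranges and the PID are just examples:

```shell
# Start a command pinned to the first eight hardware threads
cpuset -l 0-7 HandBrakeCLI -i input.mkv -o output.mkv

# Inspect or change the affinity of an already-running process by PID
cpuset -g -p 4951        # show which CPUs PID 4951 may run on
cpuset -l 0-15 -p 4951   # allow it to use all 16 threads
```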
 
Just for the record: AMD Ryzen (1st and 2nd generation) are not NUMA, so you don't have to worry about that. AMD Threadripper processors are NUMA.

By the way, I'm using ffmpeg for video/audio transcoding. For my use cases, it is CPU-bound. Neither I/O nor memory bandwidth is the bottleneck. I think HandBrake is based on ffmpeg (but I'm not sure; somebody correct me please if I'm wrong), so it will probably be similar for you. Just be sure to enable multithreading. For the stock ffmpeg this is enabled by default.
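As a concrete example of making the threading explicit with stock ffmpeg (file names and encoder settings here are just placeholders):

```shell
# ffmpeg multithreads by default; -threads 0 states it explicitly
# ("pick a thread count appropriate for the codec and core count")
ffmpeg -i input.mkv -c:v libx264 -preset slow -crf 20 -threads 0 output.mkv
```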
 


Thanks for the help. I am reading the file over the local network from a physical machine (1,900,000 B/s on the NIC), and writing it over the local network to a VM (1,327,000 B/s on the NIC, 1,000,000 B/s to its disk) running in bhyve on the same box as HandBrake.

Do these numbers seem normal?

Code:
last pid:  4984;  load averages: 18.29, 13.88, 7.37    up 0+13:58:21  08:38:09
124 processes: 1 running, 123 sleeping
CPU 0:  76.8% user, 0.0% nice, 4.1% system, 0.0% interrupt, 19.1% idle
CPU 1:  78.0% user, 0.0% nice, 2.8% system, 0.0% interrupt, 19.1% idle
CPU 2:  86.6% user, 0.0% nice, 4.1% system, 0.8% interrupt,  8.5% idle
CPU 3:  74.4% user, 0.0% nice, 3.3% system, 0.0% interrupt, 22.4% idle
CPU 4:  78.5% user, 0.0% nice, 3.3% system, 0.0% interrupt, 18.3% idle
CPU 5:  80.1% user, 0.0% nice, 0.8% system, 0.0% interrupt, 19.1% idle
CPU 6:  79.7% user, 0.0% nice, 2.8% system, 0.0% interrupt, 17.5% idle
CPU 7:  80.9% user, 0.0% nice, 2.0% system, 0.0% interrupt, 17.1% idle
CPU 8:  81.7% user, 0.0% nice, 3.7% system, 0.0% interrupt, 14.6% idle
CPU 9:  79.7% user, 0.0% nice, 3.3% system, 0.0% interrupt, 17.1% idle
CPU 10: 77.2% user, 0.0% nice, 4.5% system, 0.0% interrupt, 18.3% idle
CPU 11: 81.7% user, 0.0% nice, 5.3% system, 0.0% interrupt, 13.0% idle
CPU 12: 75.2% user, 0.0% nice, 5.7% system, 0.0% interrupt, 19.1% idle
CPU 13: 82.9% user, 0.0% nice, 2.8% system, 0.0% interrupt, 14.2% idle
CPU 14: 77.6% user, 0.0% nice, 2.0% system, 0.0% interrupt, 20.3% idle
CPU 15: 80.5% user, 0.0% nice, 3.3% system, 0.0% interrupt, 16.3% idle
Mem: 3618M Active, 16G Inact, 87M Laundry, 11G Wired, 675M Buf, 751M Free
ARC: 3842M Total, 1815M MFU, 1218M MRU, 32K Anon, 44M Header, 765M Other
     1545M Compressed, 3463M Uncompressed, 2.24:1 Ratio
Swap:

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME     WCPU COMMAND
 4951 pete        156  20  -19  1167M   762M select  3  94:40 1217.27% ghb
 1504 root         55  20    0  6271M  5400M kqread  4  94:07   24.96% bhyve
 4943 pete         33  32    0  1555M   298M select 14   0:21   21.11% firefox
 4851 pete         22  29    0   387M   187M select  4   2:22   10.33% kwin_x11
 2322 pete         22  22    0   381M   172M select  5  49:18    7.05% kwin_x11
 1675 root          9  23    0   215M   134M select 15   7:47    4.56% Xorg
 2585 pete         23  20    0   330M   134M select  0  28:53    4.18% kscreenlocker_greet
 4933 pete         71  20    0  2011M   514M select  2   0:23    3.21% firefox
 2260 pete         17  21    0   224M   117M select  7  15:39    2.26% Xorg
 2189 root         21  20    0  4139M  3914M kqread 10  49:56    1.00% bhyve
 4871 pete         19  20    0   247M   100M select  3   0:03    0.77% konsole
 1907 root         21  20    0  4139M  3203M kqread  4   7:18    0.74% bhyve
 2326 pete         24  20    0   471M   186M select  0   5:30    0.72% plasmashell
 4968 pete         23  20    0   309M   119M select  7   0:04    0.52% ksysguard
 4855 pete         24  20    0   479M   203M select  4   0:13    0.47% plasmashell
  844 root          1  20    0    11M  1776K select  4   0:01    0.22% moused
 4938 pete         29  20    0  1568M   333M select  7   0:08    0.13% firefox
 4984 pete          1  20    0    13M  3232K CPU1    1   0:00    0.07% top
 4969 pete          2  20    0    15M  4424K select 10   0:00    0.04% ksysguardd
 1974 root          1  20    0    12M  2520K select 14   0:04    0.04% hald-addon-storage
 1092 root          1  20    0    11M  1660K select 10   0:09    0.03% powerd
 1660 haldaemon     2  20    0    22M  7532K select 13   0:12    0.03% hald
 2339 pete         19  20    0   244M    89M select  8   0:09    0.02% konsole
 2295 pete          5  20    0   178M    52M select 13   0:11    0.02% kdeinit5
 4869 pete         21  20    0   260M   106M select 11   0:01    0.01% dolphin
 4832 pete          5  20    0   180M    63M select  2   0:01    0.01% kdeinit5
 4853 pete         21  20    0   277M    99M select  2   0:01    0.01% krunner
 4954 pete          3  20    0    24M  7056K select 10   0:00    0.01% at-spi2-registryd
 1971 root          1  20    0    12M  2520K select 14   0:04    0.01% hald-addon-storage
 4937 pete         31  20    0  1493M   242M select 11   0:02    0.01% firefox
  857 root          1  20    0    10M  1012K select  7   0:02    0.01% devd
 2337 pete         21  20    0   258M   103M select 15   0:05    0.01% dolphin
 4944 pete         34  20    0  1557M   287M select 15   0:04    0.01% firefox
 4952 pete          1  20    0    12M  2864K select 11   0:00    0.00% dbus-daemon
  930 root          1  20    0    11M  2104K select  7   0:01    0.00% syslogd
 4976 pete         20  20    0   270M   120M select  2   0:01    0.00% kate
 4873 pete          6  20    0   114M    49M select 11   0:00    0.00% org_kde_powerdevil
 4845 pete          4  20    0   126M    52M select 14   0:00    0.00% ksmserver
 
Seems all your reading/writing happens over the network, and you are reading/writing only ~2 MByte/s each way, which is nothing. Given enough network hardware (100 GbE cards, or multiple Mellanox IB cards), a machine of that class can do ~18 GByte/s over the network, so 2 MByte/s doesn't even register.

It seems that your CPUs and memory are utilized reasonably well. The odd thing is that they are not utilized perfectly. If the program were written well, it should have all CPUs at 100% utilization (because the bottleneck should be either CPU or memory access, and memory stalls are counted as CPU utilization). But CPU/memory utilization is only about 80%, except for one CPU, which is roughly 90%. Perhaps that one core has some extra processing to do? Maybe it is running a fine-grained workload manager that distributes tasks over the other CPUs?

The good news: You are running nearly as fast as possible; simple changes (like balancing things) can only get you another 20%.

The bad news: We don't know why you are wasting 20% of your system.
 
The bad news: We don't know why you are wasting 20% of your system.
He is doing video transcoding, probably using the h.264 or h.265 codecs (MPEG AVC/HEVC). With multithreading, these codecs only scale well up to about 8–10 threads; above that, the benefit of additional threads diminishes. Also note that the threads are not completely independent of each other, because the compressed data of a frame can reference data of an earlier frame. Therefore it is not surprising that his 16 virtual cores (8 cores × 2 SMT) are not at 100%. I've got a similar processor (Ryzen 2700) and also do video transcoding with ffmpeg, so I'm pretty much in the same boat.

PS: Having said that – I'm very satisfied with the performance of the CPU. It can transcode Full HD with very high quality settings much faster than my previous machine. And if I really wanted to use the processor at 100%, I could just transcode two videos at the same time. ;-)
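Running two transcodes concurrently saturates the cores better than one, since each encoder only scales to ~8–10 threads. A sketch with placeholder file names:

```shell
# Launch two independent transcodes in the background;
# each gets its own encoder thread pool
ffmpeg -i a.mkv -c:v libx264 -crf 20 out-a.mkv &
ffmpeg -i b.mkv -c:v libx264 -crf 20 out-b.mkv &
wait   # block until both background jobs have finished
```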
 