Additional CPU heat caused by something other than core load

I recently built a computer with an AMD FX-8350 8-core processor, running FreeBSD 9.1. I've been testing it under heavy load, and a curious thing has been happening. I have a program that keeps all 8 cores running at 90-100% for quite a while. Without getting into too many specifics, when I run the program one way the CPU stays at about 53C (95-100% load), and if I run it another way it stays at about 63C (90-95% load). This makes no sense to me. The program goes through the exact same operations regardless of how it's run (it multiplies matrices incrementally by preloading small sections into RAM). The only differences are how much RAM and address space are used, how often disk I/O happens, and how many times each loop is executed.
  1. The first instance (process that causes 53C) mmaps about 24GB and allocates about 520MB. It reads/writes (raidz2) about half as often as the other instance, which is why the core loads are higher. All reads and writes are sequential and in large blocks.
  2. The second instance (process that causes 63C) mmaps about 6GB and allocates about 516MB.
It could be that the first instance is making better use of the CPU caches, or that using the southbridge (for disk I/O) causes more CPU heat than using the northbridge (for RAM). I really can't think of any other reason for such a large difference in temperature. Neither of the temperatures is horrible, but if/when I decide to mess around with overclocking I'd like to know that I can max out my CPU heat predictably.
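For reference, each pass boils down to something like the sketch below: map the whole matrix file, then copy small sequential blocks into an allocated buffer and do the arithmetic on the copy. The file name, sizes, and block size here are invented for illustration; the real program is more involved.

Code:
/*
 * Minimal sketch of the access pattern only -- not the real program.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_BYTES (4UL * 1024 * 1024)   /* one preloaded section */

int main(void)
{
    const char *path = "matrix.dat";      /* hypothetical input file */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    off_t len = lseek(fd, 0, SEEK_END);
    if (len <= 0) { perror("lseek"); return 1; }

    /* Map the whole file: a large address-space footprint, with pages
       faulted in and reclaimed as the sequential pass moves along. */
    double *m = mmap(NULL, (size_t)len, PROT_READ, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    /* Small malloc'd working buffer, reused for every block. */
    double *buf = malloc(BLOCK_BYTES);
    if (buf == NULL) { perror("malloc"); return 1; }

    /* Sequential, page-aligned, block-sized copies from the mapping
       into RAM; the arithmetic then runs on the in-RAM copy. */
    for (off_t off = 0; off + (off_t)BLOCK_BYTES <= len; off += BLOCK_BYTES) {
        memcpy(buf, (const char *)m + off, BLOCK_BYTES);
        /* ... multiply/accumulate on buf here ... */
    }

    free(buf);
    munmap(m, (size_t)len);
    close(fd);
    return 0;
}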

Thanks!

Kevin Barry
 

Maybe it is the on-die MMU getting a work-out with all the paging going on? No idea if MMU utilisation is included in reported CPU utilisation statistics, but I suspect not?
 
Please supply the wall times of the program. You may also find the CPU event counters a good thing to try out; you may want to check the number of memory transfers, cache misses, and so on in the code. Sometimes surprising things come up with this.
 
The first instance takes ~175 minutes and the second ~14 minutes. The first does at most 8x more work but takes 12x longer. By "in the code" do you mean to literally count them in the source code? I wrote the code, so I know that all of the reads and writes (for these particular instances) are page-aligned and sequential, but the memory access probably isn't that simple. The program basically just copies columns/rows into RAM, has GSL multiply them, then adds the resulting block to the respective rows of the output matrix. In the first instance GSL is continually multiplying a 4x256 by a 256x16384, and in the second a 4x128 by a 128x32768.
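Stripped down to a minimal GSL sketch (with constant stand-in data instead of the blocks streamed in from the mapping), the per-block step is basically this:

Code:
/* Build with something like: cc blockmul.c -lgsl -lgslcblas -lm */
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>

int main(void)
{
    /* Shapes from the first instance: (4x256) * (256x16384). */
    gsl_matrix *A = gsl_matrix_alloc(4, 256);
    gsl_matrix *B = gsl_matrix_alloc(256, 16384);
    gsl_matrix *C = gsl_matrix_alloc(4, 16384);

    gsl_matrix_set_all(A, 1.0);   /* stand-in data */
    gsl_matrix_set_all(B, 1.0);
    gsl_matrix_set_zero(C);

    /* C += A * B: beta = 1.0 is the "add the resulting block to the
       respective rows of the output" step. */
    gsl_blas_dgemm(CblasNoTrans, CblasNoTrans, 1.0, A, B, 1.0, C);

    gsl_matrix_free(A);
    gsl_matrix_free(B);
    gsl_matrix_free(C);
    return 0;
}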

Kevin Barry
 
I mean what the core actually does, the code which actually gets executed. You don't only have your code running; you also have the OS doing the paging and MMU handling. This looks like a case of differing cache hit rates: your case #1 is running cooler because the cores are most likely waiting for main memory to deliver some data. As I said, using the performance counters can tell you a lot.
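If you want to sanity-check that outside your program, a toy loop like the one below should show it. This is purely illustrative, not your code: the same number of additions, once over a sequential walk and once over a strided walk that defeats the caches.

Code:
#include <stdio.h>
#include <stdlib.h>

#define N      (64UL * 1024 * 1024)   /* 64M doubles = 512 MB */
#define STRIDE 1021                   /* odd, ~8 KB jumps between accesses */

int main(int argc, char **argv)
{
    double *a = malloc(N * sizeof(double));
    if (a == NULL) { fprintf(stderr, "malloc failed\n"); return 1; }
    for (size_t i = 0; i < N; i++)
        a[i] = (double)i;

    double sum = 0.0;
    if (argc > 1) {
        /* strided: cores spend most of their time waiting on memory */
        for (size_t k = 0; k < N; k++)
            sum += a[(k * STRIDE) % N];
    } else {
        /* sequential: caches and prefetch keep the FPU busy */
        for (size_t k = 0; k < N; k++)
            sum += a[k];
    }
    printf("%f\n", sum);   /* keep the compiler from dropping the loop */
    free(a);
    return 0;
}

Run each variant under the same counters and compare cache misses per instruction; if the stall explanation is right, the strided run should also end up cooler.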
 
I think this is a start. I ran both with pmcstat -w 10 -p instructions -p dc-misses -p ic-misses. A typical line of output for each when the temp is high:
  1. Code:
    #  p/instructions     p/dc-misses     p/ic-misses
         309810972151               0         6466051
  2. Code:
    #  p/instructions     p/dc-misses     p/ic-misses
         538303758005               0         7510967
Are there better PMCs to use for this? This is actually the first time I've used it.

Also, running the program with pmcstat changed the process execution slightly, so those figures might not be representative of what's actually happening. Both ran more slowly, and with pmcstat core usage dropped below 25% fairly often for the first instance. Thanks!

Kevin Barry
 