[Solved] ARC always wanting to shrink

My ZFS seems to have swallowed something indigestible and is constantly trying to evict:

Code:
ARC Size:                               49.92%  399.38  MiB
        Target Size: (Adaptive)         25.00%  200.00  MiB
        Min Size (Hard Limit):          25.00%  200.00  MiB
        Max Size (High Water):          4:1     800.00  MiB
It tries to drive the ARC down to the lowest allowed value, although there is ample free memory. And the actual cached data is already about as small as it can get, at just 19 MB compressed:
Code:
ARC: 411M Total, 42M MFU, 85M MRU, 2024K Anon, 235M Header, 46M Other
     19M Compressed, 113M Uncompressed, 5.81:1 Ratio
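
For reference: output in that first format is what zfs-stats -A from the sysutils/zfs-stats port prints (if memory serves), and the second block is top(1)'s ARC line, so something like this should reproduce the numbers:
Code:
# ARC summary, assuming the sysutils/zfs-stats port is installed
zfs-stats -A
# top's header carries the compressed/uncompressed ARC line
top -b | head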

If I move vfs.zfs.arc_min up manually, the ARC does grow again:

Code:
ARC Size:                               55.59%  444.76  MiB
        Target Size: (Adaptive)         100.00% 800.00  MiB
        Min Size (Hard Limit):          100.00% 800.00  MiB
        Max Size (High Water):          1:1     800.00  MiB

But when I put it back down, the target size also drops back down very fast, within seconds:
Code:
ARC Size:                               54.75%  438.03  MiB
        Target Size: (Adaptive)         50.28%  402.25  MiB
        Min Size (Hard Limit):          25.00%  200.00  MiB
        Max Size (High Water):          4:1     800.00  MiB
ARC Size:                               54.76%  438.06  MiB
        Target Size: (Adaptive)         37.77%  302.13  MiB
        Min Size (Hard Limit):          25.00%  200.00  MiB
        Max Size (High Water):          4:1     800.00  MiB
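
For the record, that knob-twiddling is just runtime sysctls; a sketch with the byte values for 800 MiB and 200 MiB:
Code:
# raise the ARC floor to 800 MiB (the value is in bytes)
sysctl vfs.zfs.arc_min=838860800
# ...and back down to 200 MiB
sysctl vfs.zfs.arc_min=209715200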

This appeared tonight during some housekeeping jobs. It looks as if some variable overflowed, and now ZFS erroneously thinks it is utterly out of memory.

I didn't find a way to get rid of it (except, presumably, rebooting); it is certainly a bug, probably difficult to reproduce, and I will not bother searching for it.
I just post this here so the community knows this can happen. If anybody else gets hit, you're welcome to pursue it further. The release is 11.3, on i386 (it might well be an escaped 64-bit conversion error with the value gone negative).
 
Have you tested this with any 12.x release, and if so, do you see the same behavior? I've noticed ARC behaves differently (in a good way) there.

That said, I've never used an ARC that small, and never run ZFS on i386, so this could also be something weird related to either of those.
 
Negative. I don't see any viable approach to "test" this at all, no matter the version.
I have seen ZFS get into strange behaviour occasionally before - say, once a year or so. But that was usually during other tests/modifications and could have had many causes. Only this time the misbehaviour was clearly visible, with no apparent cause, so it was at least worth documenting.

But then, how could a test even be set up here? The only approach I see would be to design and run very extensive, crafted load envelopes - and I do have other business to pursue. A simpler approach would have been to pull a kernel dump in the very act and then look at the internal variables to see which of them had gone crazy. In any case, finding and fixing the issue appears to be simpler than testing for it.
But then, any of that is a huge effort, and, as you say, things might be different in another version. There is a bunch of other things that bother me more, and enough work to do to engage the whole dwarven army of Moria...
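(Short of a full dump, the ARC's internal counters can at least be watched live via sysctl; a quick check like this might catch the runaway variable in the act:)
Code:
# live view of the ARC's internal state; look for implausible/negative values
sysctl kstat.zfs.misc.arcstats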
 
Ah, so although this happened once, you've never had it happen since, and don't know how to reproduce it?
 
Precisely. It stayed in that state for a day or two, until I got around to rebooting. (Rebooting is NOT a solution.)
 
It happened again, and I figured it out: this happens when the KVA heap memory gets too fragmented (more precisely, when no contiguous 16 MB block is available anymore).

It can be checked with this command:
Code:
dtrace -n 'arc-available_memory {printf("%d", arg1);}'
The number in the last column has these meanings (counting from 0 to 7):
Code:
typedef enum free_memory_reason_t {
        FMR_UNKNOWN,
        FMR_NEEDFREE,
        FMR_LOTSFREE,
        FMR_SWAPFS_MINFREE,
        FMR_PAGES_PP_MAXIMUM,
        FMR_HEAP_ARENA,
        FMR_ZIO_ARENA,
        FMR_ZIO_FRAG,
} free_memory_reason_t;
So, '2' (FMR_LOTSFREE) is the good one here.
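If you don't want to stare at the raw stream, the same probe can also be aggregated; a one-liner along these lines counts how often each reason code fires:
Code:
# tally the free_memory_reason codes as they fire (Ctrl-C to print)
dtrace -n 'arc-available_memory { @reasons[arg1] = count(); }'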
 
This was a classic example of "how to shoot yourself in the foot".

While investigating this issue (where the ARC kept growing, way above arc_max), I checked whether any sysctl options might influence that behaviour, and so came across a new option, vfs.zfs.abd_scatter_enabled; I found that it does increase memory usage, and switched it off.
Deeper analysis shows that while the option indeed increases memory usage, in this case that is a good thing, because it is used to structure the memory allocations in a way that reduces kernel heap fragmentation. Re-engaging this option makes the problem, if not go away entirely, at least get a lot better.
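So, checking and re-engaging it is a one-liner each (1 should be the default):
Code:
# check the current setting (1 = scatter ABDs enabled)
sysctl vfs.zfs.abd_scatter_enabled
# switch it back on
sysctl vfs.zfs.abd_scatter_enabled=1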
 