Apache Solr Java usage in FreeBSD

I am running Solr 8.6.2 under FreeBSD 12.3 inside a VM with 32 GB of RAM:
Hypervisor: Origin = "VMwareVMware"
real memory = 34359738368 (32768 MB)
avail memory = 33370628096 (31824 MB)

Solr is installed from packages and I am limiting its memory like so:

/usr/local/openjdk11/bin/java -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled

Despite this, the process keeps taking an increasing amount of memory, spills into swap, and eventually exhausts it; the box fails unless I monitor it and restart Solr every two weeks or so. Right now I'm stuck in monitoring hell, having to endure countless swap warning messages.

I asked the Solr team about this; they claim it's a FreeBSD issue and that their process really only occupies 16 GB of memory.

top reports this and I'm at a loss as to what to do next...

28 processes: 1 running, 27 sleeping
CPU: 35.9% user, 0.0% nice, 2.7% system, 0.1% interrupt, 61.3% idle
Mem: 23G Active, 4132M Inact, 1777M Laundry, 2258M Wired, 1572M Buf, 682M Free
Swap: 4096M Total, 1947M Used, 2149M Free, 47% Inuse

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
82778 solr 284 52 0 252G 27G uwait 1 713:51 0.00% java
83076 zookeeper 56 52 0 9876M 1724M uwait 1 11:46 0.00% java
 
Looks like something is leaking memory; I would bet on Solr or the JVM. Please check which exact version of the JDK you run and search their issue trackers for bugs related to memory leaks. You can also attach to the running Solr process using tools like VisualVM and see if there is something suspicious. Edit: what filesystem are you using? Maybe extensive I/O operations cause such strange behavior.
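If it's native (off-heap) memory rather than the Java heap that's growing, the JVM's own Native Memory Tracking can break it down by category. A rough sketch (it needs one extra startup flag and a restart, adds a small overhead, and <pid> is Solr's pid):
Code:
# add to the Solr JVM flags, then restart Solr
-XX:NativeMemoryTracking=summary
# later, as the solr user:
/usr/local/openjdk11/bin/jcmd <pid> VM.native_memory summary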
 
What does jps -lv print?
Code:
jps -lvV
83076 org.apache.zookeeper.server.quorum.QuorumPeerMain -Dzookeeper.root.logger=INFO,ROLLINGFILE -Dzookeeper.log.dir=/var/log/zookeeper -Dlog4j.configuration=file:/usr/local/etc/zookeeper/log4j.properties
88310 jdk.jcmd/sun.tools.jps.Jps -Dapplication.home=/usr/local/openjdk11 -Xms8m -Djdk.module.main=jdk.jcmd

It's openjdk11, which was the dependency for solr-8.6.2. I'm using UFS, and disk usage is light, maybe 2-3 MB/sec.
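(For the issue-tracker search, the exact package and JVM build strings should be retrievable with something like:)
Code:
pkg info -x openjdk11
/usr/local/openjdk11/bin/java -version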
 
And where exactly is Solr? The jps output above only shows ZooKeeper and the Jps tool itself.
 
I'm only interested in double-checking the arguments; try procstat -c <pid>.

Oh ok, here we go
Code:
  PID COMM             ARGS
86150 java             /usr/local/openjdk11/bin/java -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/log/solr/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -DzkClientTimeout=15000 -DzkHost=192.168.1.4:2181,192.168.1.9:2181,192.168.1.10:2181,192.168.1.11:2181 -Dsolr.log.level=WARN -Dsolr.log.dir=/var/log/solr -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=192.168.1.11 -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -Djetty.home=/usr/local/solr/server -Dsolr.solr.home=/var/db/solr -Dsolr.data.home= -Dsolr.install.dir=/usr/local/solr -Dsolr.default.confdir=/usr/local/solr/server/solr/configsets/_default/conf -Djetty.host=192.168.1.11 -Djetty.host=0.0.0.0 -Dlog4j2.formatMsgNoLookups=true -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/usr/local/solr/bin/oom_solr.sh 8983 /var/log/solr -jar start.jar --module=http
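As a side check, the heap-related values the running VM actually settled on can be dumped with jcmd; just a sketch, run as the solr user, with the pid taken from the procstat output above:
Code:
/usr/local/openjdk11/bin/jcmd 86150 VM.flags
/usr/local/openjdk11/bin/jcmd 86150 GC.heap_info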
 
Maybe it's Zookeeper that's running out of memory?
 
No, it's not Zookeeper. The top output posted earlier shows 27 GB of resident memory being consumed by Solr, which is admittedly a bit too much against the 16 GB heap limit. I'm assuming it's probably just index files mapped with MMapDirectory, but it wouldn't hurt to verify that assumption. (Still thinking about how this can be done.)
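One rough way to check that on FreeBSD might be procstat's memory-map view: vnode-backed ("vn") mappings pointing at the index files would be the MMapDirectory segments, and their resident pages are part of what top reports as RES. A sketch, using the pid from the earlier top output:
Code:
# file-backed (vn) mappings only; the RES/PRES columns are page counts
procstat -v 82778 | grep ' vn '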

Does the /var/log/solr/solr_gc.log have anything interesting in it when the box runs out of swap? Are there oom logs in the /var/log/solr directory?

I'm not sure what to look for in that file. There's obviously tons of output. On this particular server, swap usage goes from 0% to 45% within 12 hours. Here's a small excerpt from the gc log file in case it's relevant

[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Humongous regions: 922->805
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Metaspace: 253500K->253500K(1318912K)
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Pause Young (Normal) (G1 Evacuation Pause) 10569M->8882M(16384M) 84.918ms
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) User=0.16s Sys=0.00s Real=0.08s
[2022-01-27T23:08:45.867+0500][69522.487s] GC(6096) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-27T23:08:45.867+0500][69522.487s] GC(6096) Using 2 workers of 2 for evacuation
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Pre Evacuate Collection Set: 0.1ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Evacuate Collection Set: 89.1ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Post Evacuate Collection Set: 2.9ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Other: 0.6ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Eden regions: 107->0(132)
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Survivor regions: 9->7(15)
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Old regions: 297->301

I'm now down to restarting this server daily in order to keep it from becoming unresponsive
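For what it's worth, in that excerpt the young collection takes the heap from 10569M down to 8882M, well inside the 16384M cap, so the Java heap itself looks healthy; whatever keeps growing appears to live outside it. A quick way to scan the whole log for anything worse, assuming the log paths from the startup flags (just a sketch):
Code:
# any full GCs at all?
grep -c "Pause Full" /var/log/solr/solr_gc.log*
# heap occupancy after the most recent young pauses
grep "Pause Young" /var/log/solr/solr_gc.log | tail -5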
 
And to update my post: here's the situation after I restarted Solr.

Apparently it wasn't just Solr tying up that 45% of swap; ZooKeeper was holding about 22% of it as well:

last pid: 93353; load averages: 0.68, 1.03, 1.13 up 376+08:11:50 23:18:10
31 processes: 1 running, 30 sleeping
CPU: 0.0% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.8% idle
Mem: 3035M Active, 7516M Inact, 867M Laundry, 2009M Wired, 1569M Buf, 18G Free
Swap: 4096M Total, 910M Used, 3186M Free, 22% Inuse

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
83076 zookeeper 56 52 0 9864M 901M uwait 1 23:13 0.67% java

I've now restarted ZooKeeper too, and that ~900 MB was released as well:

last pid: 93687; load averages: 1.56, 1.35, 1.25 up 376+08:14:13 23:20:33
32 processes: 1 running, 31 sleeping
CPU: 41.6% user, 0.0% nice, 3.5% system, 0.2% interrupt, 54.7% idle
Mem: 20G Active, 7893M Inact, 5808K Laundry, 2099M Wired, 1570M Buf, 1489M Free
Swap: 4096M Total, 12M Used, 4084M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
93412 solr 173 52 0 253G 19G uwait 1 2:29 89.37% java
93686 zookeeper 44 52 0 9854M 128M uwait 1 0:02 0.10% java
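As an aside, recent FreeBSD top can show approximate per-process swap use directly, which would have pointed at the ZooKeeper share without restarting anything. A sketch, assuming a 12.x top and the base-system swapinfo:
Code:
# -w adds a per-process SWAP column, -b is batch mode
top -b -w -o res
# overall swap device usage
swapinfo -h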
 
This might be a legit bug in the OpenJDK port to FreeBSD. Not sure how to prove that beyond trying the same workload on Linux, unfortunately.
 
That's pretty much what the Solr team told me as well. Should I be forwarding this to someone on the FreeBSD side?
 
The bug would be that the JVM is ignoring the -Xmx parameter.
I think filing a bug might be your best recourse. It doesn't look like many humans participate in the freebsd-java mailing list.

The problem I see is that it's hard to reproduce. Basically you only see this after some time of heavy use? Can you reproduce it using synthetic traffic?

Another thing to try is reducing the heap size to, say, 10 GB and seeing how much memory the process then consumes. Is the footprint a linear function of the -Xmx parameter, or is the parameter simply ignored? This may be unacceptable on a production machine, since it could mean needing restarts even more often.
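Concretely, that experiment would only mean changing the two heap flags in the startup line and leaving everything else as it is; a sketch, not a recommendation for a production box:
Code:
/usr/local/openjdk11/bin/java -server -Xms10g -Xmx10g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled ...
If the resident size then levels off around 10 GB plus some overhead, the limit is being honoured and the growth is elsewhere; if it still climbs far past that, the parameter really is being ignored.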
 
Interesting. I tried to find out the value of MMapDirectory.UNMAP_SUPPORTED on FreeBSD, expecting it to be false, but it turns out to be true, and the setUseUnmap() hack they mention in the class's javadoc works without complaint:
Java:
import java.nio.file.Paths;

import org.apache.lucene.store.MMapDirectory;

public class TestLucene
{
    public static void main( String[] args ) throws Exception
    {
        System.out.println(MMapDirectory.UNMAP_SUPPORTED);
        System.out.println(MMapDirectory.UNMAP_NOT_SUPPORTED_REASON);

        MMapDirectory dir = new MMapDirectory( Paths.get( "/usr/home", "me" ) );
        dir.setUseUnmap( true );
    }
}
Output:
Code:
true
null
Maybe ask the Solr folks if there's a way to enable that hack on FreeBSD?
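If it helps that conversation: assuming Solr's MMapDirectoryFactory still accepts an unmap init parameter (worth confirming against the 8.6 docs, this is only a sketch from memory), it might be switchable per core in solrconfig.xml:
Code:
<!-- sketch: force MMapDirectory and request the unmap hack; verify the parameter name first -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <bool name="unmap">true</bool>
</directoryFactory>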
 