Apache Solr Java usage in FreeBSD

I am running Solr 8.6.2 under FreeBSD 12.3 inside a VM with 32 GB of RAM:
Hypervisor: Origin = "VMwareVMware"
real memory = 34359738368 (32768 MB)
avail memory = 33370628096 (31824 MB)

Solr is installed from packages and I am limiting its memory like so:

/usr/local/openjdk11/bin/java -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled

Despite this, the process keeps taking an increasing amount of memory, spills into swap, and eventually exhausts it; the box fails unless I monitor it and restart Solr every two weeks or so. Right now I'm stuck in monitoring hell, having to endure countless swap warning messages.

I asked the Solr team about this; they claim it's a FreeBSD issue and that their process really only occupies 16 GB of memory.

top reports this and I'm at a loss as to what to do next...

28 processes: 1 running, 27 sleeping
CPU: 35.9% user, 0.0% nice, 2.7% system, 0.1% interrupt, 61.3% idle
Mem: 23G Active, 4132M Inact, 1777M Laundry, 2258M Wired, 1572M Buf, 682M Free
Swap: 4096M Total, 1947M Used, 2149M Free, 47% Inuse

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
82778 solr 284 52 0 252G 27G uwait 1 713:51 0.00% java
83076 zookeeper 56 52 0 9876M 1724M uwait 1 11:46 0.00% java
 
Looks like something is leaking memory; I would bet on Solr or the JVM. Please check which exact version of the JDK you run and search their issue trackers for bugs related to memory leaks. You can also attach to the running Solr process using tools like VisualVM and see if there is something suspicious. Edit: what filesystem are you using? Maybe extensive I/O operations cause such strange behavior.
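If it's native (off-heap) memory rather than the Java heap that's growing, the JVM's own Native Memory Tracking can break it down by category. A rough sketch (it needs one extra startup flag and a restart, adds a small overhead, and <pid> is Solr's pid):
Code:
# add to the Solr JVM flags, then restart Solr
-XX:NativeMemoryTracking=summary
# later, as the solr user:
/usr/local/openjdk11/bin/jcmd <pid> VM.native_memory summary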
 
What does jps -lv print?
Code:
jps -lvV
83076 org.apache.zookeeper.server.quorum.QuorumPeerMain -Dzookeeper.root.logger=INFO,ROLLINGFILE -Dzookeeper.log.dir=/var/log/zookeeper -Dlog4j.configuration=file:/usr/local/etc/zookeeper/log4j.properties
88310 jdk.jcmd/sun.tools.jps.Jps -Dapplication.home=/usr/local/openjdk11 -Xms8m -Djdk.module.main=jdk.jcmd

It's openjdk11, which was the dependency for solr-8.6.2. I'm using UFS, and disk usage is light, maybe 2-3 MB/sec.
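(For the issue-tracker search, the exact package and JVM build strings should be retrievable with something like:)
Code:
pkg info -x openjdk11
/usr/local/openjdk11/bin/java -version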
 
And where exactly is Solr? The jps output above only shows ZooKeeper and the Jps tool itself.
 
I'm only interested in double-checking the arguments; try procstat -c <pid>.

Oh ok, here we go
Code:
  PID COMM             ARGS
86150 java             /usr/local/openjdk11/bin/java -server -Xms16g -Xmx16g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -Xlog:gc*:file=/var/log/solr/solr_gc.log:time,uptime:filecount=9,filesize=20M -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes= -DzkClientTimeout=15000 -DzkHost=192.168.1.4:2181,192.168.1.9:2181,192.168.1.10:2181,192.168.1.11:2181 -Dsolr.log.level=WARN -Dsolr.log.dir=/var/log/solr -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dhost=192.168.1.11 -Duser.timezone=UTC -XX:-OmitStackTraceInFastThrow -Djetty.home=/usr/local/solr/server -Dsolr.solr.home=/var/db/solr -Dsolr.data.home= -Dsolr.install.dir=/usr/local/solr -Dsolr.default.confdir=/usr/local/solr/server/solr/configsets/_default/conf -Djetty.host=192.168.1.11 -Djetty.host=0.0.0.0 -Dlog4j2.formatMsgNoLookups=true -Xss256k -Dsolr.jetty.https.port=8983 -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/usr/local/solr/bin/oom_solr.sh 8983 /var/log/solr -jar start.jar --module=http
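As a side check, the heap-related values the running VM actually settled on can be dumped with jcmd; just a sketch, run as the solr user, with the pid taken from the procstat output above:
Code:
/usr/local/openjdk11/bin/jcmd 86150 VM.flags
/usr/local/openjdk11/bin/jcmd 86150 GC.heap_info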
 
Maybe it's Zookeeper that's running out of memory?
 
No, it's not Zookeeper. The top output posted earlier shows 27 GB of resident memory being consumed by Solr, which is admittedly a bit too much against the 16 GB heap limit. I'm assuming it's probably just index files mapped with MMapDirectory, but it wouldn't hurt to verify that assumption. (Still thinking about how this can be done.)
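One rough way to check that on FreeBSD might be procstat's memory-map view: vnode-backed ("vn") mappings pointing at the index files would be the MMapDirectory segments, and their resident pages are part of what top reports as RES. A sketch, using the pid from the earlier top output:
Code:
# file-backed (vn) mappings only; the RES/PRES columns are page counts
procstat -v 82778 | grep ' vn '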

Does the /var/log/solr/solr_gc.log have anything interesting in it when the box runs out of swap? Are there oom logs in the /var/log/solr directory?

I'm not sure what to look for in that file. There's obviously tons of output. On this particular server, swap usage goes from 0% to 45% within 12 hours. Here's a small excerpt from the gc log file in case it's relevant

[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Humongous regions: 922->805
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Metaspace: 253500K->253500K(1318912K)
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) Pause Young (Normal) (G1 Evacuation Pause) 10569M->8882M(16384M) 84.918ms
[2022-01-27T23:08:32.873+0500][69509.494s] GC(6095) User=0.16s Sys=0.00s Real=0.08s
[2022-01-27T23:08:45.867+0500][69522.487s] GC(6096) Pause Young (Normal) (G1 Evacuation Pause)
[2022-01-27T23:08:45.867+0500][69522.487s] GC(6096) Using 2 workers of 2 for evacuation
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Pre Evacuate Collection Set: 0.1ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Evacuate Collection Set: 89.1ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Post Evacuate Collection Set: 2.9ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Other: 0.6ms
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Eden regions: 107->0(132)
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Survivor regions: 9->7(15)
[2022-01-27T23:08:45.960+0500][69522.580s] GC(6096) Old regions: 297->301

I'm now down to restarting this server daily in order to keep it from becoming unresponsive
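For what it's worth, in that excerpt the young collection takes the heap from 10569M down to 8882M, well inside the 16384M cap, so the Java heap itself looks healthy; whatever keeps growing appears to live outside it. A quick way to scan the whole log for anything worse, assuming the log paths from the startup flags (just a sketch):
Code:
# any full GCs at all?
grep -c "Pause Full" /var/log/solr/solr_gc.log*
# heap occupancy after the most recent young pauses
grep "Pause Young" /var/log/solr/solr_gc.log | tail -5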
 
And to update my post: here's the situation after I restarted Solr.

Apparently it wasn't just Solr tying up that 45% of swap; ZooKeeper was holding about 22% of it as well:

last pid: 93353; load averages: 0.68, 1.03, 1.13 up 376+08:11:50 23:18:10
31 processes: 1 running, 30 sleeping
CPU: 0.0% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.8% idle
Mem: 3035M Active, 7516M Inact, 867M Laundry, 2009M Wired, 1569M Buf, 18G Free
Swap: 4096M Total, 910M Used, 3186M Free, 22% Inuse

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
83076 zookeeper 56 52 0 9864M 901M uwait 1 23:13 0.67% java

I've now restarted ZooKeeper too, and that ~900 MB was released as well:

last pid: 93687; load averages: 1.56, 1.35, 1.25 up 376+08:14:13 23:20:33
32 processes: 1 running, 31 sleeping
CPU: 41.6% user, 0.0% nice, 3.5% system, 0.2% interrupt, 54.7% idle
Mem: 20G Active, 7893M Inact, 5808K Laundry, 2099M Wired, 1570M Buf, 1489M Free
Swap: 4096M Total, 12M Used, 4084M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
93412 solr 173 52 0 253G 19G uwait 1 2:29 89.37% java
93686 zookeeper 44 52 0 9854M 128M uwait 1 0:02 0.10% java
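As an aside, recent FreeBSD top can show approximate per-process swap use directly, which would have pointed at the ZooKeeper share without restarting anything. A sketch, assuming a 12.x top and the base-system swapinfo:
Code:
# -w adds a per-process SWAP column, -b is batch mode
top -b -w -o res
# overall swap device usage
swapinfo -h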
 
This might be a legit bug in the OpenJDK port to FreeBSD. Not sure how to prove that beyond trying the same workload on Linux, unfortunately.
 
That's pretty much what the Solr team told me as well. Should I be forwarding this to someone on the FreeBSD side?
 
The bug would be that the JVM is ignoring the -Xmx parameter.
I think filing a bug might be your best recourse. It doesn't look like many humans participate in the freebsd-java mailing list.

The problem I see is that it's hard to reproduce. Basically you only see this after some time of heavy use? Can you reproduce it using synthetic traffic?

Another thing to try is reducing the heap size to, say, 10 GB and seeing how much memory the process then consumes. Is the footprint a linear function of the -Xmx parameter, or is the parameter simply ignored? This may be unacceptable on a production machine, since it could mean needing restarts even more often.
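Concretely, that experiment would only mean changing the two heap flags in the startup line and leaving everything else as it is; a sketch, not a recommendation for a production box:
Code:
/usr/local/openjdk11/bin/java -server -Xms10g -Xmx10g -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled ...
If the resident size then levels off around 10 GB plus some overhead, the limit is being honoured and the growth is elsewhere; if it still climbs far past that, the parameter really is being ignored.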
 
Interesting. I tried to find out the value of MMapDirectory.UNMAP_SUPPORTED on FreeBSD, expecting it to be false, but it turns out to be true, and the setUseUnmap() hack they mention in the class's javadoc works without complaint:
Java:
import java.nio.file.Paths;

import org.apache.lucene.store.MMapDirectory;

public class TestLucene
{
    public static void main( String[] args ) throws Exception
    {
        System.out.println(MMapDirectory.UNMAP_SUPPORTED);
        System.out.println(MMapDirectory.UNMAP_NOT_SUPPORTED_REASON);

        MMapDirectory dir = new MMapDirectory( Paths.get( "/usr/home", "me" ) );
        dir.setUseUnmap( true );
    }
}
Output:
Code:
true
null
Maybe ask the Solr folks if there's a way to enable that hack on FreeBSD?
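If it helps that conversation: assuming Solr's MMapDirectoryFactory still accepts an unmap init parameter (worth confirming against the 8.6 docs, this is only a sketch from memory), it might be switchable per core in solrconfig.xml:
Code:
<!-- sketch: force MMapDirectory and request the unmap hack; verify the parameter name first -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <bool name="unmap">true</bool>
</directoryFactory>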
 