IPFW VIMAGE data spillage between ipfw instances (dummynet)

Trying to run a traffic shaper in a VIMAGE jail. (It does not appear to work.)

The simple part: sysctl net.inet.ip.fw.one_pass=0 must be set for every jail individually.

For half a year now I've been running VIMAGE jails and setting up ipfw within them, but I never noticed that they all run with one_pass=1, in contrast to the base system. This only becomes apparent when configuring pipes/queues, where it obviously does not work: the packets leave the ruleset after being passed to the queue and cannot be processed further.
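For example (the jail name myjail is just a placeholder):
Code:
# check the value inside the jail - VIMAGE jails start with 1
jexec myjail sysctl net.inet.ip.fw.one_pass
# set it to 0 so packets re-enter the ruleset after a pipe/queue
jexec myjail sysctl net.inet.ip.fw.one_pass=0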

The freaky part: pipes and queues exist only once.

After things were running as expected, I deleted the traffic shaper in the base system, as it should no longer be needed. And the one in the jail was also gone!

There is only one instance of pipes+queues. The same ones are visible in the base system and in every jail, and they can be created, modified and deleted from any jail! So, if you have a traffic shaper in the base system, and give a jail to some freaks, they can do with it whatever they want...
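To illustrate (pipe number and jail name are made up):
Code:
# on the host: create a pipe for the traffic shaper
ipfw pipe 1 config bw 10Mbit/s
# any VIMAGE jail sees the very same pipe...
jexec myjail ipfw pipe list
# ...and can reconfigure or delete it
jexec myjail ipfw pipe 1 delete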

The known part: delays do not work.

This one is already on record: https://docs.freebsd.org/cgi/getmsg.cgi?fetch=40348+0+archive/2019/freebsd-net/20190728.freebsd-net
Sadly, there is no comment on the issue whatsoever.
But in the light of the further findings, I would suppose these packets do not just "get lost"; rather, they reappear at some unexpected place...

The ugly part: data traverses between ipfw instances

Since I do not need delays on the pipes, I then started to run the whole thing. And I observed very strange behaviour: e.g. I could access my own webserver with http, but not with https. I could read this forum more or less well (with occasionally missing CSS), but when I tried to compose a message, I got a timeout and "secure connection failed".
Analyzing the whole packet path I found that some packets, which should go through the pipe in the jail's ipfw, were rejected in the base system's ipfw, at a rule which they could not possibly have reached. Also, some of these packets were reported as incoming on a netif that did not belong to the base system, but to the jail!
Switching off the traffic shaper in the jail resolves all these issues.

The only explanation I have so far is: the pipes do not keep track of which ipfw instance a packet belongs to, and re-insert it into some arbitrary ruleset, depending on packet size, port number or whatever.
 
You should treat jails and guest VMs as insecure environments and should not give them any control over the network. All VLAN tagging / firewall filtering / queueing etc. should be done one level above, on the HOST, and if you don't trust the host, then control it on the device before it (router or switch).
 
You should treat jails and guest VMs as insecure environments and should not give them any control over the network. All VLAN tagging / firewall filtering / queueing etc. should be done one level above, on the HOST

I'm saying that they have control already! When doing firewall queueing on the host, any jail has access to and can modify that configuration - by design.

Also, I have proof that traffic that happens inside the jail can spill out and suddenly appear within the host's routing. I do not yet have proof of the other direction (that traffic from within the host spills into the jail - which would be really bad), but from what I have seen so far, I do not consider that impossible.

I am looking at this from a purely technical viewpoint, figuring out what actually does happen, and leaving the security concerns out of the loop for that. If we bring in the security/best-practices/paranoia stuff, then for now I would tend to say that DUMMYNET is insecure, and any kernel compiled with VIMAGE and DUMMYNET together should not be used for a secure environment.
 
Analyzing the whole packet path I found that some packets, which should go through the pipe in the jail's ipfw, were rejected in the base system's ipfw, at a rule which they could not possibly have reached. Also, some of these packets were reported as incoming on a netif that did not belong to the base system, but to the jail!
This sounds like a namespace issue. Probably due to having only one pipe. I can imagine some overlap with states getting matched to the 'wrong' jail or interface.

I think the best way would be to ask on the freebsd-ipfw mailing list. I'm sure someone there can shed more light on the issue.
 
This sounds like a namespace issue. Probably due to having only one pipe. I can imagine some overlap with states getting matched to the 'wrong' jail or interface.

I just tracked it down:
There are two ipfw rulesets involved: one in the base system, and one in the jail.
In the jail there are these rules:
Code:
/sbin/ipfw add 1920 set 16 queue 13 udp from 192.168.98.10/32 to XX.XX.XX.XX/32 5006
/sbin/ipfw add 1930 set 16 skipto 2500 udp from 192.168.98.10/32 to XX.XX.XX.XX/32 5006
But the counters do not match:
Code:
01920      47     33520 queue 13 udp from 192.168.98.10 to XX.XX.XX.XX 5006
01930      25      4144 skipto 2500 udp from 192.168.98.10 to XX.XX.XX.XX 5006
Some of the packets that go into the queue do not reappear at rule #1930.

But then, if I insert a rule #1930 on the base system, I can grab exactly these packets!
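Something like this does the trick (the count action is just my choice here; any rule at that number sees the strays):
Code:
# base system: packets that left the jail's queue reappear here
ipfw add 1930 count udp from 192.168.98.10/32 to XX.XX.XX.XX/32 5006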

Furthermore: when the pipe has 0ms delay, only the big packets are misdelivered, whereas with a delay on the pipe, all packets are misdelivered.
This explains the phenomenon: if a packet can immediately traverse the pipe, it gets correctly reinserted, but if it gets delayed, the proper information appears to be lost, and it gets reinserted at the base system. (Big packets always suffer some delay due to bandwidth limiting.)
(The reason for this behaviour is probably that I have configured net.inet.ip.dummynet.io_fast=1.)
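If that is so, then disabling the fast path should make every packet take the slow path and get misdelivered - a cross-check rather than a fix:
Code:
# force every packet through the dummynet scheduler (slow path)
sysctl net.inet.ip.dummynet.io_fast=0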

So this is the very same issue that was already reported in the aforementioned list message https://docs.freebsd.org/cgi/getmsg.cgi?fetch=40348+0+archive/2019/freebsd-net/20190728.freebsd-net and got no attention.
 
So, let's have a thorough look at this calamity:

1.
You can use dummynet(4) (pipes, queues) only in the base system, not in a jail. This is simply not implemented: you can run ipfw in a VIMAGE jail, but not dummynet.
(Dummynet, i.e. traffic shaping, is usually necessary when routing VoIP and bulk data together over the same saturated links.)

2.
The dummynet configuration in the base system is visible to all jails. It can be modified and deleted from any jail. This means: if you use pipes/queues for manipulating IP traffic, any jail can delete or reconfigure these.

To avoid this, each jail must be set to kern.securelevel=3. Then a jail is no longer allowed to manipulate the dummynet.
If the jail runs its own ipfw, that one can then no longer be reconfigured either, except by restarting the jail - so this approach will not work with blacklistd, suricata, etc.
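For reference, the securelevel can be set per jail, e.g. in /etc/jail.conf (jail name is a placeholder):
Code:
myjail {
        vnet;
        securelevel = 3;        # jail can no longer touch ipfw/dummynet
}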

Even if a jail is not allowed to manipulate dummynet, it is still able to retrieve the current configuration - which may contain sensitive traffic information including current port numbers etc., e.g.:

Code:
q00021 100 KB 35 flows (2048 buckets) sched 2 weight 1 lmax 1492 pri 0 droptail
    mask:  0xff 0xffffffff/0xffff -> 0xffffffff/0xffff
  4 tcp      199.19.53.1/53       192.168.92.3/10042    4     1276  0    0   0
 10 tcp   172.217.16.202/443     192.168.93.49/52828   35     7555  0    0   0
 14 tcp      199.19.57.1/53       192.168.92.3/10047    5     1071  0    0   0
 37 tcp     192.0.32.132/53       192.168.92.3/39206    4      276  0    0   0
 39 tcp     192.0.32.132/53       192.168.92.3/39207   88   124293  0    0   0
[...]

If you don't like this, you cannot run dummynet on a host with VIMAGE jails.

3.
Basically the same applies to the netgraph interface ng_ipfw(4): this can also be used only in the base system.
But here there is not much risk that jails may obtain undue information, because netgraph itself is VIMAGE-aware and builds distinct graphs in each jail; only the ng_ipfw feature is not VIMAGE-aware and exists only once.
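This can be seen from the host (assuming ngctl(8) is available in the jail):
Code:
# lists only the nodes of the jail's own vnet, not the host's graph
jexec myjail ngctl list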
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now for the technical details:

I think the best way would be to ask on the freebsd-ipfw mailing list. I'm sure someone there can shed more light on the issue.
I tried that, but it didn't make it to the list; it was probably not approved. Anyway, it is easier to read the source, because everything that happens is in the source, and nothing can happen that is not in the source.

The manpage ng_ipfw(4) gives a hint on what to look for. (ng_ipfw behaves the same as dummynet: ipfw instances in a VIMAGE jail can contain a netgraph rule, and the respective packets will be sent to netgraph, but on return they will appear in the base system's ruleset.)
Packets injected via the netgraph command are tagged with struct ipfw_rule_ref. This tag contains information that helps the packet to re-enter ipfw(4) processing, should the packet come back from netgraph(4) to ipfw(4).

This struct ipfw_rule_ref can be found in sys/netinet/ip_var.h:
Code:
struct ipfw_rule_ref {
        uint32_t        slot;           /* slot for matching rule       */
        uint32_t        rulenum;        /* matching rule number         */
        uint32_t        rule_id;        /* matching rule id             */
        uint32_t        chain_id;       /* ruleset id                   */
        uint32_t        info;           /* see below                    */
};

This is the data that a packet carries along when temporarily leaving the ruleset (for netgraph, divert, dummynet, etc.), so that it can later be reinserted at the next rule. Here we have problem no. 1: this struct contains a chain_id number, which indeed identifies the originating ruleset - but this number is not static, it increments every time a rule is added to or deleted from the ruleset. So this value offers no easy way to identify the proper ruleset at re-entry - it is actually used only as a flag to detect whether the ruleset has been changed in the meantime, because then the rule position must be recomputed. This value was not put in with the expectation that there might be multiple rulesets to choose from.

So, to reinsert properly into the correct ruleset, maybe another field is needed to hold the jail id. But then, this is a public ABI - and I didn't find any hints that somebody would want to change this in R.13.

Which leads to the second difficulty:
The file sys/net/vnet.c tells a bit about how VIMAGE is implemented. It is done in such a way that other software does not necessarily need to be aware of it. Software usually expects the network stack to exist and does not bother about there being multiple of them - only the system makes certain that what appears as "the" network stack is the proper one.

And since dummynet and ng_ipfw exist only once in the system and are not VIMAGE-aware, their notion of the network stack is obviously the default one - that of the base system. That means, even if we had a jail id in that struct, it may not be trivial to then teach these modules how to choose the proper one from a number of network stacks.
 