PF Practical, Generally Effective Deny-By-Default Implementation

Hey You Guys,

I'm posting here because I spent a little over four years working on a firewall problem, and I ultimately found a solution which relies on FreeBSD's PF firewall as the engine. The funny thing is that I haven't actually implemented the solution with PF yet, but only because the final steps seem trivial and my immediate need to work on it went away. The final solution is therefore not empirically proven, but I feel safe taking it for granted, because I choose to believe that PF supports tables in the way its documentation claims. My development testing was done using a SonicWall TZ300, which worked great under low load but was prone to glitching under higher load, because its internal DNS cache was not well integrated with its address-table implementation.

Four years of concentrated effort is a lot for me. I went through three generations of experimental code before I fully understood the problem and the essential solution was clear. Someone with more prior knowledge of DNS would have gotten there quicker, but I feel the technique deserves to be considered widely. I did my fair share of research on forums before committing to four years of development, but I could not even find anyone asking the same questions -- which is usually a good sign that your thinking is off-track. In my estimation, though, I was on the right track, and the sysadmins whose threads I searched were simply not concerned enough with the problem.


The problem is this: technology has unpatched security vulnerabilities at every software and hardware level, and the only real defenses are isolation, segmentation, and deny-by-default. When the internet was simpler, it was always possible, and in many cases practical, to configure your firewall to deny all outbound connections by default, excepting specific hosts. The tactical reasoning is obvious when you consider that machines on your network will absolutely end up executing virus code and attempting to "phone home". Industry-standard antivirus solutions may work as advertised in many cases, but they are naive, and there is a world of difference between a defense that always works and one which seems to work most of the time, as long as patches and antivirus definitions are up to date -- especially considering that a properly-crafted virus is never detected at all.

Ah, but let's not get off-track. Patches remain important and necessary, at least until some future time when AI can generate perfect software.

The immediate issue preventing me from using my off-the-shelf SonicWall in a deny-by-default configuration is that it doesn't work when you specify the allowed remote hosts by their FQDNs instead of their IP addresses. But neither approach is practical, because everything important is hosted on cloud networks, whose IP addresses change every 20 seconds or so, and every machine on my network -- including the firewall itself -- gets a different IP for the same FQDN. So I wrote a DNS proxy in Perl, positioned in front of Unbound, to dampen the churn of the cloud networks' load balancing, aggregate definitive IP tables for groups of FQDNs, and push table updates to the firewall instantly, so that the default deny rule never applies to connections which should be allowed. I also implemented a DNS firewall in the same process, which was a natural fit because it already had the scope and the position in the network topology to handle that role too.
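To make the PF side concrete (which, again, I haven't actually deployed yet), the ruleset could look something like the sketch below. The interface name, the resolver address and the table name "allowed_hosts" are placeholders of mine, not a tested configuration:

```pf
# pf.conf sketch: outbound deny-by-default with a dynamically maintained table
ext_if = "em0"                     # placeholder external interface

table <allowed_hosts> persist      # survives ruleset reloads; filled by the DNS proxy

# last matching rule wins: block everything outbound by default...
block out on $ext_if all
# ...except DNS to the proxy/resolver itself (placeholder address)...
pass out on $ext_if proto udp from any to 192.0.2.53 port 53
# ...and traffic to addresses the DNS proxy has pushed into the table
pass out on $ext_if proto { tcp, udp } from any to <allowed_hosts> keep state
```

The proxy would then push updates with the standard table commands, e.g. `pfctl -t allowed_hosts -T add 203.0.113.7` when a record is cached, and `pfctl -t allowed_hosts -T delete 203.0.113.7` when it finally ages out.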

The deficiency of the SonicWall's (and probably most other firewalls') internal routines causes problems today for just about any rule that hinges on FQDNs -- but the problem only becomes obvious when users make it obvious, because they can't reach their websites. I suspect user complaints scare sysadmins more than breaches do, which makes it seem natural that the industry has avoided pursuing a difficult solution to this difficult problem.

A few words about my particular implementation: at present, my Perl DNS proxy downloads its FQDN groupings from the SonicWall via SSH at startup, but if I shift to PF, I will store the configuration in either a .conf file or an SQLite database. The process is multi-threaded and also hosts an HTTPS web config for remote administration. In the future it will support failover and high availability.
 
The "default" ipfw(8) does support dynamic tables, too. I commented on the use of Perl on your profile; but that's not so important, as it can be replaced more or less easily. On DNS (spoofing and the like): to prevent, or at least mitigate, MITM attacks, IMHO only DNSSEC is an option. I still have to try harder to understand what the exact problem is... Maybe you can explain it in a few words? I'm not dumb, have some background & "healthy sciolism", but frankly I didn't quite fully understand the problem. Last but not least there is this: rotating/floating/changing IP numbers is a really old thing - it's been done for decades. Likewise, cloud computing is younger, but it's been done for several years now. If this ominous issue is a real issue, I'd strongly guess that a solution exists. Are you really sure you didn't reinvent the wheel, but made it quadratic? I mean, some very big players rely on cloud computing and would have gone bankrupt otherwise. Maybe I'll comment more when I've done some research on that topic.
 
I still find the problem very interesting (see PROBLEM below). I'll continue to develop this software, and I'll begin to use it in production again when the frontend is done.

WHY
------
When security is very important, a deny-by-default firewall configuration can be used to great effect. With all unsanctioned traffic stopped, a few simple heuristics (or fewer) can more easily identify unusual network activity, and it becomes much more feasible to stop a data breach in progress without incident.

Private HTTP proxy servers are typically used to implement deny-by-default. Outbound traffic from workstations is totally blocked, while the organization's web proxies stand on the front line of defense. In my case, human and material resources were both too limited to sustain this model. With only two domain controllers, one general-purpose server and one of me, an HTTP proxy was not a fit. Realistically, I think I would have needed at least two dedicated proxy servers and one full-time coworker. Even then, I would have been reducing the overall fault tolerance of the private infrastructure by funneling all web traffic into just two machines (large organizations use many).

So I kept pressing to find out why my SonicWall firewall could hardly enforce outbound FQDN-based rules when the remote hosts were in the cloud. Packet capturing revealed that the IP addresses of those cloud hosts would match the correct rules for a short time, and then suddenly stop matching. Why? It was obvious from the start that DNS was involved. I did more packet capturing, but this time I wrote a Perl program to intercept and log all DNS traffic in an SQLite database. From this data I determined that the SonicWall's DNS cache must sometimes be "forgetting" IPs, and I verified this by extending the Perl program to explicitly add those IPs to the firewall's configuration using an automated SSH state machine. This stopped the problem; however, it was a bad solution.

I kept testing and developing until I determined exactly what was going on. As it turned out, there are nuanced side effects of the DNS-based load-balancing techniques employed by cloud networks. These side effects do not present themselves unless an IP firewall is relying on a standard DNS cache to translate FQDNs, and they are preventable with a specialized DNS proxy daemon.
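For anyone who wants to reproduce the interception-and-logging step, here is a minimal sketch of the idea in Python (my program was Perl; this is only an illustration). It extracts the query name from a raw DNS query and records it in SQLite; the packet below is hand-built, whereas real traffic would arrive on a UDP socket bound to port 53:

```python
import sqlite3
import time

def parse_qname(pkt: bytes) -> str:
    """Extract the query name from a raw DNS query (question section,
    which carries no name compression)."""
    i = 12                              # skip the 12-byte DNS header
    labels = []
    while pkt[i] != 0:
        n = pkt[i]
        labels.append(pkt[i + 1:i + 1 + n].decode("ascii"))
        i += 1 + n
    return ".".join(labels)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dns_log (ts REAL, qname TEXT)")

def log_query(pkt: bytes) -> str:
    """Log one intercepted query and return the name it asked for."""
    name = parse_qname(pkt)
    db.execute("INSERT INTO dns_log VALUES (?, ?)", (time.time(), name))
    return name

# hand-built query for "example.com": header, QNAME, QTYPE=A, QCLASS=IN
query = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
         b"\x07example\x03com\x00\x00\x01\x00\x01")
```

A real interceptor would also parse the response's answer section, but the question section shown here uses no name compression, which keeps the parser trivial.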

I found many reasons to pursue a worthy solution to this problem:

>Even the standard web proxy model would be much improved in terms of performance and security with a well-working layer-3 IP filter underneath it. (I mean, one which features outbound deny-by-default, the simplest and most practical failsafe defense against one's own compromised machines that I can imagine.)
>There are much more ubiquitous use-cases than mine which could benefit from this same solution. If it were implemented in IoT devices, for example, the potential for botnets could be sharply reduced.
>This DNS proxy exposes a much less vulnerable suite of services than an HTTP proxy, meaning that a solid v1.2 written in C would not require dedicated hardware or VM sandboxing to use safely in production, as HTTP web proxies do. The caveat with this point is that in order to control access to specific cloud-hosted websites, you still either need HTTP proxies or layer 4-7 application filtering. If all you're trying to do is block the vast majority of all unsanctioned outbound traffic by default, as I am, this DNS proxy is a clean alternative.
>This DNS proxy is also a DNS firewall. The threat of malware bypassing the IP firewall via DNS tunnel is a universal threat, so I needed to go in this general direction anyway.
>Even more interesting firewall configurations are possible, such as combining a deny-by-default IP firewall with an allow-by-default DNS firewall. The effect is that any traffic is allowed through, so long as your DNS resolvers are compatible with their nameservers, and your user-agent does a DNS query before it tries to open a connection. This basically rules out a whole category of shady stuff without imposing any extra maintenance work on you or your users - otherwise a completely open internet connection.
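On the DNS-tunnel threat mentioned above: to illustrate the kind of check a DNS firewall can apply, here is a toy heuristic in Python (again, my implementation is Perl, and the thresholds here are invented for illustration, not tuned values). Tunnels encode data in query names, which tends to produce long, high-entropy labels:

```python
import math
from collections import Counter

def label_entropy(label: str) -> float:
    """Shannon entropy of one DNS label, in bits per character."""
    counts = Counter(label)
    total = len(label)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def looks_like_tunnel(qname: str, max_label: int = 40, max_entropy: float = 4.0) -> bool:
    """Flag names whose labels look like encoded payloads: very long, or
    moderately long with high character entropy. Thresholds are illustrative."""
    for label in qname.split("."):
        if len(label) > max_label:
            return True
        if len(label) >= 16 and label_entropy(label) > max_entropy:
            return True
    return False
```

A production DNS firewall would combine checks like this with per-client query-rate limits and blocklists, but even this crude filter shows how the proxy's position in the topology makes such enforcement natural.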


PROBLEM
-----------
The problem is essentially that DNS caches cannot append new records without overwriting old ones. This is by design, but to understand how it produces the instability I described, you have to know that cloud nameservers can hand out a different set of address records for the same name at any time. Now consider the structure of a typical DNS response from a cloud nameserver: a chain of four or five CNAMEs, drilling down into a more and more general pool of hosts, until it finally reaches the IP addresses. Each of these CNAME records has its own TTL, and dozens of other FQDNs can drill down to the same final CNAMEs.

If any of these CNAME records expires, a query for the same or a similar FQDN will trigger a sudden overwrite of the not-yet-expired address records in the cache. The firewall will no longer be aware of those addresses, while the user-agent which queried them will keep using them until they expire. Keep in mind that these TTL values are typically very low, 5-20 seconds. On an active network with just 10 users, all browsing the same handful of cloud applications, this can happen dozens of times per minute. When outbound-allow rules hinge on accurate FQDN interpretation, the result is constant interruptions for the users.
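The failure mode can be demonstrated with a toy model. Here the cache replaces the record set for a name on every resolution, exactly as a standard DNS cache does, and the firewall table derived from it forgets an address that a client is still using (Python for illustration; the names and addresses are made up):

```python
# Toy model: a standard DNS cache REPLACES the record set for a name on every
# resolution, while a client keeps using the address it was given earlier.

naive_cache = {}                          # name -> set of addresses (latest answer only)

def resolve(name, answer):
    """Standard cache behavior: the new answer overwrites the old record set."""
    naive_cache[name] = set(answer)
    return answer

# t=0: a client resolves the name and opens a connection to the first address
client_ip = resolve("app.cloud.example", ["203.0.113.10"])[0]

# t=+15s: another workstation queries the same name; the load balancer hands
# out a different pool, and the cache forgets the address still in use
resolve("app.cloud.example", ["203.0.113.20", "203.0.113.21"])

# a firewall deriving its table from this cache now blocks the live connection
firewall_table = naive_cache["app.cloud.example"]
print(client_ip in firewall_table)        # prints False
```

The CNAME sharing described above just makes this overwrite fire far more often, because a query for a *different* FQDN can evict the records of this one.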


SOLUTION
------------
To keep accurate and complete IP tables for cloud resources, you need a more complex DNS cache structure with some non-standard behavior. Effectively, we are counteracting the load balancer without interrupting traffic.

>To facilitate configuration, the DNS cache must support wildcards and grouping by table name.
>It must append to (rather than replace) existing resource records. Cache objects which represent discrete zones have a one-to-many relationship with the cache objects they point to. It is necessary to store an array of pointers to zones which are CNAMEs of "this" zone, as well as a separate array of pointers to zones for which "this" zone is a CNAME, in order to facilitate proper destruction.
>It must not delete expired address records as soon as the TTL is up; rather, it should disable them until a separate, proprietary timer expires, which I call TTC (Time To Cache). Your user-agents will not see these expired-yet-still-cached records, but the firewall's IP tables will.
>It should control the potential for an explosion in cache size (made possible by the accumulation of appended records). It is possible to enforce a theoretical limit on the maximum number of address records that can accumulate. The policies I implemented, which represent a very basic technique that works, are these:
>>Cache at most one new address record per resolution, regardless of whether the NS responds with multiple address records. A good way to select which record to keep is the one with the longest TTL (they are probably always the same anyway), or just always pick the first in the list.
>>If any of the resolved addresses match an expired record (which your cache may still be holding onto until TTC expires), there is no need to cache a new one; the TTL of the old record can simply be refreshed.
>>I enforced a minimum TTL of about 300 seconds. This seemed to work and is very effective at reducing cache size, but it could cause problems if cloud nodes are re-provisioned after the actual TTL expires. A minimum of 30 might be more reasonable, but there is no way to tell for sure. This policy might be too problematic to enable by default.
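To make the policies above concrete, here is a minimal sketch of the cache behavior in Python (my implementation is Perl; the 3600-second TTC and all names are illustrative, and the CNAME-pointer bookkeeping from the second point is omitted for brevity):

```python
import time

MIN_TTL = 300        # TTL floor from the policy above; possibly too aggressive
TTC     = 3600       # "Time To Cache": illustrative grace period past expiry

class TableCache:
    """Sketch of an append-only address cache implementing the policies above."""

    def __init__(self):
        self.records = {}                 # name -> {address: expiry timestamp}

    def add(self, name, answers, now=None):
        """answers is a list of (address, ttl) from one resolution. At most ONE
        new address is appended; a known address just gets its TTL refreshed."""
        now = time.time() if now is None else now
        recs = self.records.setdefault(name, {})
        for addr, ttl in answers:
            if addr in recs:              # known (possibly expired) address:
                recs[addr] = now + max(ttl, MIN_TTL)
                return                    # refresh it and append nothing
        # otherwise keep only the answer with the longest TTL
        addr, ttl = max(answers, key=lambda a: a[1])
        recs[addr] = now + max(ttl, MIN_TTL)

    def active(self, name, now=None):
        """What user-agents are shown: only unexpired records."""
        now = time.time() if now is None else now
        return {a for a, exp in self.records.get(name, {}).items() if exp > now}

    def firewall_view(self, name, now=None):
        """What the firewall's table keeps: records honored until TTC past expiry."""
        now = time.time() if now is None else now
        return {a for a, exp in self.records.get(name, {}).items() if exp + TTC > now}
```

The key property is the split between `active()` and `firewall_view()`: user-agents only ever see unexpired records, while the firewall keeps honoring addresses for a TTC-length grace period after their TTL runs out.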


Eventually I could whitelist things like Office 365 and SharePoint practically and reliably. I was able to define firewall rules, routing rules and NAT policies based on tables conveniently named after groups of FQDNs containing wildcards. I am sure this sounds pretty underwhelming to anyone who has not tried to do it under a deny-by-default configuration. Even with the previous generation of SonicWall firewalls, you could set up rules based on FQDNs, and I had taken it for granted that this should work. When it didn't, I went down a rabbit hole. Ultimately I had to give up on the SonicWall anyway, because it doesn't support directly modifying its IP tables the way PF and IPFW do. I had been trying to update the SonicWall's tables by targeting it with more non-standard DNS behavior, and as well as that worked sometimes, it didn't work at all whenever the SonicWall didn't feel like cooperating. The effort became too much to put upon my clients at that point, so it became a side project. That's when I found out about tables in PF.
 
I must apologize for heavily editing these posts over the course of days.

Over the past year I've been putting my personal time into a related project, with the intent to circle back and finish this one. My current version is not enterprise-grade yet: it lacks failover and redundancy. This is an important point, because realtime cache synchronization between failover nodes would be required to maintain the integrity of the service. My working design is amenable to such an improvement, and it doesn't have to be complicated, but I can't say when I'll finally dig in.

However, this project at least serves as a solid proof of concept for the underlying logic. Despite being written in Perl, it is not too inefficient for production use; its working memory footprint is minimal, predictable and economical -- not a vulnerability. I avoided Perl's hashes and arrays in favor of packed scalars. Computational efficiency could be much improved, but because the parameters of the logic are well defined, its below-grade computational efficiency is not a potential vulnerability either. I would share the source code in its current form if asked, but otherwise I'll hold off until I finish some work on the HTTP frontend, which is currently broken.

That said, I'm sure other vulnerabilities are inherent to Perl itself. I will begin the switch to C by re-implementing the essential part as a C-backed Perl module, and then as a standalone executable.

I would like the idea to be picked up by an established project like Unbound, but maybe adding such non-standard options to their elegant program would complicate it too much. Perhaps it is best kept as a standalone proxy.
 
I am not an authority on the subject of DNS or firewalls.... I'm just a simple country boy... I guess it would be helpful of me to produce some kind of test suite so others could easily check for the issue on any combination of firewall and DNS cache. It could be fairly simple.
 