How do you manage your services during upgrades/updates (while the server is offline)?

freebuser · Oct 18, 2021

I operate my own server with unbound, website and email servers within separate jails.

Just wondering what others do when it is time to upgrade the services or the host OS,
for example if you have to take your system offline for an hour or so, or even a day or two.

Do you have a backup site saying the website is under maintenance? What about DNS and email servers?
Or do you have an RPi serving temporary files until the server is back online?

D-FENS · Oct 18, 2021

* I take the system offline and do the chores because we are a small team and it's enough to just announce it.
* If many people will be affected, one could use some kind of High Availability scheme. For example, you could build an updated server offline and then replicate your data to it (for example, via zfs send/receive). After that first transfer the two hosts will differ by however many minutes (or hours) the transfer took. You take your original host offline for a couple of minutes and zfs send/receive again to transfer the rest of the data (it takes much less time the second time). Then you switch the machines and bring the site online again. Total downtime should be a couple of minutes. The switching could be done via a DNS entry.
* Another solution could be load balancing. It is more complicated to configure but you could bring chosen hosts offline and update them without downtime.

P.S. And when I write "hosts" I mean jails too.
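
A minimal sketch of the two-pass replication described above, written in Python so it only builds the command strings and does not run anything. The dataset name, host name and snapshot names are made up for illustration, and it assumes the new host is reachable over ssh and that the old dataset is replicated recursively.

```python
def replication_plan(dataset, target_host, first_snap="pass1", final_snap="pass2"):
    """Return the shell commands for a two-pass zfs send/receive migration.

    All names are hypothetical; review the commands before running them.
    """
    return [
        # Pass 1: full copy while the old host is still serving traffic.
        f"zfs snapshot -r {dataset}@{first_snap}",
        f"zfs send -R {dataset}@{first_snap} | ssh {target_host} zfs receive -F {dataset}",
        # Downtime starts here: stop the services so the data stops changing.
        f"zfs snapshot -r {dataset}@{final_snap}",
        # Pass 2: incremental stream, only what changed since pass 1.
        f"zfs send -R -i @{first_snap} {dataset}@{final_snap}"
        f" | ssh {target_host} zfs receive -F {dataset}",
    ]


if __name__ == "__main__":
    for command in replication_plan("zroot/jails/www", "newhost"):
        print(command)
```

The second pass is what keeps the downtime short: the bulk of the data has already been transferred before the services are stopped.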

SirDice · Oct 18, 2021

Simple websites like my own will just be offline for a couple of minutes. Servers I maintain for a client are all doubled up, and several webservers are balanced using HAProxy. So nobody is going to notice if I take something offline.

freebuser · Oct 18, 2021

roccobaroccoSC said:
* I take the system offline and do the chores because we are a small team and it's enough to just announce it.
* If many people will be affected, one could use some kind of High Availability scheme. For example, you could build an updated server offline and then replicate your data to it (for example, via zfs send/receive). After that first transfer the two hosts will differ by however many minutes (or hours) the transfer took. You take your original host offline for a couple of minutes and zfs send/receive again to transfer the rest of the data (it takes much less time the second time). Then you switch the machines and bring the site online again. Total downtime should be a couple of minutes. The switching could be done via a DNS entry.
* Another solution could be load balancing. It is more complicated to configure but you could bring chosen hosts offline and update them without downtime.

P.S. And when I write "hosts" I mean jails too.

It's more the public facing website and services I am worried (for a small business).
Internal network is manageable (other than DNS - which I need to figure something out).
All files are in SVN so the staff can still work on local files so no problems there either.

But I am worried that if someone checks our website while it is down, we will likely ~~loose~~ lose a client or two.
And how are you managing email servers: a secondary MX record? And then how do the emails get moved to the local server?

SirDice said:
Simple websites like my own will just be offline for a couple of minutes. Servers I maintain for a client are all doubled up, and several webservers are balanced using HAProxy. So nobody is going to notice if I take something offline.

Never heard of (or didn't pay much attention to) HAProxy before. You learn something new every day.

drhowarddrfine · Oct 18, 2021

Lose is spelled l-o-s-e

Some DNS providers, like namecheap, allow you to redirect to their own page for such things.

SirDice · Oct 18, 2021

freebuser said:
Never heard of (or didn't pay much attention to) HAProxy before. You learn something new every day.

It's a load-balancer for webservers (although you can load-balance other services with it too). Quite useful. We've set this up with two HAProxy servers using CARP (so I can take one of them offline without interrupting the web sites). And we have about 6 webservers running in the backend. With HAProxy I can just take one of the webservers offline, update it, test it, then put it back in the pool and move onto the next webserver. That way I can update/upgrade everything without interfering with the web sites.
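
The update loop described above can be sketched roughly as follows. This is only a Python model of the procedure, not HAProxy configuration: the server names and the `update` and `is_healthy` callbacks are placeholders for whatever you actually use to drain, upgrade and test a backend.

```python
def rolling_update(pool, update, is_healthy):
    """Update the servers in `pool` one at a time.

    Returns (results, in_rotation). Stops at the first server that fails
    its health check and leaves it out of rotation for a human to look at.
    """
    if len(pool) < 2:
        raise ValueError("need at least two servers to update without downtime")
    in_rotation = set(pool)
    results = []
    for server in pool:
        in_rotation.discard(server)   # drain: the balancer stops sending traffic here
        update(server)                # e.g. upgrade packages, reboot
        if not is_healthy(server):    # test before trusting it with users again
            results.append((server, "failed"))
            break
        in_rotation.add(server)       # back in the pool, move on to the next one
        results.append((server, "updated"))
    return results, in_rotation
```

Stopping at the first failure is a design choice in this sketch: it means a bad update takes out at most one backend while the rest keep serving.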

freebuser said:
what about DNS and email servers.

DNS servers can be 'doubled' using CARP, then you can just take one offline without missing anything (you don't want to rely on the second server in /etc/resolv.conf, it only gets accessed after a request to the first DNS server has timed out; causing an annoying delay for your users). Email servers are easy: when a mailserver is offline, the sending servers queue the mail and keep retrying, typically for up to 5 days. So it is really not a problem to take one down for a couple of minutes. But you should have two MX records anyway, so mail will automatically get sent to the secondary server.

covacat · Oct 18, 2021

i upgrade during off-peak hours (weekend/late night)
for a small biz it's not worth the effort to have double everything, which creates more overhead all the time / split brain scenarios etc
also a secondary mx creates a spam vector if the secondary mx does not have a means to verify the final recipient is valid

D-FENS · Oct 19, 2021

freebuser said:
And how are you managing email servers: a secondary MX record? And then how do the emails get moved to the local server?

I have not managed a mail server per se, and as I wrote, I don't have experience in HA.

But managing mail servers should in principle be the same as managing any other service. You could swap DNS and you could use load balancing.
If you need to update the host where your data is stored, then you need some way of hotswapping and data replication. Honestly, I don't know how people do that. I can imagine it could be done with a minimal downtime by using ZFS send/receive and taking the system offline for a couple of minutes for example around 2 a.m.

sko · Oct 19, 2021

covacat said:
also a secondary mx creates a spam vector if the secondary mx does not have a means to verify the final recipient is valid

A secondary MX can double as a very nice spamtrap - no RFC-compliant mailserver will try the secondary MX first, so if your primary is up you can safely blacklist anything that first tries to contact your secondary MX. Of course you have to make sure to reliably monitor if your primary is actually reachable.
To sync those PF table entries, one can nicely abuse OpenBGPd and some small scripts to periodically sync between BGP entries and PF tables.
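
The spamtrap rule itself is simple enough to sketch in a few lines of Python. This is only the decision logic, not the PF/OpenBGPd plumbing; the host names, IP addresses and the idea of keeping a set of "clients recently seen at the primary" are assumptions for illustration.

```python
def delivery_order(mx_records):
    """Hosts in the order a compliant sender tries them: lowest preference first."""
    return [host for _preference, host in sorted(mx_records)]


def should_blacklist(client_ip, seen_at_primary, primary_reachable):
    """Decide whether a client that just connected to the secondary MX is a trap hit.

    seen_at_primary: IPs that recently contacted the primary MX (hypothetical
    bookkeeping, e.g. parsed from the primary's logs).
    """
    if not primary_reachable:
        # Legitimate senders fall back to the secondary when the primary is down.
        return False
    return client_ip not in seen_at_primary
```

The `primary_reachable` check is the monitoring caveat from the post: without it, every legitimate sender would be blacklisted during an outage of the primary.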

BOT:
When running everything in jails you can easily clone the jail, upgrade everything and then switch over to the new jail. For the host just run the newly created BE in a jail, upgrade, then reboot into the new BE.
This is only feasible for kernel/base system upgrades and major release upgrades - for simple package updates all of this is IMHO not needed as the services are simply restarted, so the "downtime" is usually less than one second...

As for DNS: you usually run a secondary resolver anyways, so no need for a complex HA setup - the clients will take care of switching to another DNS themselves if they can't reach the primary...

And for important systems like hypervisor hosts or routers: I have a weekly maintenance window where I can reboot or take down such systems.
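
For anyone who has not used these before, here is a rough Python sketch that just lists the commands for the two approaches above (cloning a jail's dataset, and upgrading the host through a new BE). It builds strings and runs nothing. The dataset and BE names are invented, it assumes each jail lives on its own ZFS dataset, and the actual upgrade step inside the jail is left out because it depends on how you upgrade.

```python
def jail_clone_plan(jail_dataset, snap="pre-upgrade", suffix="-new"):
    """Commands to clone a jail's dataset so the copy can be upgraded separately."""
    return [
        f"zfs snapshot {jail_dataset}@{snap}",
        f"zfs clone {jail_dataset}@{snap} {jail_dataset}{suffix}",
        # Upgrade the clone, then point the jail configuration at it and restart.
    ]


def boot_environment_plan(be_name):
    """Commands to upgrade the host inside a new boot environment (BE)."""
    return [
        f"bectl create {be_name}",    # new BE, cloned from the running one
        f"bectl jail {be_name}",      # shell inside the new BE; run the upgrade there
        f"bectl activate {be_name}",  # boot into the new BE on next reboot
        "shutdown -r now",            # the only real downtime: one reboot
    ]
```

If the new BE turns out to be broken, the old one is still there and can be activated again.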

Jose · Oct 19, 2021

covacat said:
...also a secondary mx creates a spam vector if the secondary mx does not have a means to verify the final recipient is valid

Works fine for me. I get a handful of "no valid recipients" messages a month. Worth it for the availability and redundancy it adds.

Jose · Oct 19, 2021

roccobaroccoSC said:
If you need to update the host where your data is stored, then you need some way of hotswapping and data replication. Honestly, I don't know how people do that. I can imagine it could be done with a minimal downtime by using ZFS send/receive and taking the system offline for a couple of minutes for example around 2 a.m.

You're getting into enterprisy territory here. You can do this at the application level, e.g.:

PostgreSQL documentation, section 26.5, Hot Standby (www.postgresql.org)

I dunno if SANs are still a thing, and you'll need a cluster-aware filesystem to use one anyway. There are some recent discussions about clustered filesystems around.

covacat · Oct 19, 2021

Jose said:
Works fine for me. I get a handful of "no valid recipients" messages a month. Worth it for the availability and redundancy it adds.

i was thinking of the scenario where the 2nd MX keeps the mail in the queue until the primary MX comes up and has no information about which recipients in your domain are valid and which are not
the spammer sends mail to your 2nd MX for a non-existent account with a forged From: (which is the target of the spam)
when the mail arrives at your primary MX it is bounced to the forged From: and the spam is delivered

this also works when the primary MX is up because delivery 2nd MX => 1st MX is async
there are workarounds like milter-ahead and probably others
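
a toy python model of the difference, just to make the scenario concrete. the addresses are made up and `VALID` stands in for whatever recipient list or lookup the 2nd MX would need; this is not how milter-ahead or any real MTA is implemented.

```python
VALID = {"alice@example.com"}  # hypothetical list of real mailboxes


def secondary_mx(rcpt, sender, verify_recipients):
    """Return (smtp_reply, bounce_target) for a message offered to the 2nd MX."""
    if rcpt in VALID:
        return "250 queued", None
    if verify_recipients:
        # Rejected during the SMTP session: we never own the message,
        # so we never generate a bounce.
        return "550 rejected", None
    # Accepted blindly; the primary refuses it later and a bounce
    # goes to whatever address was in From:, forged or not.
    return "250 queued", sender
```

with `verify_recipients=False` the forged sender ends up as the bounce target, which is exactly the backscatter described above.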

Jose · Oct 19, 2021

Your analysis is correct. My primary MX is on a virtual host that has no idea which recipient addresses are valid, and thus forwards mail with bogus recipients to my main MX. This main MX, which is behind a firewall at my house, then sends failed delivery notifications to the postmaster. That's me.

Forged "from" reflection attacks can be defeated with SPF records, which I publish for all my domains.

gpw928 · Oct 19, 2021

SirDice said:
DNS servers can be 'doubled' using CARP, then you can just take one offline without missing anything (you don't want to rely on the second server in /etc/resolv.conf, it only gets accessed after a request to the first DNS server has timed out; causing an annoying delay for your users).

When the "primary" name server goes offline, the repeated timeouts before contacting the "secondary" name server can add up to something quite horrific (applications become "unresponsive").

Since the regular maintenance and patching cycle is going to take name servers offline on a routine basis, a solution is definitely required.

CARP is one good way of doing it, especially if you have full control of the network infrastructure required to configure it (you need a hub, or specially configured switching).

Another approach, that does not require network changes, is to install a small caching name server on every "application" host. dns/dnsmasq fits the requirements (however negative caching should be disabled). From the perspective of each application host, the name server is local (127.0.0.1) -- and thus always responding. There could be a timeout if an upstream query was required (and the "primary" name server was offline), but the cache prevents any magnification of the problem. I have used this approach in large networks, and it works well.
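
As a back-of-the-envelope illustration of how the timeouts add up and how a local cache limits the damage, here is a small Python model. It assumes a resolver timeout of 5 seconds per query to the dead "primary" (a common default, but check your own resolver settings) and one wasted timeout per uncached lookup; both are simplifications.

```python
def wasted_seconds(lookups, cache_hits=0, timeout=5.0):
    """Time lost waiting on an offline first name server.

    lookups:    name lookups an application performs
    cache_hits: how many of those a local caching resolver answers itself
    timeout:    seconds spent on the dead server before trying the next one
    """
    if not 0 <= cache_hits <= lookups:
        raise ValueError("cache_hits must be between 0 and lookups")
    return (lookups - cache_hits) * timeout
```

For example, an application doing 20 lookups with no cache would wait `wasted_seconds(20)` seconds in total, while the same application behind a local cache that answers 18 of them only waits `wasted_seconds(20, cache_hits=18)`.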

Jose · Oct 20, 2021

gpw928 said:
CARP is one good way of doing it, especially if you have full control of the network infrastructure required to configure it (you need a hub, or specially configured switching).

No special configuration needed. Both IP addresses have to be on the same VLAN if you have a fancy switch. CARP will work just fine with no config on cheap switches.

Are you thinking of the old arpbalance option? That's been gone from the FreeBSD CARP implementation for a while. It's gone from the OpenBSD implementation too. They replaced it with the carpnodes thing that does require a hub or tricks at the switch:

carp(4) - OpenBSD manual pages

There's no active-active CARP story on FreeBSD, sadly.

gpw928 · Oct 20, 2021

Jose said:
No special configuration needed.

Ok. It certainly used to need special setup on the network switches (or dig out a hub from somewhere) -- but it's been a while since I looked into it.

drhowarddrfine · Oct 20, 2021

sko said:
BOT:
newly created BE
a complex HA setup

Klaatu barada nikto

freebuser · Oct 20, 2021

sko said:
BOT:
When running everything in jails you can easily clone the jail, upgrade everything and then switch over to the new jail. For the host just run the newly created BE in a jail, upgrade, then reboot into the new BE.
This is only feasible for kernel/base system upgrades and major release upgrades - for simple package updates all of this is IMHO not needed as the services are simply restarted, so the "downtime" is usually less than one second...

I never thought of cloning the jail. Thanks for the insight.

I have recently started moving my services into vnet jails. Still a bit more to go.

Never tried BE (stands for Boot Environment?).
Any good guides?

drhowarddrfine said:
Klaatu barada nikto

Just noticed, thanks for pointing it out. I'm not technically advanced either.

kjpetrie · Oct 20, 2021

I have a VM on my desktop which can run my websites, so I just synchronise that and point my router's port forwarding to the desktop, which forwards http, https, and smtp ports. The PC's internal firewall then sends those requests to the VM.

sko · Oct 21, 2021

kjpetrie said:
I have a VM on my desktop which can run my websites, so I just synchronise that and point my router's port forwarding to the desktop, which forwards http, https, and smtp ports. The PC's internal firewall then sends those requests to the VM.

If this is even remotely possible in your company's network you should just nuke it from orbit...

kjpetrie · Oct 21, 2021

Why? It allows me to keep my sites online while I upgrade my server. My "company" is just me.