A couple of suggestions based on what I have read of your predicament so far:
You're already familiar with jails, so keep the base OS as minimal as possible and do everything in jails. Doing this also makes it easier to test out new configurations (just snapshot a jail's filesystems and create a new one from them) and it confines any mess to a smaller environment. Generally I run one major service per jail (mx, www, db, etc.) on a local virtual network that's unrouted and filtered via pf, which provides isolation while still letting the services that need to interconnect do so (e.g. services that need to reach the DB can, those that don't can't, and the DB is absolutely unreachable from outside the network).
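As a rough illustration of that filtering (interface names, subnet and addresses below are made up; adjust them to your own layout), the pf side can be as simple as a default deny on the jail network plus explicit passes for the few flows you want:
Code:
ext_if   = "em0"          # external interface (hypothetical)
jail_if  = "lo1"          # cloned interface carrying the jail network
jail_net = "10.23.0.0/24" # unrouted jail subnet (hypothetical)
www      = "10.23.0.10"
db       = "10.23.0.20"

# The jail network is never reachable from, or routable to, the outside
block quick on $ext_if from $jail_net to any
block quick on $ext_if from any to $jail_net

# Default deny between jails, then open only the paths you need
block on $jail_if from $jail_net to $jail_net
pass  on $jail_if proto tcp from $www to $db port 5432   # www may reach the DB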
As others have said, use pkg except where you need something custom. It's faster and less error-prone.
Build custom packages (or get to the latest version faster, e.g. for security fixes) using poudriere rather than portmaster. It builds each port in a separate jail (and thus a clean environment), which helps with inter-package dependencies as well as general sanity (your non-build environments aren't cluttered with build dependencies, for instance). It also outputs packages, so you can roll back to an earlier version simply by doing a pkg install of the older version. The following script will work (or you can use a build environment such as Jenkins or GoCD); after it is a sketch of pointing pkg at poudriere's package output.
Code:
#!/bin/sh
if [ $# -ne 1 ]; then
    echo "USAGE: $0 port_name" >&2
    exit 1
fi
# Update the ports tree so we have the latest version before each build
sudo poudriere ports -uv
# Build the port ($1 is the port origin, e.g. www/nginx)
# Replace $poudriere_jail_name with the name of your Poudriere jail (or assign the variable further up)
sudo poudriere bulk -j "$poudriere_jail_name" "$1"
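To actually consume what poudriere builds, point pkg at its package directory. A minimal sketch, assuming a build jail named 12amd64 and a ports tree named default (both hypothetical; the path is poudriere's default output location):
Code:
# /usr/local/etc/pkg/repos/poudriere.conf
poudriere: {
    url: "file:///usr/local/poudriere/data/packages/12amd64-default",
    enabled: yes,
    priority: 100
}
If other machines or jails need the packages, serve that directory over HTTP and use an http:// url instead of the file:// one.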
Force cores to be dumped to a known location. This way they won't clutter the disk (by landing in whatever the process's current directory happens to be) and won't fill your disk (if you put a quota on the dataset, which I recommend).
Code:
zfs create zroot/var/coredumps
# Change 2G to whatever size is appropriate to your setup
zfs set quota=2G zroot/var/coredumps
# Make root the only one to access core dumps
chmod 0700 /var/coredumps
cat >> /etc/sysctl.conf <<EOF
# Store all cores in /var/coredumps/
# See core(5) for details of variables
kern.corefile=/var/coredumps/%H_%N.%P.%U.%I
# Compress cores
kern.compress_user_cores=1
EOF
# Reload /etc/sysctl.conf
sysctl -f /etc/sysctl.conf
PHP is a security nightmare. Make sure you are on the security mailing lists for PHP itself as well as for any and all software that uses it. I recommend removing PHP from your environment if practical, especially for security-conscious services such as medical records. Each instance of PHP software should go in its own jail.
As VladiBG said, you can get to root via "su". You can also do so with "sudo -s", which has you typing your own password instead of root's; this is generally considered preferable. It is also good practice to do all root operations via sudo so that you don't accidentally do something destructive in a root shell (lots of mistakes leading to a reinstall happen this way).
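A minimal sketch of the sudo side, assuming sudo was installed from packages (so the file lives under /usr/local/etc) and your admin users are in the wheel group; edit it with visudo rather than directly:
Code:
# /usr/local/etc/sudoers (edit with visudo)
# Let members of wheel run any command as root, authenticating with their own password
%wheel ALL=(ALL) ALL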
This is not practical or financially viable. You need to work out how much downtime you're willing to accept and then design around that requirement. The closer you get to 100%, the more money you'll throw at it (going from 99.9% to 99.99% is often an order of magnitude difference in cost). The SLA (service level agreement) you provide to customers will involve financial penalties for too much unscheduled downtime, so you really need to be aware of what is practical for your configuration. You'll also need monitoring to tell you when things are down and when you're getting close to your SLAs (you should have SLOs, service level objectives, which are tighter than your SLAs so that you fix things before your customers' lawyers start scrutinizing the SLAs).
A reasonable production SLA is 99.9% uptime, which is approximately 43 minutes of downtime per month. Depending on what your product is and what your customers' hours are, you may be able to build in service windows or limit the SLA to business hours; this will give you more time for maintenance. A good architecture will also help with making changes without maintenance windows.
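For a feel of the numbers (simple arithmetic on a 30-day month of 43,200 minutes):
Code:
# Allowed downtime per month at a given availability target
# 99.9%  -> 43200 * 0.001  = 43.2 minutes/month (~8.8 hours/year)
# 99.99% -> 43200 * 0.0001 = 4.32 minutes/month (~53 minutes/year)
echo "scale=2; 43200 * (1 - 0.999)" | bc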
Remember that availability is not just the machines; it's everything between you and the customer. You'll need highly available local networking, redundant ISP connections, etc.
Your ZFS architecture depends on what your reliability and throughput/latency requirements are. I generally recommend putting the OS on a separate pair of mirrored disks and everything else on the remaining disks. Unfortunately, that only leaves you with 4 disks, so your choices are (example zpool commands follow the list):
- Mirrored stripes (RAID 1+0, i.e. a stripe across two mirrored pairs). You can safely lose 2 disks, one from each mirror, without losing data; lose both disks in the same mirror and you lose data. The storage you'll be left with is 50% of your disk size. Throughput is roughly 4x the individual disk throughput for reads and 2x for writes (these throughput numbers, as with those below, really depend on your workload, so they are approximations and could be off by quite a bit).
- RAIDZ (3 data + 1 parity). You can safely lose any 1 disk without losing data. The storage space you'll have is 75% of your disk size. Throughput is approximately 3x your individual disk throughput for reads and writes. Due to the size of modern disks and the chance of losing a second disk during a rebuild, it is not recommended for non-HA configurations. Reads and writes are slowed down by having to compute parity.
- RAIDZ2 (2 data + 2 parity). You can safely lose any 2 disks without losing data. The storage you'll be left with is 50% of your disk size. Throughput is approximately 2x the individual disk throughput for reads and writes. Reads and writes are slowed down by having to compute parity.
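For reference, the corresponding pool layouts look roughly like this (pool and disk names are hypothetical; substitute your own):
Code:
# RAID 1+0: a stripe across two mirrored pairs
zpool create data mirror da0 da1 mirror da2 da3
# RAIDZ (single parity) across all four disks
zpool create data raidz1 da0 da1 da2 da3
# RAIDZ2 (double parity) across all four disks
zpool create data raidz2 da0 da1 da2 da3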
Personally, for extremely sensitive data, I go with RAID 1+0 using three-disk mirrors. I have had double failures within a single mirror before, where the three-way replication saved me. Those are typically larger installations using tens or hundreds of disks, though, and the hardware you have won't work for this.
The reasoning for putting the OS on separate disks is that the workloads and requirements are different. For example, you might need to encrypt your data disks (for medical data, you definitely do), but you can't boot off encrypted ZFS or GEOM. Native ZFS encryption is only available in HEAD (what will become FreeBSD 13).
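A minimal GELI sketch for the data disks (disk names are hypothetical; you'd repeat the init/attach for each data disk, and whether you use passphrases or key files depends on your threat model and how hands-off reboots need to be):
Code:
# Encrypt a data disk before it ever joins the pool (AES-XTS, 4K sectors)
geli init -s 4096 da0
geli attach da0          # prompts for the passphrase, creates /dev/da0.eli
# ...repeat for the other data disks, then build the pool on the .eli devices
zpool create data mirror da0.eli da1.eli mirror da2.eli da3.eli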
New hotness; all the cool kids are doing it. Docker is like a less secure version of jails, but it's great for standing up demo environments really fast. Production is a whole different ball game, with Docker containers being the script kiddie's wet dream.
Other things that have not yet come up but which you definitely want to think about (I'm guessing you might not have considered some or all of these, based on your assumption of 100% uptime):
- Monitoring. Everything needs to be monitored both internally and externally, and someone needs to be oncall (i.e. pageable) at all times. Operationally, you need two people minimum (and more if possible), as 24/7 oncall leads to burnout and people need to be able to take holidays. A reasonable number of people for proper production oncall is six per timezone across two timezones (so 12 total, preferably in two different economic zones such as the USA and EU), although most places get by with fewer staff and take a hit on project work and burnout rate. Oncall staff will also need to be paid an oncall bonus, because they're effectively working 24 hours a day for their oncall period. You'll skip a lot of this in the early days as you just struggle to get your operation going, but keep it in mind as something that needs fixing sooner rather than later. Prometheus with Grafana makes a good local monitoring starting point; Pingdom is a good starting point for remote monitoring. Remember that monitoring too much is as bad as monitoring too little: it's easy to get swamped by things which are not impacting your customers and so aren't actually relevant.
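As a starting-point sketch for the Prometheus side (job and host names are made up; assumes node_exporter running on each host at its default port 9100):
Code:
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "web1.example.internal:9100"
          - "db1.example.internal:9100"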
- Configuration management. Is every change checked into a source code repository and then automatically deployed? How are changes deployed to production so that all of your machines providing a given sub-service all look identical? Which configuration management tool will you use: Ansible, SaltStack, Puppet or Chef? How do you upgrade your production software? Can you do it live or do you need to do it during a maintenance window? If you're aiming for live, are you using blue/green or configuration flags or some other process? How do you name your machines/jails? Are those names visible in DNS and how do they get there? Do you need service discovery for your components?
- Local scripts and software: /opt/? /usr/local/? Packaged or deployed via configuration management copying? I recommend running your own package repo and packaging all local software/scripts so they deploy the same way as everything else.
- Backups. All the data you store needs to be backed up securely offsite. As it's healthcare data, it needs to be backed up in a HIPAA-compliant way. I recommend restic to your favorite cloud provider, as it encrypts and de-duplicates. I use Backblaze B2, but they aren't HIPAA compliant, so you'll likely want SpiderOak, Carbonite or someone else willing to sign a BAA.
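A restic sketch, assuming an offsite host reachable over SSH (repository path, hostname and backed-up directory are made up; for a cloud provider you'd swap the -r argument for their backend):
Code:
# One-time repository setup (you'll be asked to set the repository password)
restic -r sftp:backup@offsite.example.com:/srv/restic/medrec init
# Regular encrypted, de-duplicated backup of the data you care about
restic -r sftp:backup@offsite.example.com:/srv/restic/medrec backup /var/db
# Periodically verify the repository
restic -r sftp:backup@offsite.example.com:/srv/restic/medrec check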
- High availability. What can fail without causing an outage? How quickly can you fail over if a critical component goes down (e.g. motherboard failure or a datacenter power outage)? Is that automated or not? How is your customer data (most likely, your database) replicated to its partners?
- Performance and scalability. How many customers of what size can your hardware support? How do you tell when you're hitting a limit (e.g. if your average response time slows down or disks are filling up)? How do you expand that? It sounds like you're starting with everything on a single tier. How are you going to split things up to be multiple tiers when the applications no longer fit on a single machine? How do you scale each tier?
- SLAs, SLOs and SLIs. As mentioned above, you'll need to settle on a service level that's appropriate to your customers. You'll want internal objectives that are tighter than the SLA, plus SLIs (service level indicators) that tell you about failures in the service's critical path. If you have to go to backups to recover data (e.g. during a data corruption event), how long does it take to restore and get your system running again (this is the mean time to recover, or MTTR)?
- Security and privacy. How are you going to ensure the security of your service? Are you HIPAA compliant? How about COPA, GDPR and COPPA? Are you part of the financial chain and also need to get SOC 2 or SSAE 18?