VLANs are in place, but reliability still depends on how you back up, patch, and recover the firewall that protects everything else. Virtual firewalls rely on the hypervisor, storage, and power stack simultaneously, so operations matter as much as design. This final part in the series walks through backup tiers, safe update sequencing, recovery playbooks, and the moment you should extract OPNsense from Proxmox onto dedicated hardware.
How this post flows
- Signals that tell you to revisit operations
- How to design the 3-2-1 backup tiers, storage locations, and realistic RTO/RPO
- How to stage backups, updates, and rollbacks in a predictable order
- How to turn failure scenarios into concrete recovery playbooks
- How to know when virtualization is no longer enough and dedicated hardware is warranted
Terms used
- Configuration backup: the encrypted `.bak` or XML export captured via `System > Configuration > Backups`. It includes certificates and VPN keys, so store the decryption key separately.
- Snapshot: a Proxmox point-in-time disk capture stored alongside the VM. Snapshots are incremental and disappear when the storage pool fails; they need an off-site backup companion.
- Cold standby: spare hardware kept powered off until a failure occurs. Add the standby boot time (5–10 minutes) to your RTO.
- RTO/RPO: Recovery Time Objective and Recovery Point Objective. Example: RTO 15 minutes / RPO 4 hours means “restore service in 15 minutes, tolerate up to 4 hours of data loss.”
- HA pair: two or more firewalls clustered with CARP/VRRP for high availability. Requires layer-2 adjacency and a dedicated sync link.
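The RPO definition above is easy to state and easy to violate silently. As a minimal sketch of how to make it measurable, the helper below (the `check_rpo` name, paths, and thresholds are all illustrative assumptions, not an OPNsense tool) warns when the newest config export in a directory is older than the RPO window:

```shell
#!/bin/sh
# Hedged sketch: warn when the newest config export exceeds the RPO window.
# check_rpo <dir> <rpo_hours> -- illustrative helper, adapt paths to taste.
check_rpo() {
  dir=$1; rpo_hours=$2
  # Newest .xml export by modification time; empty if none exist.
  newest=$(ls -t "$dir"/*.xml 2>/dev/null | head -n 1)
  [ -z "$newest" ] && { echo "RPO VIOLATED: no backups in $dir"; return 1; }
  now=$(date +%s)
  # GNU stat first, BSD stat as fallback.
  mtime=$(stat -c %Y "$newest" 2>/dev/null || stat -f %m "$newest")
  age=$(( (now - mtime) / 3600 ))
  if [ "$age" -ge "$rpo_hours" ]; then
    echo "RPO VIOLATED: $newest is ${age}h old (limit ${rpo_hours}h)"
    return 1
  fi
  echo "RPO OK: $newest is ${age}h old"
}

# Demo against a throwaway directory holding one fresh export.
demo=$(mktemp -d)
touch "$demo/config-backup.xml"
check_rpo "$demo" 4
```

Run from cron or a monitoring agent, a check like this turns the "RPO 4 hours" number from a wish into an alert.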
Reading card
- Estimated time: 17 minutes
- Prereqs: familiarity with the OPNsense backup menu and Proxmox snapshots, plus a Git/cloud target for storing files
- Outcome: you can define backup tiers, update order, recovery workflows, and a hardware migration plan.
Why revisit operations after VLAN work
The network layout may be segmented, but operations still hinge on one hypervisor. You need a plan if any of these are true:
- A single Proxmox host runs both application workloads and the firewall.
- Internet access is critical for remote work or exposed services.
- Firewall rules are complex enough that one mistake can block the entire network.
Define realistic RTO/RPO targets per failure scenario and build your playbooks before the outage happens.
Backup strategy: tiers and cadence
Follow the 3-2-1 rule (three copies, two media types, one off-site) by splitting your backups into three layers.
| Tier | Storage target | Cadence | Target RTO/RPO | Watch-outs |
|---|---|---|---|---|
| OPNsense configuration backup | Git/private cloud, encrypted `.bak` | Every firewall change + weekly | RTO 10 min / RPO 1 day | Store decryption keys separately; rehearse restores in a test VM |
| Proxmox VM backup (PBS/NAS) | Proxmox Backup Server, ZFS snapshots, external NAS | Daily (or every 4 hrs if change rate is high) | RTO 30 min / RPO 4–24 hrs | Snapshots alone don’t survive host/storage failure → off-site copy required |
| Runbooks & scripts | Git, wiki, private docs repository | Commit immediately after changes | Not applicable | Keep VLAN IDs, switch maps, and recovery commands in version control with access controls |
The tiers form a pipeline: each firewall change triggers a config export, scheduled jobs capture full VM state, and version-controlled runbooks document how to restore both.
Tip: XML backups alone don’t shrink RTO. Run a full restore rehearsal (PBS or NAS) at least monthly so you know exactly which prompts, keys, and passwords are needed during a crisis.
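A weekly export cadence also means the tier-1 directory fills with stale files unless something prunes it. A minimal retention sketch (the `prune_exports` name, paths, and keep-count are assumptions for illustration) that keeps only the newest N exports:

```shell
#!/bin/sh
# Hedged sketch: keep only the newest <keep> config exports in a directory.
# prune_exports <dir> <keep> -- illustrative helper, not an OPNsense tool.
prune_exports() {
  dir=$1; keep=$2
  # List exports newest-first, skip the first <keep>, delete the rest.
  ls -t "$dir"/*.xml 2>/dev/null | tail -n +"$((keep + 1))" |
  while IFS= read -r old; do
    echo "pruning $old"
    rm -f -- "$old"
  done
}

# Demo: five fake exports, keep the newest three.
demo=$(mktemp -d)
for i in 1 2 3 4 5; do
  touch "$demo/config-$i.xml"
done
prune_exports "$demo" 3
```

Pair this with the monthly restore rehearsal: pruning keeps the newest exports easy to find, and the rehearsal proves they actually restore.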
Update sequencing and rollback points
Always follow “backup → test → deploy.” Repeat the exact sequence every time so rollback muscle memory kicks in when something breaks.
- Capture snapshots/backups: create `qm snapshot <VMID> pre-opnsense-update` and confirm the PBS/NAS job succeeded within the last 24 hours.
- Patch the Proxmox host: run `apt update && apt full-upgrade`, reboot, and ensure console/IPMI access exists while the VM is down.
- Patch OPNsense: visit `System > Firmware > Updates`, apply minor releases before majors, and sync IDS/IPS (Suricata) rule sets separately.
- Verify: rerun the Part 5 validation (service/management/guest flows, VLAN isolation, VPN ingress). Enable rule logging so `Firewall > Live View` shows the results.
Roll back in two layers:
- Hypervisor level: keep the latest snapshot ready for `qm rollback`. Delete stale snapshots afterwards so performance doesn’t degrade.
- Firewall level: download the newest `.bak`, and keep SSH/console access handy to run `opnsense-backup restore` if the web UI dies.
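The sequence above only builds muscle memory if it is enforced, not just remembered. One way to encode it is a dry-run checklist script that refuses to proceed without evidence of a pre-update snapshot; the marker-file convention and function name below are assumptions, and the real `qm`/`apt` commands are only echoed, never executed:

```shell
#!/bin/sh
# Hedged sketch: the backup -> host -> firewall -> verify order as a guarded
# dry-run checklist. Marker-file convention is illustrative only.
run_update_sequence() {
  marker=$1   # file proving the pre-update snapshot/backup was taken
  if [ ! -f "$marker" ]; then
    echo "ABORT: no pre-update snapshot marker at $marker"
    return 1
  fi
  echo "1. qm snapshot <VMID> pre-opnsense-update   (already recorded)"
  echo "2. apt update && apt full-upgrade; reboot   (Proxmox host)"
  echo "3. System > Firmware > Updates              (minor before major)"
  echo "4. Re-run Part 5 validation                 (VLAN isolation, VPN ingress)"
}

# Demo: with a marker present, the ordered plan prints.
m=$(mktemp)
run_update_sequence "$m"
```

Extending the guard to also call the RPO check from the backup section keeps "I forgot the snapshot" out of the failure modes entirely.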
Recovery flows by failure type
Break incidents into three buckets and assign owners, tools, and success criteria to each.
- OPNsense-only failure: misconfiguration or failed update while the hypervisor and switch remain healthy.
- Proxmox host failure: hardware issue or kernel panic takes down the entire hypervisor (and therefore the firewall VM).
- Facility-wide or storage failure: PBS, NAS, or the power/UPS layer fails alongside the hypervisor.
Recovery checklist
| Failure type | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| OPNsense-only | Use the Proxmox console to reach OPNsense’s restore-configuration option | Select the newest `.bak`/XML and restore | Reboot and confirm VLAN interfaces under `Interfaces > Overview` |
| Proxmox host down | Boot a cold-standby mini PC with OPNsense ISO + USB NICs to maintain basic connectivity | Restore the latest PBS/NAS backup onto repaired hardware or the standby | Reassign switch trunk ports to the new NICs and retest policies |
| Facility-wide/storage failure | Stabilize power/UPS, fetch backups from off-site storage | Rebuild the minimum viable firewall first, then restore remaining services | Document actual RTO/RPO and gaps for the postmortem |
After each step, tick the corresponding item in your Runbook and log the actual time taken so you can adjust RTO/RPO assumptions later.
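Logging the actual time taken is easiest when the drill itself records it. A minimal sketch of a drill timer (the log path, field names, and 15-minute target are illustrative assumptions):

```shell
#!/bin/sh
# Hedged sketch: time a recovery drill and append the measured RTO to a
# runbook log for comparison against the target. Values are placeholders.
drill_log=$(mktemp)
target_min=15

start=$(date +%s)
# ... perform the restore steps for the scenario under test here ...
sleep 1   # stand-in for the actual drill
end=$(date +%s)

# Round elapsed seconds up to whole minutes.
elapsed_min=$(( (end - start + 59) / 60 ))
status=OK
[ "$elapsed_min" -gt "$target_min" ] && status=OVER_TARGET

printf '%s scenario=opnsense-only rto_min=%s status=%s\n' \
  "$(date -u +%FT%TZ)" "$elapsed_min" "$status" >> "$drill_log"
cat "$drill_log"
```

A few quarters of these log lines give you real data for adjusting the RTO/RPO assumptions instead of guessing.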
When dedicated hardware is the better answer
Consider moving OPNsense off Proxmox when any of the following apply. Virtual firewalls turn the hypervisor into a single point of failure, so add these checks to your quarterly reviews.
- RTO under five minutes: the firewall must stay up while Proxmox reboots, so eliminating VM boot time becomes mandatory.
- Throughput exceeds 1 Gbps: VirtIO/VMXNET3 plus IDS/IPS and TLS inspection hit CPU bottlenecks quickly.
- You need HA: CARP/VRRP demands two physical appliances; running both firewalls on the same hypervisor defeats the purpose.
- Security policy forbids shared hypervisors: auditors may require dedicated hardware per security zone.
- Boot-order dependencies exist: if Proxmox must boot before OPNsense and OPNsense must be up for Proxmox to be reachable, break the cycle with separate hardware.
| Condition | Stay virtualized | Move to dedicated hardware |
|---|---|---|
| Acceptable downtime ≥ 30 min | ✅ | |
| Acceptable downtime ≤ 5 min | | ✅ |
| Aggregate traffic ≤ 1 Gbps | ✅ | |
| IDS/IPS + TLS inspection > 2 Gbps | | ✅ |
| CAPEX/space limited | ✅ | |
| Separate audit/compliance requirements | | ✅ |
Common mistakes
- Backup ≠ restore test: exporting XML without ever restoring it leaves you blind during outages.
- Updating without snapshots: if the OPNsense patch fails, you’re stuck booting from ISO unless a snapshot exists.
- Pointing guest VLAN DNS to internal resolvers: guests can still enumerate internal hosts; use public DNS or split-horizon filtering.
- Same UPS for hypervisor and firewall: when that UPS trips, both layers go down. Give the firewall a separate UPS or cold standby.
- No plan for hardware migration: by the time you need dedicated gear, lead times delay the cutover.
Wrap-up
Production-ready network segmentation needs an operational backbone. Keep these guardrails close to your daily/weekly checklist.
- Separate backup tiers: keep OPNsense configs, Proxmox VM images, and Runbooks in different locations, and store encryption keys separately.
- Serial update workflow: obey the backup → Proxmox → OPNsense → test order every time, and prune old snapshots afterwards.
- Exercise the recovery playbooks: boot the cold standby, rehearse PBS restores, and run `opnsense-backup restore` on a test VM every quarter.
- Plan for dedicated hardware: track traffic, RTO/RPO, and compliance requirements so you know when virtualization stops meeting the bar.
Practice the Runbooks quarterly so you can recover under pressure instead of relearning how Proxmox and OPNsense interact when everything is already on fire.