VLANs are in place, but reliability still depends on how you back up, patch, and recover the firewall that protects everything else. Virtual firewalls rely on the hypervisor, storage, and power stack simultaneously, so operations matter as much as design. This final part in the series walks through backup tiers, safe update sequencing, recovery playbooks, and the moment you should extract OPNsense from Proxmox onto dedicated hardware.
How this post flows
- Signals that tell you to revisit operations
- How to design the 3-2-1 backup tiers, storage locations, and realistic RTO/RPO
- How to stage backups, updates, and rollbacks in a predictable order
- How to turn failure scenarios into concrete recovery playbooks
- How to know when virtualization is no longer enough and dedicated hardware is warranted
Terms used
- Configuration backup: the encrypted `.bak` or XML export captured via `System > Configuration > Backups`. It includes certificates and VPN keys, so store the decryption key separately.
- Snapshot: a Proxmox point-in-time disk capture stored alongside the VM. Snapshots are incremental and disappear when the storage pool fails; they need an off-site backup companion.
- Cold standby: spare hardware kept powered off until a failure occurs. Add the standby boot time (5–10 minutes) to your RTO.
- RTO/RPO: Recovery Time Objective and Recovery Point Objective. Example: RTO 15 minutes / RPO 4 hours means “restore service in 15 minutes, tolerate up to 4 hours of data loss.”
- HA pair: two or more firewalls clustered with CARP/VRRP for high availability. Requires layer-2 adjacency and a dedicated sync link.
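The RPO definition above is easy to state and easy to violate silently. As a minimal sketch of how to make it measurable, the helper below (the `check_rpo` name, paths, and thresholds are all illustrative assumptions, not an OPNsense tool) warns when the newest config export in a directory is older than the RPO window:

```shell
#!/bin/sh
# Hedged sketch: warn when the newest config export exceeds the RPO window.
# check_rpo <dir> <rpo_hours> -- illustrative helper, adapt paths to taste.
check_rpo() {
  dir=$1; rpo_hours=$2
  # Newest .xml export by modification time; empty if none exist.
  newest=$(ls -t "$dir"/*.xml 2>/dev/null | head -n 1)
  [ -z "$newest" ] && { echo "RPO VIOLATED: no backups in $dir"; return 1; }
  now=$(date +%s)
  # GNU stat first, BSD stat as fallback.
  mtime=$(stat -c %Y "$newest" 2>/dev/null || stat -f %m "$newest")
  age=$(( (now - mtime) / 3600 ))
  if [ "$age" -ge "$rpo_hours" ]; then
    echo "RPO VIOLATED: $newest is ${age}h old (limit ${rpo_hours}h)"
    return 1
  fi
  echo "RPO OK: $newest is ${age}h old"
}

# Demo against a throwaway directory holding one fresh export.
demo=$(mktemp -d)
touch "$demo/config-backup.xml"
check_rpo "$demo" 4
```

Run from cron or a monitoring agent, a check like this turns the "RPO 4 hours" number from a wish into an alert.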
Reading card
- Estimated time: 17 minutes
- Prereqs: familiarity with the OPNsense backup menu and Proxmox snapshots, plus a Git/cloud target for storing files
- Outcome: you can define backup tiers, update order, recovery workflows, and a hardware migration plan.
Why revisit operations after VLAN work
The network layout may be segmented, but operations still hinge on one hypervisor. You need a plan if any of these are true:
- A single Proxmox host runs both application workloads and the firewall.
- Internet access is critical for remote work or exposed services.
- Firewall rules are complex enough that one mistake can block the entire network.
Define realistic RTO/RPO targets per failure scenario and build your playbooks before the outage happens.
Backup strategy: tiers and cadence
Follow the 3-2-1 rule (three copies, two media types, one off-site) by splitting your backups into three layers.
| Tier | Storage target | Cadence | Target RTO/RPO | Watch-outs |
|---|---|---|---|---|
| OPNsense configuration backup | Git/private cloud, encrypted `.bak` | Every firewall change + weekly | RTO 10 min / RPO 1 day | Store decryption keys separately; rehearse restores in a test VM |
| Proxmox VM backup (PBS/NAS) | Proxmox Backup Server, ZFS snapshots, external NAS | Daily (or every 4 hrs if change rate is high) | RTO 30 min / RPO 4–24 hrs | Snapshots alone don’t survive host/storage failure → off-site copy required |
| Runbooks & scripts | Git, wiki, private docs repository | Commit immediately after changes | Not applicable | Keep VLAN IDs, switch maps, and recovery commands in version control with access controls |
The tiers form a pipeline: each firewall change triggers a config export, scheduled jobs capture full VM state, and version-controlled runbooks document how to restore both.
Tip: XML backups alone don’t shrink RTO. Run a full restore rehearsal (PBS or NAS) at least monthly so you know exactly which prompts, keys, and passwords are needed during a crisis.
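A weekly export cadence also means the tier-1 directory fills with stale files unless something prunes it. A minimal retention sketch (the `prune_exports` name, paths, and keep-count are assumptions for illustration) that keeps only the newest N exports:

```shell
#!/bin/sh
# Hedged sketch: keep only the newest <keep> config exports in a directory.
# prune_exports <dir> <keep> -- illustrative helper, not an OPNsense tool.
prune_exports() {
  dir=$1; keep=$2
  # List exports newest-first, skip the first <keep>, delete the rest.
  ls -t "$dir"/*.xml 2>/dev/null | tail -n +"$((keep + 1))" |
  while IFS= read -r old; do
    echo "pruning $old"
    rm -f -- "$old"
  done
}

# Demo: five fake exports, keep the newest three.
demo=$(mktemp -d)
for i in 1 2 3 4 5; do
  touch "$demo/config-$i.xml"
done
prune_exports "$demo" 3
```

Pair this with the monthly restore rehearsal: pruning keeps the newest exports easy to find, and the rehearsal proves they actually restore.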
Update sequencing and rollback points
Always follow “backup → test → deploy.” Repeat the exact sequence every time so rollback muscle memory kicks in when something breaks.
- Capture snapshots/backups: create `qm snapshot <VMID> pre-opnsense-update` and confirm the PBS/NAS job succeeded within the last 24 hours.
- Patch the Proxmox host: run `apt update && apt full-upgrade`, reboot, and ensure console/IPMI access exists while the VM is down.
- Patch OPNsense: visit `System > Firmware > Updates`, apply minor releases before majors, and sync IDS/IPS (Suricata) rule sets separately.
- Verify: rerun the Part 5 validation (service/management/guest flows, VLAN isolation, VPN ingress). Enable rule logging so `Firewall > Live View` shows the results.
Roll back in two layers:
- Hypervisor level: keep the latest snapshot ready for `qm rollback`. Delete stale snapshots afterwards so performance doesn’t degrade.
- Firewall level: download the newest `.bak`, and keep SSH/console access handy to run `opnsense-backup restore` if the web UI dies.
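The sequence above only builds muscle memory if it is enforced, not just remembered. One way to encode it is a dry-run checklist script that refuses to proceed without evidence of a pre-update snapshot; the marker-file convention and function name below are assumptions, and the real `qm`/`apt` commands are only echoed, never executed:

```shell
#!/bin/sh
# Hedged sketch: the backup -> host -> firewall -> verify order as a guarded
# dry-run checklist. Marker-file convention is illustrative only.
run_update_sequence() {
  marker=$1   # file proving the pre-update snapshot/backup was taken
  if [ ! -f "$marker" ]; then
    echo "ABORT: no pre-update snapshot marker at $marker"
    return 1
  fi
  echo "1. qm snapshot <VMID> pre-opnsense-update   (already recorded)"
  echo "2. apt update && apt full-upgrade; reboot   (Proxmox host)"
  echo "3. System > Firmware > Updates              (minor before major)"
  echo "4. Re-run Part 5 validation                 (VLAN isolation, VPN ingress)"
}

# Demo: with a marker present, the ordered plan prints.
m=$(mktemp)
run_update_sequence "$m"
```

Extending the guard to also call the RPO check from the backup section keeps "I forgot the snapshot" out of the failure modes entirely.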
Recovery flows by failure type
Break incidents into three buckets and assign owners, tools, and success criteria to each.
- OPNsense-only failure: misconfiguration or failed update while the hypervisor and switch remain healthy.
- Proxmox host failure: hardware issue or kernel panic takes down the entire hypervisor (and therefore the firewall VM).
- Facility-wide or storage failure: PBS, NAS, or the power/UPS layer fails alongside the hypervisor.
Recovery checklist
| Failure type | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| OPNsense-only | Use the Proxmox console to reach OPNsense’s restore-configuration option | Select the newest `.bak`/XML and restore | Reboot and confirm VLAN interfaces under `Interfaces > Overview` |
| Proxmox host down | Boot a cold-standby mini PC with OPNsense ISO + USB NICs to maintain basic connectivity | Restore the latest PBS/NAS backup onto repaired hardware or the standby | Reassign switch trunk ports to the new NICs and retest policies |
| Facility-wide/storage failure | Stabilize power/UPS, fetch backups from off-site storage | Rebuild the minimum viable firewall first, then restore remaining services | Document actual RTO/RPO and gaps for the postmortem |
After each step, tick the corresponding item in your Runbook and log the actual time taken so you can adjust RTO/RPO assumptions later.
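Logging the actual time taken is easiest when the drill itself records it. A minimal sketch of a drill timer (the log path, field names, and 15-minute target are illustrative assumptions):

```shell
#!/bin/sh
# Hedged sketch: time a recovery drill and append the measured RTO to a
# runbook log for comparison against the target. Values are placeholders.
drill_log=$(mktemp)
target_min=15

start=$(date +%s)
# ... perform the restore steps for the scenario under test here ...
sleep 1   # stand-in for the actual drill
end=$(date +%s)

# Round elapsed seconds up to whole minutes.
elapsed_min=$(( (end - start + 59) / 60 ))
status=OK
[ "$elapsed_min" -gt "$target_min" ] && status=OVER_TARGET

printf '%s scenario=opnsense-only rto_min=%s status=%s\n' \
  "$(date -u +%FT%TZ)" "$elapsed_min" "$status" >> "$drill_log"
cat "$drill_log"
```

A few quarters of these log lines give you real data for adjusting the RTO/RPO assumptions instead of guessing.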
When dedicated hardware is the better answer
Consider moving OPNsense off Proxmox when any of the following apply. Virtual firewalls turn the hypervisor into a single point of failure, so add these checks to your quarterly reviews.
- RTO under five minutes: the firewall must stay up while Proxmox reboots, so eliminating VM boot time becomes mandatory.
- Throughput exceeds 1 Gbps: VirtIO/VMXNET3 plus IDS/IPS and TLS inspection hit CPU bottlenecks quickly.
- You need HA: CARP/VRRP demands two physical appliances; running both firewalls on the same hypervisor defeats the purpose.
- Security policy forbids shared hypervisors: auditors may require dedicated hardware per security zone.
- Boot-order dependencies exist: if Proxmox must boot before OPNsense and OPNsense must be up for Proxmox to be reachable, break the cycle with separate hardware.
| Condition | Stay virtualized | Move to dedicated hardware |
|---|---|---|
| Acceptable downtime ≥ 30 min | ✅ | |
| Acceptable downtime ≤ 5 min | | ✅ |
| Aggregate traffic ≤ 1 Gbps | ✅ | |
| IDS/IPS + TLS inspection > 2 Gbps | | ✅ |
| CAPEX/space limited | ✅ | |
| Separate audit/compliance requirements | | ✅ |
Common mistakes
- Backup ≠ restore test: exporting XML without ever restoring it leaves you blind during outages.
- Updating without snapshots: if the OPNsense patch fails, you’re stuck booting from ISO unless a snapshot exists.
- Pointing guest VLAN DNS to internal resolvers: guests can still enumerate internal hosts; use public DNS or split-horizon filtering.
- Same UPS for hypervisor and firewall: when that UPS trips, both layers go down. Give the firewall a separate UPS or cold standby.
- No plan for hardware migration: by the time you need dedicated gear, lead times delay the cutover.
Wrap-up
Production-ready network segmentation needs an operational backbone. Keep these guardrails close to your daily/weekly checklist.
- Separate backup tiers: keep OPNsense configs, Proxmox VM images, and Runbooks in different locations, and store encryption keys separately.
- Serial update workflow: obey the backup → Proxmox → OPNsense → test order every time, and prune old snapshots afterwards.
- Exercise the recovery playbooks: boot the cold standby, rehearse PBS restores, and run `opnsense-backup restore` on a test VM every quarter.
- Plan for dedicated hardware: track traffic, RTO/RPO, and compliance requirements so you know when virtualization stops meeting the bar.
Practice the Runbooks quarterly so you can recover under pressure instead of relearning how Proxmox and OPNsense interact when everything is already on fire.