Resilience is a design choice, not a product
Networks fail in predictable ways — a cable is cut during construction, a switch power supply dies, a building loses mains power, an internet link drops. The difference between a network that shrugs these off and one that goes dark is not a single magic box; it is a series of deliberate design choices made at every layer. High availability is built up from the link, the device, the power and the edge, each contributing a backup for a different kind of failure.
This guide walks those layers from the cable up: how spanning tree and link aggregation protect paths, how dual cores and stacking protect against device failure, how power design rides through outages, and how gateway and WAN failover keep a site online. The right amount of each depends on what downtime costs you, a theme we return to throughout.
- Link — aggregation and redundant uplinks, so no single cable kills it
- Device — dual cores and stacking, so no single box kills it
- Power — UPS and dual supplies to ride through outages
- Edge — gateway and WAN failover to stay online
Link redundancy: never trust a single cable
The simplest failure is a single cable, and the simplest redundancy is a second path. Link aggregation (using LACP) bonds several physical links between two switches into one logical link. The payoff is twofold: more aggregate bandwidth, and resilience. If one cable in the bundle fails, traffic continues over the survivors with no manual intervention and at most a momentary interruption. For uplinks between an access switch and the core, an aggregated pair is a cheap, high-value safeguard.
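To make the behaviour of a bundle concrete, here is a minimal sketch of how flows might be spread across member links and re-spread when one fails. It is an illustration only: real switches hash in hardware with vendor-specific algorithms, and the link names, addresses and the `pick_member` function are hypothetical.

```python
import zlib

def pick_member(flow, members):
    """Map a flow to one live member link by hashing its identifying fields."""
    if not members:
        raise ValueError("no live links left in the bundle")
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

# Hypothetical flows: (source IP, destination IP, destination port).
flows = [("10.0.0.%d" % i, "10.0.1.10", 443) for i in range(1, 9)]

bundle = ["uplink-1", "uplink-2"]
before = {flow: pick_member(flow, bundle) for flow in flows}

# Simulate a cable failure: the failed member leaves the bundle and
# every flow is re-hashed across the survivors.
survivors = [link for link in bundle if link != "uplink-1"]
after = {flow: pick_member(flow, survivors) for flow in flows}
```

Because each flow is pinned to one member, a single flow never exceeds the speed of one physical link; the extra bandwidth shows up only across many flows.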
Beyond bonding, you can run genuinely diverse uplinks — two links to two different upstream switches — so that even the failure of an entire upstream device leaves a path standing. The transceivers and fibre that carry these uplinks should themselves be planned with spares in mind, a point covered in our optics buyer’s guide.
Spanning tree: redundancy without loops
The moment you wire redundant links between switches, you risk a loop, and a Layer 2 loop is catastrophic: broadcast frames circulate and multiply until the network collapses. Spanning Tree Protocol (STP, and its faster successors RSTP and MSTP) solves this by computing a loop-free path and holding the redundant links in a blocking state. If the active path fails, the protocol unblocks a standby link automatically. Classic STP typically needs 30 to 50 seconds to do so with default timers; RSTP usually restores connectivity within a few seconds or less.
Modern RSTP converges far faster than the original STP, and MSTP lets different VLANs use different paths for better load distribution. The key point is that spanning tree is what makes redundant Layer 2 wiring safe: it gives you backup paths without the loops that backup paths would otherwise create. Every managed switch in a resilient design should have it correctly configured.
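The idea of a loop-free path with blocked standby links can be sketched in a few lines. This is a deliberate simplification, not the protocol itself: it elects the root by lowest bridge ID and builds a tree by breadth-first search, treating every link as equal cost, whereas real spanning tree compares path cost, bridge ID and port ID. The switch names and bridge IDs are hypothetical.

```python
from collections import deque

def loop_free_tree(links, bridge_ids):
    """Return (root, forwarding, blocked) for a switch topology.

    Simplification: every link has equal cost and ties are broken by
    link order, unlike real STP.
    """
    root = min(bridge_ids, key=bridge_ids.get)  # lowest bridge ID wins
    neighbours = {name: [] for name in bridge_ids}
    for a, b in links:
        neighbours[a].append(b)
        neighbours[b].append(a)

    forwarding = set()
    visited = {root}
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in neighbours[switch]:
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Any link not needed to reach a switch is held back to prevent a loop.
    blocked = {frozenset(link) for link in links} - forwarding
    return root, forwarding, blocked

bridge_ids = {"core-a": 4096, "core-b": 8192, "access-1": 32768}
links = [("core-a", "core-b"), ("core-a", "access-1"), ("core-b", "access-1")]
root, forwarding, blocked = loop_free_tree(links, bridge_ids)

# Fail the active uplink and recompute: the previously blocked link takes over.
remaining = [link for link in links if link != ("core-a", "access-1")]
root_after, forwarding_after, blocked_after = loop_free_tree(remaining, bridge_ids)
```

In the triangle above, the link between core-b and access-1 starts out blocked and moves to forwarding once the uplink to core-a is removed.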
Device redundancy: no single box
Links are only half the story; the devices themselves fail. At the core, the answer is two switches rather than one, with access switches uplinking to both. If one core fails, the other carries the load, so each must be sized to handle the full traffic on its own. Stacking complements this by letting several switches behave as one logical unit with a shared control plane, so a link or member can fail without taking down the stack.
This is why core and aggregation selection matters so much — these are the devices whose failure affects everyone, so they are where redundancy is most worth the spend. Immunity’s NetForce L3 range supports the dual-core and stacking patterns that underpin a resilient core, and our guide to choosing a switch covers sizing them.
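The value of a second core can be put in rough numbers. The sketch below assumes the two cores fail independently, which the later section on correlated failures shows is not automatic, and it uses an assumed availability figure rather than a measured one.

```python
HOURS_PER_YEAR = 8760

def parallel(*availabilities):
    """Availability of redundant units where any one surviving is enough."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable

def downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

single_core = 0.999  # assumed figure for illustration only
dual_core = parallel(single_core, single_core)

single_downtime = downtime_hours(single_core)  # roughly 8.76 hours a year
dual_downtime = downtime_hours(dual_core)      # roughly half a minute a year
```

The gap between those two figures is the theoretical ceiling; shared power, shared software bugs and failover time all eat into it in practice.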
Power: the failure everyone forgets
A perfectly redundant network still dies if its switches lose power. UPS protection on critical switches rides through short outages and gives generators time to start; dual power supplies in core devices mean a single PSU failure does not down the box. PoE-heavy access switches need power protection sized for their full load, including the access points and cameras they feed, or a power blip cascades into a wireless and surveillance outage.
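Sizing power protection for a PoE switch is simple arithmetic once the full load is counted. The following is a rough sketch with illustrative wattages and battery capacity, all of them assumptions; it ignores power-supply losses in the switch and the way real batteries deliver less capacity at high load and with age, so treat the result as optimistic.

```python
def poe_switch_load_w(switch_base_w, powered_devices_w):
    """Total draw the UPS must carry: the switch itself plus every PoE device."""
    return switch_base_w + sum(powered_devices_w)

def ups_runtime_minutes(battery_wh, load_w, inverter_efficiency=0.9):
    """Rough runtime estimate from usable battery energy and load."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * inverter_efficiency / load_w * 60

# All figures below are illustrative assumptions.
access_points = [15.0] * 12   # twelve access points at 15 W each
cameras = [8.0] * 6           # six cameras at 8 W each
load = poe_switch_load_w(60.0, access_points + cameras)   # 288 W in total
runtime = ups_runtime_minutes(battery_wh=500.0, load_w=load)
```

Leaving the access points and cameras out of the sum would suggest a runtime several times longer than the UPS can actually deliver.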
Power redundancy is often the cheapest high-availability investment per hour of downtime avoided, yet it is the one most frequently overlooked. A resilient network design that ignores power is only half a design — the cable and device redundancy mean nothing the moment the rack goes dark.
Gateway and WAN failover
For most sites, the internet link is the lifeline, and a single WAN connection is a single point of failure. WAN failover at the gateway uses a second internet connection — ideally over a different medium, such as a cellular backup to a fibre primary — and switches to it automatically when the primary drops. Users may notice a brief blip; the site stays online.
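The switching logic behind automatic failover can be sketched as a small state machine driven by health probes. This is a generic illustration, not the behaviour of any particular gateway: the class name, the thresholds and the idea of requiring more good probes to fail back than bad probes to fail over are all assumptions chosen to avoid flapping on a marginal link.

```python
class WanFailover:
    """Choose the active WAN link from periodic health-probe results."""

    def __init__(self, fail_after=3, recover_after=5):
        self.fail_after = fail_after        # consecutive failed probes before failover
        self.recover_after = recover_after  # consecutive good probes before failback
        self.active = "primary"
        self._bad = 0
        self._good = 0

    def record_primary_probe(self, ok):
        """Feed one probe result for the primary link; return the active link."""
        if ok:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0

        if self.active == "primary" and self._bad >= self.fail_after:
            self.active = "backup"
        elif self.active == "backup" and self._good >= self.recover_after:
            self.active = "primary"
        return self.active

wan = WanFailover()
probe_results = [True, False, False, False, True, True, True, True, True]
history = [wan.record_primary_probe(ok) for ok in probe_results]
```

With these thresholds the site moves to the backup on the third consecutive failed probe and returns to the primary only after five consecutive good ones.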
The gateway itself should be resilient too, since it sits in the path of all internet and inter-site traffic. Immunity’s Gateway Controller handles WAN failover and edge security, so a dropped primary link or a single fault does not isolate the site. For multi-site organisations, this edge resilience is what keeps branches productive through local outages.
Matching redundancy to risk
Redundancy costs money, and not every site warrants the full treatment. The disciplined approach is to map redundancy to the cost of downtime. A hospital, airport or payment-processing site justifies dual cores, redundant uplinks, UPS power and WAN failover, because an outage there is dangerous or hugely expensive. A small back office might be well served by a UPS and a spare switch on the shelf.
Work through each layer and ask “what does it cost us if this fails, and what does protecting it cost?” That calculation, not a blanket rule, should set how much resilience each site gets. Over-building everywhere wastes budget; under-building the critical sites invites disaster. The art is putting the spend where the risk actually is.
- Map redundancy to the cost of downtime at each site
- Critical sites — dual cores, redundant uplinks, UPS, WAN failover
- Small offices — often just a UPS and a spare switch on the shelf
- Over-building everywhere wastes budget; under-building the critical sites invites disaster
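The cost question above can be written down as a simple comparison. The sketch below is one way to frame it, with invented outage hours and costs; it covers only financial loss, so it understates the case at sites where an outage is a safety issue.

```python
def expected_annual_loss(outage_hours_per_year, cost_per_hour):
    """Expected yearly cost of downtime."""
    return outage_hours_per_year * cost_per_hour

def worth_it(outage_hours_without, outage_hours_with,
             cost_per_hour, annual_cost_of_protection):
    """Return (net_benefit, decision) for one redundancy measure at one site."""
    avoided = expected_annual_loss(
        outage_hours_without - outage_hours_with, cost_per_hour
    )
    net = avoided - annual_cost_of_protection
    return net, net > 0

# Illustrative assumptions: the same protection, priced the same,
# at two sites with very different downtime costs.
payment_site = worth_it(4.0, 0.5, cost_per_hour=50_000, annual_cost_of_protection=20_000)
back_office = worth_it(4.0, 0.5, cost_per_hour=200, annual_cost_of_protection=20_000)
```

The same measure pays for itself many times over at the payment site and loses money at the back office, which is the argument for deciding per site rather than by blanket rule.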
Testing: redundancy you never test isn’t real
The cruel truth of high availability is that untested redundancy often fails when called upon — a backup link that was never validated, a failover that was misconfigured, a UPS battery that died unnoticed. Resilience must be tested: pull a cable and confirm traffic re-routes, fail a power supply and confirm the box stays up, drop the primary WAN and confirm the backup takes over within the expected window.
Scheduled failover testing turns assumptions into evidence. It also surfaces the slow degradations — an ageing battery, a transceiver drifting toward threshold — before they coincide with a real failure. A network that has rehearsed its failures handles the real ones calmly.
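A failover test is only evidence if the outage is measured. One simple approach, sketched here under the assumption that you run a steady probe (such as a ping) through the path while inducing the failure, is to compute the longest gap between successful probes and compare it with the expected window. The probe data and the three-second target are invented for illustration.

```python
def longest_outage_seconds(probes):
    """Upper bound on the longest outage seen in a probe log.

    `probes` is a time-ordered list of (timestamp_seconds, succeeded) pairs.
    An outage is measured from the last success before a loss to the first
    success after it. A loss that has not recovered by the end of the log
    is not counted.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for timestamp, ok in probes:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest

# Hypothetical log: probes every 0.5 s, cable pulled just before t = 1.0.
probes = [(0.0, True), (0.5, True), (1.0, False), (1.5, False),
          (2.0, False), (2.5, True), (3.0, True)]
measured = longest_outage_seconds(probes)
target_seconds = 3.0
passed = measured <= target_seconds
```

A shorter probe interval tightens the measurement; with half-second probes the true outage here could be anywhere between one and two seconds.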
Visibility ties it together
Redundancy and monitoring are partners. When a backup path activates or a power supply fails, the network is now running without its safety margin, and you need to know immediately so you can restore the redundancy before a second failure bites. Without visibility, a network can quietly burn through its redundancy and you only discover it when the last path fails.
A cloud control plane with AIOps watches every link, device, power supply and WAN connection across the fleet, alerting the moment a redundant element is consumed. Immunity’s Net Cloud turns a resilient design into a resilient operation — surfacing the silent failures that would otherwise erode your high availability unnoticed.
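The rule behind a consumed-redundancy alert is straightforward to express. The sketch below is a generic illustration and does not represent any product's API: the group names, member names, severity labels and the `consumed_redundancy` function are all hypothetical.

```python
def consumed_redundancy(groups):
    """Return alerts for redundant groups that are degraded or down.

    `groups` maps a group name to {member_name: is_healthy}.
    """
    alerts = []
    for name, members in sorted(groups.items()):
        healthy = [member for member, ok in members.items() if ok]
        if not healthy:
            alerts.append((name, "DOWN", "no healthy members"))
        elif len(healthy) < len(members):
            failed = sorted(set(members) - set(healthy))
            # One survivor means the next failure is an outage.
            severity = "NO-MARGIN" if len(healthy) == 1 else "DEGRADED"
            alerts.append((name, severity, "failed: " + ", ".join(failed)))
    return alerts

groups = {
    "core-uplink-lag": {"te1/0/1": True, "te1/0/2": False},
    "core-psu": {"psu-a": True, "psu-b": True},
    "wan": {"fibre": False, "cellular": True},
}
alerts = consumed_redundancy(groups)
```

Nothing in this example is down from a user's point of view, yet two of the three groups are one failure away from an outage, which is exactly the condition that goes unnoticed without monitoring.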
First-hop redundancy: a gateway that never disappears
Every device on a subnet points at a default gateway, and if that gateway vanishes, the whole subnet loses its route off-network even if every cable is intact. First-hop redundancy protocols such as VRRP and HSRP solve this by letting two routing devices share a single virtual gateway address: if the active one fails, the standby takes over the address, typically within a few seconds at default timers and faster when tuned, and end devices need no reconfiguration. On a dual-core design, this is what makes the failover between cores seamless rather than something users have to wait out.
It is an easily forgotten layer — the links and devices are redundant, but the gateway address must be redundant too, or you have simply moved the single point of failure up a level. Pairing first-hop redundancy with your dual L3 cores closes that gap and is standard practice in any serious high-availability design.
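The takeover logic can be sketched using VRRP as the model. The timer formulas below follow the ones published for VRRP (a backup waits three advertisement intervals plus a priority-based skew before taking over); the election function is a simplification, the router names and priorities are hypothetical, and the tie-break by name stands in for the protocol's tie-break by IP address.

```python
def skew_time(priority, advert_interval=1.0):
    """Priority-based offset so the highest-priority backup times out first."""
    return (256 - priority) * advert_interval / 256

def master_down_interval(priority, advert_interval=1.0):
    """How long a backup waits without hearing the master before taking over."""
    return 3 * advert_interval + skew_time(priority, advert_interval)

def elect_master(routers):
    """Pick the live router with the highest priority.

    `routers` maps a name to (priority, alive).
    """
    live = {name: prio for name, (prio, alive) in routers.items() if alive}
    if not live:
        return None
    return max(live, key=lambda name: (live[name], name))

routers = {"core-a": (200, True), "core-b": (100, True)}
owner_before = elect_master(routers)

routers["core-a"] = (200, False)            # core-a fails
owner_after = elect_master(routers)
takeover_delay = master_down_interval(100)  # seconds, with a 1 s advertisement interval
```

With a one-second advertisement interval the standby in this example waits a little over three and a half seconds before claiming the address, which is why shorter intervals are used where faster failover matters.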
Convergence time: how fast is fast enough
Redundancy is judged not only on whether it works but on how quickly. Convergence time is the gap between a failure and the backup taking over. It ranges from sub-second for well-tuned link aggregation and first-hop redundancy, through a few seconds for RSTP, to tens of seconds for legacy STP with default timers. The right target depends on the application: a file copy shrugs off a two-second blip, but a voice call or a live clinical feed may not.
Designing for fast convergence means choosing the right protocols and tuning them — RSTP over legacy STP, aggregated links for instant failover, first-hop redundancy for the gateway. Match the convergence target to what your most sensitive application can tolerate, and validate it by actually inducing failures rather than trusting the datasheet.
Avoiding correlated failures
Redundancy only helps if the backup does not share the primary’s fate. Two uplinks in the same conduit are both cut by the same backhoe; two power supplies on the same circuit both die in the same outage; two cores in the same rack both drown in the same leak. Diversity — separate paths, separate circuits, separate locations — is what turns nominal redundancy into real resilience.
When you design redundancy, trace each pair back to its shared dependencies and break them where it matters. Diverse fibre routes, separate power feeds, and physically separated core switches cost a little more to plan but eliminate the correlated failures that otherwise defeat the whole exercise. Resilience on paper is not resilience until the backup is genuinely independent.
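Tracing a redundant pair back to its shared dependencies is easy to automate once the dependencies are recorded. A minimal sketch, assuming you keep an inventory of which conduit, power circuit and rack each element relies on; the field names and values here are hypothetical.

```python
def shared_fate(primary, backup):
    """Return the dependencies two supposedly redundant elements have in common."""
    return {
        kind: primary[kind]
        for kind in primary.keys() & backup.keys()
        if primary[kind] == backup[kind]
    }

uplink_a = {"conduit": "riser-east", "power_circuit": "A", "rack": "R1"}
uplink_b = {"conduit": "riser-east", "power_circuit": "B", "rack": "R2"}

common = shared_fate(uplink_a, uplink_b)
```

Here the two uplinks use separate circuits and racks but the same riser, so a single cut conduit would still take out both.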
Documenting and operating a resilient network
A resilient network is also a well-documented one. When a failure occurs and the backup takes over, the team needs to know immediately what failed, what is now carrying the load, and that the network is running without its margin until repaired. Clear documentation of every redundant path and a monitoring platform that flags consumed redundancy turn a silent degradation into an actionable alert.
This is where design meets operations. Immunity’s Net Cloud continuously watches links, devices, power and WAN across the fleet and raises an alert the instant a backup is in use, so the safety margin is restored before a second failure can combine with the first. A resilient design without this visibility slowly and invisibly erodes; with it, resilience stays real over the network’s whole life.
Putting a resilient design together
A complete high-availability site looks like this: access switches with aggregated, diverse uplinks; a pair of L3 cores or a stack; spanning tree configured for fast, safe failover; UPS and dual power on critical devices; a gateway with WAN failover; and a monitoring platform that alerts the instant any redundancy is used. Each layer covers a different failure, and together they mean no single event takes the site down.
You do not have to build all of it everywhere — match it to risk, and grow it as the stakes rise. If you would like help designing resilience for your sites, our engineers will work from your availability targets and budget to place redundancy where it earns its keep, using the switching, routing and gateway hardware that supports it.
