TL;DR
- N+1 means N units are required to carry the load and one additional unit is provided as spare. If any single unit fails, the spare picks up the load with no service impact.
- The dominant redundancy model for UPS, chillers, pumps, CRAH units, generators, and PDUs in Tier II and Tier III data centres.
- Concurrent maintainability (Tier III) is achievable with N+1 across every critical system, provided distribution paths allow the spare to be activated without disrupting service.
- Cheaper than 2N (one extra unit rather than duplication of the whole system), but with narrower fault coverage: only one fault at a time is tolerated.
Definition
N+1 redundancy means that if the load requires N units of a given resource (UPS modules, chillers, generators, pumps, fans), the system has N+1 units installed. Any single unit can fail or be taken out of service for maintenance, and the remaining N units carry the full load.
It is the most widely deployed redundancy model in modern data centres because it tolerates the single-component failures and planned maintenance events that account for the overwhelming majority of operational disruptions, at a fraction of the cost of duplicate systems.
Typical Applications
| System | Common N+1 implementation |
|---|---|
| UPS | 4× 500 kW modules for a 1.5 MW load: 3 required, 1 spare |
| Generators | 3× 2 MW gensets for a 4 MW critical load: 2 required, 1 spare |
| Chillers | 4× 800 kW chillers for a 2.4 MW thermal load: 3 required, 1 spare |
| Pumps | Triplex (2 duty + 1 standby) on chilled water |
| CRAH units | 10 units in a row sized so 9 can carry the heat |
| PDU feeders | 2 PDUs per rack, each rated to carry the full rack load on its own (N = 1) |
| Fan walls (modular) | EC fan arrays where any single fan can fail without impact |
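The unit counts in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `required_units` and `spare_units` are hypothetical, and it assumes identical units whose capacities simply add up, which ignores derating and load-sharing limits.

```python
import math

def required_units(load_kw: float, unit_kw: float) -> int:
    """N: the minimum number of identical units needed to carry the load."""
    return math.ceil(load_kw / unit_kw)

def spare_units(installed: int, load_kw: float, unit_kw: float) -> int:
    """Units installed beyond N (1 means N+1, 0 means no redundancy)."""
    return installed - required_units(load_kw, unit_kw)

# (system, installed units, unit rating in kW, load in kW), taken from the table
systems = [
    ("UPS", 4, 500, 1500),
    ("Generators", 3, 2000, 4000),
    ("Chillers", 4, 800, 2400),
]

for name, installed, unit_kw, load_kw in systems:
    n = required_units(load_kw, unit_kw)
    spare = spare_units(installed, load_kw, unit_kw)
    print(f"{name}: N = {n}, installed = {installed}, spare = {spare}")
```

Each of the three rows comes out with exactly one spare unit, which is what N+1 requires.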
N+1 vs 2N: When Each Is Right
- N+1 covers 95-99% of real-world disruption events: single-component failures and planned maintenance. The remainder involves concurrent failures, which is what 2N exists to handle.
- Capex difference: 2N is typically 1.6-1.9× N+1. The marginal cost buys protection against concurrent failures.
- Operating model: N+1 sites must schedule maintenance carefully, because the site runs at N with no spare while a unit is out of service. 2N sites can perform maintenance at any time.
- Tier III requires N+1 plus concurrent maintainability (multiple distribution paths). Tier IV requires fault tolerance, which is typically implemented as 2N or 2(N+1).
- Hyperscale model: many hyperscalers run N+1 at the site level and achieve fault tolerance via inter-site replication. The customer experience is fault-tolerant even though no single site is.
Distribution and the 'Concurrently Maintainable' Trap
N+1 redundancy at the unit level does not guarantee concurrent maintainability of the system as a whole. If all four UPS modules feed a single distribution bus, then maintenance on the bus takes the whole load down even though the units are N+1.
Tier III certification therefore requires not only N+1 on the units, but multiple distribution paths so that any one path can be taken out of service while load continues to be carried via the others. This is the engineering subtlety that distinguishes a 'redundant' build from a 'concurrently maintainable' one.
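One way to make this distinction concrete is to remove each component in turn, whether a unit or a shared piece of distribution, and check whether the remaining capacity still covers the load. The following is a minimal sketch under simplifying assumptions: the `Unit` class, `can_deliver`, `blocking_components`, and the names `bus-A` and `bus-B` are all hypothetical, and the model ignores the capacity limits of the paths themselves.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str
    capacity_kw: float
    # Alternative distribution paths; each path is the set of shared
    # components (buses, switchboards, ...) that it passes through.
    paths: tuple

def can_deliver(unit, removed):
    """A unit delivers if it is in service and at least one path avoids `removed`."""
    return unit.name != removed and any(removed not in p for p in unit.paths)

def blocking_components(units, load_kw):
    """Components that cannot be taken out of service without dropping load."""
    components = {u.name for u in units} | {
        c for u in units for p in u.paths for c in p
    }
    return sorted(
        comp for comp in components
        if sum(u.capacity_kw for u in units if can_deliver(u, comp)) < load_kw
    )

# Four 500 kW UPS modules feeding a single bus (the case described above).
single_bus = [
    Unit(f"UPS-{i}", 500, (frozenset({"bus-A"}),)) for i in range(1, 5)
]
# The same modules, each able to feed the load via either of two buses.
dual_path = [
    Unit(f"UPS-{i}", 500, (frozenset({"bus-A"}), frozenset({"bus-B"})))
    for i in range(1, 5)
]

print(blocking_components(single_bus, 1500))  # ['bus-A']
print(blocking_components(dual_path, 1500))   # []
```

In both layouts any single UPS module can be removed, so both are N+1 at the unit level. Only the second layout has no blocking component, which is the property concurrent maintainability asks for.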
Operational Pitfalls
- Hidden single points of failure: a single fuel manifold serving N+1 generators is a single point of failure. Trace the full topology before claiming N+1.
- Load growth: 'N+1' calculated for a half-loaded site can become 'N' or worse as load grows. Reassess as the site fills.
- Maintenance during peak: doing UPS maintenance during a hot afternoon means losing thermal headroom at the same time as electrical headroom. Schedule with care.
- Documented vs actual: paper redundancy and real redundancy diverge over the years. Audit annually.
- Spare-mode tests: the spare unit must be tested regularly under load. Untested spares fail in production at the worst possible moment.
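The load-growth pitfall can be tracked with a simple calculation. The sketch below reuses the 4× 500 kW UPS example and assumes identical units with additive capacity; `redundancy_level` and `max_load_for_n_plus_1` are hypothetical helper names, and the load figures are illustrative rather than measured.

```python
import math

def redundancy_level(installed: int, unit_kw: float, load_kw: float) -> str:
    """Describe redundancy at a given load as 'N+k', 'N', or 'overloaded'."""
    required = math.ceil(load_kw / unit_kw)
    spare = installed - required
    if spare < 0:
        return "overloaded"
    return "N" if spare == 0 else f"N+{spare}"

def max_load_for_n_plus_1(installed: int, unit_kw: float) -> float:
    """Largest load that still leaves one spare unit."""
    return (installed - 1) * unit_kw

# Four 500 kW modules as the site fills up (illustrative load steps).
for load_kw in (750, 1500, 1600, 2100):
    print(f"{load_kw} kW -> {redundancy_level(4, 500, load_kw)}")

print(max_load_for_n_plus_1(4, 500))  # 1500
```

With these numbers the site is N+2 at 750 kW, N+1 at 1,500 kW, plain N at 1,600 kW, and overloaded at 2,100 kW. Re-running a check like this whenever load is added shows when the spare has been used up.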