Cyber Resilience Recovery Framework
What to recover — and in what order
Recovery Enablement
Precondition — validated during tests, not recovered during an incident
Targeted for testing
- Immutable backups and vaults
- Isolated recovery environment (clean subscription / tenant / landing zone)
- Recovery runbooks, credentials, and tooling access
- Break-glass accounts (offline validation)
Why first
If these are compromised or untested, nothing else matters. Many failed recoveries trace back to assuming recovery tooling was available.
You can access clean recovery tooling without touching production. Isolation is confirmed.
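One way to keep these preconditions honest between incidents is to track when each was last validated in a test. The sketch below is illustrative only: the `Precondition` type, the 90-day freshness window, and the dates are assumptions, not part of the framework.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: a precondition counts as stale if its last successful
# test is older than this window. The 90-day value is hypothetical.
MAX_TEST_AGE = timedelta(days=90)


@dataclass(frozen=True)
class Precondition:
    name: str
    last_validated: Optional[date]  # None means never tested


def stale_preconditions(items, today, max_age=MAX_TEST_AGE):
    """Return names of preconditions never tested or tested too long ago."""
    return [
        p.name
        for p in items
        if p.last_validated is None or today - p.last_validated > max_age
    ]


# Example data; the dates are made up for illustration.
preconditions = [
    Precondition("Immutable backups and vaults", date(2024, 5, 1)),
    Precondition("Isolated recovery environment", date(2024, 1, 10)),
    Precondition("Recovery runbooks, credentials, and tooling access", None),
    Precondition("Break-glass accounts (offline validation)", date(2024, 4, 20)),
]

stale = stale_preconditions(preconditions, today=date(2024, 6, 1))
for name in stale:
    print(f"NOT READY: {name}")
```

Any item reported here would block the exercise from claiming that recovery is possible.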
Identity & Trust Anchor
Re-establish who is allowed to do anything
Recover / validate
- Identity provider
- Entra ID / directory service integrity
- Privileged access
- Global Admins
- Emergency access accounts
- Authentication controls
- MFA
- Conditional Access (known-safe mode)
- Directory integrations
- AD sync / federation (only after validation)
Why first
Identity is the trust root. Restoring systems before identity risks re-infection or attacker persistence in the environment.
A small, verified recovery team can authenticate, elevate, and act — nobody else.
Control Plane & Security Baseline
Restore the rules of the environment
Recover / validate
- Access control
- RBAC roles and assignments
- Configuration governance
- Azure Policy
- Management groups / subscriptions
- Secrets & crypto
- Key Vault (keys, certs, secrets)
- Security tooling
- Defender / EDR onboarding
- SIEM workspace availability
Why second
This phase ensures anything you rebuild is governed, logged, and protected from the moment it is created.
You can prove that new resources are created securely and monitored.
Core Infrastructure & Connectivity
Enable systems to exist and communicate safely
Recover / validate
- Networking
- VNets, subnets, routing
- Firewalls, NSGs
- Connectivity
- VPN / ExpressRoute
- Private endpoints
- DNS
- Internal and private resolution
- Platform foundations
- Images, templates, IaC pipelines
Why third
Applications restored without networking or security controls fail silently or reconnect to unsafe dependencies.
During Phase 3, no recovered workload may establish external connectivity until explicitly approved. All traffic must route through validated firewalls and NSGs. Private endpoints must be verified before any data service is reachable. Any deviation requires a documented exception with CISO sign-off.
Clean workloads can communicate only with approved paths.
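The "approved paths only" rule can be checked against observed or planned flows. The following is a minimal sketch, assuming flows are available as `(workload, destination IP)` pairs and that "external" can be approximated by Python's `ipaddress` notion of a non-private address. The workload names and the approved set are hypothetical; a real check would use the environment's own address plan and firewall logs.

```python
import ipaddress


def unapproved_external_flows(flows, approved):
    """Return flows to non-private destinations that were not explicitly approved.

    flows:    iterable of (workload, destination_ip) tuples
    approved: set of (workload, destination_ip) tuples with sign-off
    """
    blocked = []
    for workload, destination in flows:
        # Assumption: non-private address space stands in for "external".
        is_external = not ipaddress.ip_address(destination).is_private
        if is_external and (workload, destination) not in approved:
            blocked.append((workload, destination))
    return blocked


# Hypothetical example data.
flows = [
    ("app-vm-01", "10.1.2.4"),  # internal, allowed by default in this sketch
    ("app-vm-01", "8.8.8.8"),   # external, no approval on record
    ("batch-01", "1.1.1.1"),    # external, explicitly approved
]
approved = {("batch-01", "1.1.1.1")}

violations = unapproved_external_flows(flows, approved)
print(violations)
```

Each violation would need either remediation or the documented exception described above.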
Workloads & Platforms
Rebuild systems, not infections
Recover / rebuild
- Compute
- VMs (clean OS, restored data only)
- VM scale sets
- Platforms
- App Services
- AKS (control plane first, then nodes)
- Schedulers / automation
- Job services
- Batch or integration runtimes
Critical rule
Always rebuild the clean platform first, then restore data into it. Never restore data into an unvalidated environment. Any shortcut risks re-infection and invalidates the test.
Applications start, run, and authenticate without privileged exceptions.
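The critical rule is an ordering constraint: rebuild, validate, then restore. A small state guard makes the shortcut impossible to take by accident. This is a sketch under stated assumptions; the `Workload` class and its method names are hypothetical and stand in for whatever orchestration tooling is actually used.

```python
class RecoveryOrderError(RuntimeError):
    """Raised when a recovery step is attempted out of order."""


class Workload:
    def __init__(self, name):
        self.name = name
        self.platform_rebuilt = False
        self.platform_validated = False
        self.data_restored = False

    def rebuild_platform(self):
        # A fresh rebuild always needs a fresh validation.
        self.platform_rebuilt = True
        self.platform_validated = False

    def validate_platform(self):
        if not self.platform_rebuilt:
            raise RecoveryOrderError(f"{self.name}: nothing rebuilt to validate")
        self.platform_validated = True

    def restore_data(self):
        # Enforces: never restore data into an unvalidated environment.
        if not self.platform_validated:
            raise RecoveryOrderError(f"{self.name}: platform not validated")
        self.data_restored = True


vm = Workload("orders-vm")  # hypothetical workload name
try:
    vm.restore_data()  # shortcut attempt: rejected
except RecoveryOrderError as err:
    print(f"blocked: {err}")

vm.rebuild_platform()
vm.validate_platform()
vm.restore_data()  # correct order: accepted
print(vm.data_restored)
```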
Data & Business Services
Restore what the business actually cares about
Recover / validate
- Tier 0 / Tier 1 data
- Databases
- Transaction systems
- Storage
- File shares
- Object storage
- SaaS data
- Microsoft 365 (Exchange, SharePoint, OneDrive, Teams)
- Application dependencies
- Queues
- Caches
- External APIs
Why last
Data is useless if the platform, security, or identity layers are not trustworthy. This phase is only reached once all prior phases are validated.
Business services are usable, validated, and monitored — not just restored.
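The gating rule that a phase is only reached once all prior phases are validated can be expressed as a simple ordered check. The sketch below assumes validation results are recorded as a set of phase names; the function names are illustrative.

```python
# Phase names in recovery order, taken from the section headings.
PHASES = [
    "Recovery Enablement",
    "Identity & Trust Anchor",
    "Control Plane & Security Baseline",
    "Core Infrastructure & Connectivity",
    "Workloads & Platforms",
    "Data & Business Services",
]


def may_start(phase_index, validated):
    """A phase may start only when every earlier phase is validated."""
    return all(name in validated for name in PHASES[:phase_index])


def next_phase(validated):
    """Return the index of the first unvalidated phase, or None if all are done."""
    for index, name in enumerate(PHASES):
        if name not in validated:
            return index
    return None


validated = {
    "Recovery Enablement",
    "Identity & Trust Anchor",
    "Control Plane & Security Baseline",
}
print(may_start(3, validated))  # infrastructure may begin
print(may_start(5, validated))  # data restore may not
print(next_phase(validated))
```

Note that `next_phase` returns the earliest gap, so a phase validated out of order does not unlock anything after a missing one.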
Final Validation — Business & Governance
Cyber resilience ≠ systems running. Final validation confirms the environment is trustworthy, monitored, and governance-compliant before transitioning out of recovery mode.
- Users can perform critical transactions
- Monitoring and alerts fire correctly
- Logs are retained and available for forensics
- Access is reduced from recovery mode to steady-state permissions
- Evidence is captured for audit and regulatory review
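The step of reducing access from recovery mode to steady state amounts to a set difference between current and intended role assignments. A minimal sketch, assuming assignments can be exported as `(principal, role)` pairs; the principals and roles shown are hypothetical.

```python
def assignments_to_revoke(current, steady_state):
    """Return assignments present now that are not part of the steady-state baseline."""
    return sorted(current - steady_state)


# Hypothetical exports of (principal, role) pairs.
current = {
    ("recovery-team", "Owner"),
    ("breakglass-01", "Global Administrator"),
    ("app-ops", "Contributor"),
}
steady_state = {
    ("app-ops", "Contributor"),
}

for principal, role in assignments_to_revoke(current, steady_state):
    print(f"revoke {role} from {principal}")
```

The resulting list can also be kept as evidence for the audit item above.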
Summary: Phase Order, RTO Targets & Owners
| Phase | Name | Primary goal | RTO target | Owner |
|---|---|---|---|---|
| 0 | Recovery Enablement | Ensure recovery is possible | Always ready | CISO / Cloud Ops |
| 1 | Identity & Trust Anchor | Control who can act | < 2 hours | IAM Lead |
| 2 | Control Plane & Security Baseline | Enforce safe rules | < 4 hours | Cloud Ops / Security |
| 3 | Core Infrastructure & Connectivity | Enable safe communication | < 6 hours | Network / Platform Eng. |
| 4 | Workloads & Platforms | Rebuild clean systems | < 12 hours | Application / Platform Lead |
| 5 | Data & Business Services | Restore business value | < 24–48 hours | Data Lead / Business Owner |
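During an exercise, the RTO targets in the table can be compared with the times actually recorded. The sketch below makes two assumptions that the table does not state: targets are measured in hours from incident declaration, and the Phase 5 range is represented by its upper bound of 48 hours. Phase 0 has no timed target and is omitted. The recorded times are invented for illustration.

```python
# Phase number -> RTO target in hours (assumed to be measured from
# incident declaration; Phase 5 uses the upper bound of its range).
RTO_HOURS = {1: 2, 2: 4, 3: 6, 4: 12, 5: 48}


def rto_breaches(completed_at_hours):
    """Return {phase: hours} for phases validated at or after their target.

    completed_at_hours maps phase number -> hours elapsed when the
    phase was validated. Targets are strict ("<"), so equality is a breach.
    """
    return {
        phase: hours
        for phase, hours in completed_at_hours.items()
        if phase in RTO_HOURS and hours >= RTO_HOURS[phase]
    }


# Hypothetical timings from a test exercise.
recorded = {1: 1.5, 2: 3.0, 3: 7.25, 4: 11.0}
print(rto_breaches(recorded))
```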
Common Test Failure Modes
Most cyber resilience test failures trace to one of the following root causes. Test explicitly for each of them in every exercise.