Cyber Resilience Recovery Framework

What to recover — and in what order

Version 1.0  |  Confidential

Core principle: Cyber resilience is not the same as systems running. A recovery test succeeds only when identity is trusted, controls are enforced, systems are rebuilt clean, and business services are validated, in that order. Skipping or reordering phases is the primary cause of test failure.

Phase 0: Recovery Enablement

Precondition — validated during tests, not recovered during an incident

  • Immutable backups and vaults
  • Isolated recovery environment
    (clean subscription / tenant / landing zone)
  • Recovery runbooks, credentials, and tooling access
  • Break-glass accounts (offline validation)

If these are compromised or untested, nothing else matters. Many failed recoveries trace back to assuming recovery tooling was available.

RTO target: Always ready (pre-incident)
Phase owner: CISO / Cloud Operations Lead
Validated by: Quarterly tabletop exercise

Go / No-Go gate → Phase 1
Recovery team can access clean tooling, credentials, and runbooks without touching production systems.
✓ Pass  |  ✗ Stop

Test outcome

You can access clean recovery tooling without touching production. Isolation is confirmed.

Phase 1: Identity & Trust Anchor

Re-establish who is allowed to do anything

  • Identity provider
    • Entra ID / directory service integrity
  • Privileged access
    • Global Admins
    • Emergency access accounts
  • Authentication controls
    • MFA
    • Conditional Access (known-safe mode)
  • Directory integrations
    • AD sync / federation (only after validation)

Identity is the trust root. Restoring systems before identity risks re-infection or attacker persistence in the environment.

RTO target: < 2 hours
Phase owner: Identity / IAM Lead
Validated by: Security Architecture

Go / No-Go gate → Phase 2
A small, verified recovery team, and only that team, can authenticate, elevate, and act. No uncontrolled access paths remain open.
✓ Pass  |  ✗ Stop

Test outcome

A small, verified recovery team can authenticate, elevate, and act — nobody else.
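
One way to make this gate checkable is to compare the accounts that currently hold privileged roles against the approved recovery roster. The sketch below assumes the role memberships have already been exported from the identity provider into plain Python data. The account names, role names, and the `unexpected_privileged_accounts` helper are hypothetical and are not part of any Entra ID API.

```python
from typing import Dict, Set


def unexpected_privileged_accounts(
    role_assignments: Dict[str, Set[str]],
    approved_team: Set[str],
) -> Dict[str, Set[str]]:
    """Return, per role, the accounts that hold it but are not on the approved roster."""
    findings: Dict[str, Set[str]] = {}
    for role, members in role_assignments.items():
        extra = members - approved_team
        if extra:
            findings[role] = extra
    return findings


# Hypothetical export of privileged role membership (illustrative names only).
assignments = {
    "Global Administrator": {"breakglass-01", "recovery-alice", "svc-legacy-sync"},
    "Privileged Role Administrator": {"recovery-alice"},
}
roster = {"breakglass-01", "recovery-alice", "recovery-bob"}

findings = unexpected_privileged_accounts(assignments, roster)
gate_passed = not findings  # the gate passes only when nothing unexpected is found

print(findings)     # {'Global Administrator': {'svc-legacy-sync'}}
print(gate_passed)  # False
```

An empty result is necessary for a pass but not sufficient: the check covers only the roles included in the export, so other access paths still need separate review.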

Phase 2: Control Plane & Security Baseline

Restore the rules of the environment

  • Access control
    • RBAC roles and assignments
  • Configuration governance
    • Azure Policy
    • Management groups / subscriptions
  • Secrets & crypto
    • Key Vault (keys, certs, secrets)
  • Security tooling
    • Defender / EDR onboarding
    • SIEM workspace availability

This phase ensures anything you rebuild is governed, logged, and protected from the moment it is created.

RTO target: < 4 hours
Phase owner: Cloud Operations / Security Eng.
Validated by: Compliance / Audit

Go / No-Go gate → Phase 3
New resources created during recovery are confirmed secure, governed by policy, and visible in the SIEM. No ungoverned resources are permitted.
✓ Pass  |  ✗ Stop

Test outcome

You can prove that new resources are created securely and monitored.
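
The proof this gate asks for can be reduced to a per-resource check. The sketch below assumes an inventory of resources created during recovery has been gathered from whatever compliance and monitoring tooling is in use. The `RecoveredResource` fields and the `ungoverned` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveredResource:
    name: str
    policy_compliant: bool    # assumed: result of a policy compliance scan
    logs_in_siem: bool        # assumed: resource logs confirmed in the SIEM workspace
    defender_onboarded: bool  # assumed: Defender / EDR agent is reporting


def ungoverned(resources: List[RecoveredResource]) -> List[str]:
    """Names of resources that fail at least one of the three gate conditions."""
    return [
        r.name
        for r in resources
        if not (r.policy_compliant and r.logs_in_siem and r.defender_onboarded)
    ]


# Illustrative inventory; names and values are made up.
resources = [
    RecoveredResource("vm-app-01", True, True, True),
    RecoveredResource("sql-core-01", True, False, True),
    RecoveredResource("stor-temp-01", False, False, False),
]

blocked = ungoverned(resources)
print(blocked)  # ['sql-core-01', 'stor-temp-01']
```

Under this sketch the gate passes only when `blocked` is empty.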

Phase 3: Core Infrastructure & Connectivity

Enable systems to exist and communicate safely

  • Networking
    • VNets, subnets, routing
    • Firewalls, NSGs
  • Connectivity
    • VPN / ExpressRoute
    • Private endpoints
  • DNS
    • Internal and private resolution
  • Platform foundations
    • Images, templates, IaC pipelines

Applications restored without networking or security controls fail silently or reconnect to unsafe dependencies.

Isolation enforcement

During Phase 3, no recovered workload may establish external connectivity until explicitly approved. All traffic must route through validated firewalls and NSGs. Private endpoints must be verified before any data service is reachable. Any deviation requires a documented exception with CISO sign-off.
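
A rough illustration of how the egress part of this rule could be checked follows. It assumes the NSG rules have been exported into simplified dictionaries; the field names, rule names, and the `external_egress_violations` helper are hypothetical. The sketch also deliberately ignores rule priority and platform default rules, so it flags candidates for review and is not an evaluation of effective network access.

```python
from typing import Dict, List, Set

# Destinations treated as "external" for this sketch.
OPEN_DESTINATIONS = {"*", "Internet", "0.0.0.0/0"}


def external_egress_violations(
    rules: List[Dict[str, str]],
    approved_exceptions: Set[str],
) -> List[str]:
    """Names of outbound allow rules to open destinations that lack an approved exception."""
    violations = []
    for rule in rules:
        is_open_egress = (
            rule["direction"] == "Outbound"
            and rule["access"] == "Allow"
            and rule["destination"] in OPEN_DESTINATIONS
        )
        if is_open_egress and rule["name"] not in approved_exceptions:
            violations.append(rule["name"])
    return violations


# Illustrative rule export; names and addresses are made up.
rules = [
    {"name": "allow-to-firewall", "direction": "Outbound", "access": "Allow", "destination": "10.0.0.4"},
    {"name": "allow-any-out", "direction": "Outbound", "access": "Allow", "destination": "Internet"},
    {"name": "deny-all-out", "direction": "Outbound", "access": "Deny", "destination": "*"},
]

violations = external_egress_violations(rules, approved_exceptions=set())
print(violations)  # ['allow-any-out']
```

Passing a rule name in `approved_exceptions` models the documented, CISO-approved exception described above.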

RTO target: < 6 hours
Phase owner: Network / Platform Engineering
Validated by: Security Engineering

Go / No-Go gate → Phase 4
Clean workloads can communicate only via approved paths. All firewall rules are validated. No unauthorized external routes exist.
✓ Pass  |  ✗ Stop

Test outcome

Clean workloads can communicate only via approved paths.

Phase 4: Workloads & Platforms

Rebuild systems, not infections

  • Compute
    • VMs (clean OS, restored data only)
    • VM scale sets
  • Platforms
    • App Services
    • AKS (control plane first, then nodes)
  • Schedulers / automation
    • Job services
    • Batch or integration runtimes

Rebuild before restore

Always rebuild the clean platform first, then restore data into it. Never restore data into an unvalidated environment. Any shortcut risks re-infection and invalidates the test.
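
The ordering rule can be audited after the fact from the test's activity log. The sketch below assumes the log can be reduced to a chronological list of `(workload, action)` pairs; the workload names, the action labels, and the `restored_before_rebuild` helper are hypothetical.

```python
from typing import List, Tuple


def restored_before_rebuild(events: List[Tuple[str, str]]) -> List[str]:
    """Workloads whose first restore happened before any rebuild.

    `events` must be in chronological order; each item is (workload, action),
    where action is "rebuild" or "restore".
    """
    rebuilt = set()
    offenders: List[str] = []
    for workload, action in events:
        if action == "rebuild":
            rebuilt.add(workload)
        elif action == "restore" and workload not in rebuilt and workload not in offenders:
            offenders.append(workload)
    return offenders


# Illustrative log; workload names are made up.
events = [
    ("billing-db", "rebuild"),
    ("billing-db", "restore"),
    ("hr-app", "restore"),   # data restored into a platform that was not yet rebuilt
    ("hr-app", "rebuild"),
]

offenders = restored_before_rebuild(events)
print(offenders)  # ['hr-app']
```

Any workload in `offenders` took the shortcut described above, which by this framework's rule invalidates the test for that workload.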

RTO target: < 12 hours
Phase owner: Application / Platform Lead
Validated by: DevOps / Architecture

Go / No-Go gate → Phase 5
Applications start, run, and authenticate without privileged exceptions. Workloads are confirmed rebuilt from clean source, with no image reused from a potentially compromised state.
✓ Pass  |  ✗ Stop

Test outcome

Applications start, run, and authenticate without privileged exceptions.

Phase 5: Data & Business Services

Restore what the business actually cares about

  • Tier 0 / Tier 1 data
    • Databases
    • Transaction systems
  • Storage
    • File shares
    • Object storage
  • SaaS data
    • Microsoft 365 (Exchange, SharePoint, OneDrive, Teams)
  • Application dependencies
    • Queues
    • Caches
    • External APIs

Data is useless if the platform, security, or identity layers are not trustworthy. This phase is only reached once all prior phases are validated.

RTO target: < 24 hrs (Tier 0) / < 48 hrs (Tier 1)
Phase owner: Data / Database Lead
Validated by: Business Owner / Compliance

Go / No-Go gate → Final validation
Business services are usable and validated by business owners, not just technically restored. Data integrity is confirmed against known-good checksums.
✓ Pass  |  ✗ Stop

Test outcome

Business services are usable, validated, and monitored — not just restored.
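
The checksum comparison named in the gate can be sketched with the standard library alone. The example assumes a manifest of known-good SHA-256 digests exists and comes from a trusted source such as the immutable vault. The file names, manifest layout, and helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_failures(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Manifest entries that are missing under `root` or whose digest does not match."""
    failures = []
    for relative_name, expected in sorted(manifest.items()):
        candidate = root / relative_name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative_name)
    return failures


# Demonstration with temporary files standing in for restored data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ledger.csv").write_bytes(b"id,amount\n1,100\n")
    (root / "orders.csv").write_bytes(b"tampered")
    manifest = {
        "ledger.csv": hashlib.sha256(b"id,amount\n1,100\n").hexdigest(),
        "orders.csv": hashlib.sha256(b"id,total\n7,250\n").hexdigest(),
        "missing.csv": hashlib.sha256(b"").hexdigest(),
    }
    failures = integrity_failures(root, manifest)

print(failures)  # ['missing.csv', 'orders.csv']
```

A matching digest shows only that the restored bytes equal the recorded ones. Business-owner validation is still required by the gate.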

Final Validation — Business & Governance

Cyber resilience ≠ systems running. Final validation confirms the environment is trustworthy, monitored, and governance-compliant before transitioning out of recovery mode.

  • Users can perform critical transactions
  • Monitoring and alerts fire correctly
  • Logs are retained and available for forensics
  • Access is reduced from recovery mode to steady-state permissions
  • Evidence is captured for audit and regulatory review

Summary: Phase Order, RTO Targets & Owners

Phase | Name                     | Primary goal                | RTO target                            | Owner
0     | Recovery Enablement      | Ensure recovery is possible | Always ready                          | CISO / Cloud Ops
1     | Identity & Trust         | Control who can act         | < 2 hours                             | IAM Lead
2     | Control Plane & Security | Enforce safe rules          | < 4 hours                             | Cloud Ops / Security
3     | Infrastructure & Network | Enable safe communication   | < 6 hours                             | Network / Platform Eng.
4     | Workloads & Platforms    | Rebuild clean systems       | < 12 hours                            | Application / Platform Lead
5     | Data & Business Services | Restore business value      | < 24 hrs (Tier 0) / < 48 hrs (Tier 1) | Data Lead / Business Owner
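
The strict phase order can be expressed as a small gate runner that stops at the first failed gate and never evaluates later phases. This is a minimal sketch: the `run_gates` function and the stand-in checks are hypothetical, and real gate checks would call the validation appropriate to each phase.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[int, str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[int]:
    """Evaluate gates in phase order.

    Returns the number of the phase whose gate stopped the test,
    or None if every gate passed.
    """
    for phase, _name, check in sorted(gates, key=lambda gate: gate[0]):
        if not check():
            return phase  # Stop: later phases are not evaluated
    return None


evaluated: List[int] = []


def stub_check(phase: int, result: bool) -> Callable[[], bool]:
    """Stand-in for a real gate check; records that it ran."""
    def check() -> bool:
        evaluated.append(phase)
        return result
    return check


gates = [
    (0, "Recovery Enablement", stub_check(0, True)),
    (1, "Identity & Trust", stub_check(1, True)),
    (2, "Control Plane & Security", stub_check(2, False)),
    (3, "Infrastructure & Network", stub_check(3, True)),
    (4, "Workloads & Platforms", stub_check(4, True)),
    (5, "Data & Business Services", stub_check(5, True)),
]

stopped_at = run_gates(gates)
print(stopped_at)  # 2
print(evaluated)   # [0, 1, 2]
```

Because the Phase 2 gate fails in this example, Phases 3 to 5 are never attempted, which is the behavior the framework's ordering requires.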

Common Test Failure Modes

Most cyber resilience test failures trace to one of the following root causes. Test for each of them explicitly during every exercise.

Failure mode and why it matters

  • Starting with applications or data
    Why it matters: Phases 4–5 depend on Phases 0–3. Skipping earlier phases produces an untrustworthy environment even if services appear to run.
  • Assuming identity or security will be there
    Why it matters: Unvalidated identity is the most common attacker persistence vector. It must be explicitly proven, not assumed.
  • Testing restores instead of rebuild + restore
    Why it matters: A restore test validates backup integrity only. A resilience test must validate the full sequence: clean rebuild, then restore.
  • No isolation enforcement during recovery
    Why it matters: Without isolation, recovered systems may reconnect to compromised dependencies, re-establishing the attack path.
  • No named phase owners
    Why it matters: Absence of ownership means no single point of accountability at each gate. Decisions slow or fail silently.
  • No RTO targets per phase
    Why it matters: Without phase-level RTOs, teams cannot detect that they are already outside recovery tolerances during the test.
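
Detecting an overrun during the test is simple arithmetic once phase-level targets exist. The sketch below uses the RTO figures from the summary table and assumes each target is measured as elapsed hours since the test started. The framework does not state whether the targets are cumulative or per phase, so that reading is an assumption, as is the `phases_over_rto` helper.

```python
from typing import Dict, List

# RTO targets in hours from the summary table; Phase 5 uses the Tier 0 figure.
# Assumption: each target counts elapsed hours since the start of the test.
RTO_HOURS = {1: 2, 2: 4, 3: 6, 4: 12, 5: 24}


def phases_over_rto(gate_passed_at: Dict[int, float], now_hours: float) -> List[int]:
    """Phases whose gate passed late, or has not passed and is already past target."""
    late = []
    for phase, target in sorted(RTO_HOURS.items()):
        finished = gate_passed_at.get(phase)
        elapsed = finished if finished is not None else now_hours
        if elapsed >= target:  # targets are stated as "< N hours"
            late.append(phase)
    return late


# Illustrative timings: Phase 1 passed at 1.5 h, Phase 2 at 4.5 h.
gate_passed_at = {1: 1.5, 2: 4.5}

print(phases_over_rto(gate_passed_at, now_hours=5.0))  # [2]
print(phases_over_rto(gate_passed_at, now_hours=7.0))  # [2, 3]
```

At 7 hours the Phase 3 gate has not passed and its 6-hour target has already elapsed, which is exactly the situation a team without phase-level RTOs would fail to notice.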
