IAM Disaster Recovery: Building Resilient Identity Infrastructure
Design and implement disaster recovery for identity services, covering IdP failover, directory replication, backup strategies, and RTO/RPO planning for IAM systems.
When your identity provider goes down, everything goes down. Users cannot authenticate, applications cannot authorize, APIs cannot validate tokens, and your organization grinds to a halt. Unlike a single application failure that affects one business process, an identity outage is a cascading catastrophe that impacts every connected system simultaneously.
IAM disaster recovery (DR) is not optional. It is the single most critical DR domain in modern IT because identity is the dependency that all other systems share. This guide covers how to design, implement, and test DR for your identity infrastructure.
Prerequisites
- Documented identity architecture — A complete inventory of your IdP, directories, federation services, MFA providers, and all downstream dependencies.
- Business impact analysis — Understanding of which business processes depend on identity services and their tolerance for downtime.
- Executive sponsorship — DR investment requires budget for redundant infrastructure, and leadership must understand the business case.
- Monitoring and alerting — You cannot recover from what you do not detect. Identity service monitoring must be in place before DR planning.
Architecture: Identity DR Framework
Understanding Identity Dependencies
Map your identity dependency chain:
Users / Applications
        ↓
Load Balancer / DNS
        ↓
Identity Provider (Entra ID, Okta, Ping, etc.)
        ↓
- Directory Store → LDAP/AD DS → Database/Replication
- MFA Service → Authenticator App, Push Service, SMS Gateway, FIDO2 Service
- Federation Service → SAML/OIDC Metadata, Certificate Store
Every component in this chain is a potential failure point. Your DR plan must address each one.
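One way to make the chain above actionable is to store it as data and ask which components are affected when one fails. The following is a minimal sketch; the component names and the `DEPENDS_ON` map are illustrative assumptions transcribed from the diagram, not an export from any real tool.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components listed.
DEPENDS_ON = {
    "users_and_apps": ["load_balancer_dns"],
    "load_balancer_dns": ["identity_provider"],
    "identity_provider": ["directory_store", "mfa_service", "federation_service"],
    "directory_store": ["ldap_ad_ds"],
    "ldap_ad_ds": ["database_replication"],
    "mfa_service": ["push_service", "sms_gateway", "fido2_service"],
    "federation_service": ["saml_oidc_metadata", "certificate_store"],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: component -> set of components that depend on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


blast_radius = impacted_by("certificate_store", DEPENDS_ON)
```

This sketch treats every edge as a hard dependency. In practice some edges are alternatives (for example, several MFA methods), so a real map would need to distinguish "requires all" from "requires any".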
RTO and RPO for Identity Services
Recovery Time Objective (RTO) — How quickly must identity services be restored?
For most organizations, identity RTO should be measured in minutes, not hours:
- Tier 1 (Critical): 5-15 minutes — Authentication, token validation, MFA
- Tier 2 (Important): 1-4 hours — Provisioning, access reviews, self-service password reset
- Tier 3 (Standard): 24 hours — Reporting, analytics, non-critical integrations
Recovery Point Objective (RPO) — How much data loss is acceptable?
- Directory data: Near-zero RPO. User accounts, group memberships, and access policies must be current. Even minutes of data loss can mean recently offboarded users regain access.
- Audit logs: 1-hour RPO acceptable if logs are streamed to a separate SIEM in real time.
- Configuration: Near-zero RPO for policies. All conditional access, MFA, and federation configs must be recoverable to the exact pre-failure state.
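The tiers above can be written down as data so that measured outages are checked against them automatically. This is a sketch only: it uses the upper bound of each RTO range, and the service-to-tier assignment in `SERVICE_TIER` is an assumption for illustration.

```python
from datetime import timedelta

# RTO targets from the tiers above (upper bound of each range).
RTO = {
    "tier1": timedelta(minutes=15),
    "tier2": timedelta(hours=4),
    "tier3": timedelta(hours=24),
}

# Hypothetical service-to-tier assignment.
SERVICE_TIER = {
    "authentication": "tier1",
    "token_validation": "tier1",
    "mfa": "tier1",
    "provisioning": "tier2",
    "self_service_password_reset": "tier2",
    "reporting": "tier3",
}


def rto_breaches(outages):
    """Return {service: overrun} for each outage longer than the service's RTO."""
    breaches = {}
    for service, downtime in outages.items():
        target = RTO[SERVICE_TIER[service]]
        if downtime > target:
            breaches[service] = downtime - target
    return breaches


observed = {
    "authentication": timedelta(minutes=22),
    "provisioning": timedelta(hours=2),
}
breaches = rto_breaches(observed)  # authentication overran its target by 7 minutes
```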
Step-by-Step Implementation
Step 1: Architect for High Availability First
DR starts with high availability (HA). Before planning for disaster scenarios, eliminate single points of failure in normal operations.
For cloud IdPs (Entra ID, Okta, Auth0):
Cloud identity providers handle infrastructure HA for you: geo-distributed data centers, automatic failover, and built-in replication. Your responsibilities are:
- Ensure your network can reach the IdP through multiple paths (dual ISPs, SD-WAN).
- For IdP endpoints behind DNS records you control (custom domains, proxies), set TTLs short enough that a failover change takes effect within your RTO.
- Cache the IdP's token signing keys in applications or gateways so that already-issued tokens can still be validated during brief outages.
- Monitor the IdP's status page and subscribe to incident notifications.
For on-premises IdPs (Active Directory, LDAP):
- Deploy a minimum of two domain controllers per site.
- Place domain controllers in different failure domains (different racks, power circuits, network switches).
- Record which domain controllers hold the five FSMO roles, and document how to transfer or seize those roles onto a domain controller in a surviving site.
- Configure Active Directory Sites and Services for optimal replication topology.
- Deploy read-only domain controllers (RODCs) in branch offices.
For hybrid environments:
- Maintain Entra Connect (formerly Azure AD Connect) servers in active-passive configuration.
- Keep a warm standby Connect server in staging mode so that it can be promoted to active within minutes.
- Ensure the sync engine database is backed up and restorable.
Step 2: Implement Directory Replication and Backup
Active Directory backup strategy:
- System State backup — Back up at least two domain controllers' system state daily. System state includes the AD database (NTDS.DIT), SYSVOL, registry, and certificate services.
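- Backup retention: Keep enough backup history to reach back before any deletion you may need to undo, but treat the AD tombstone lifetime (180 days by default in current forests) as the limit of a backup's useful life. A system state backup older than the tombstone lifetime should not be restored, because it can reintroduce objects that the rest of the forest has already purged. To recover a deleted object, you need a backup taken before the deletion and within that window.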
- Backup verification — Monthly test restores to an isolated environment. A backup you have never tested is not a backup.
- Backup storage — Store backups in a location that does not depend on AD for access. If your backup system requires AD authentication to retrieve backups, you have a circular dependency.
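Choosing which backup to restore from is a small but error-prone decision during an incident. The sketch below picks the newest backup taken before a known deletion time; the timestamps are invented for illustration, and it does not check any other property of the backup (such as whether it has been verified).

```python
from datetime import datetime


def pick_restore_point(backups, deleted_at):
    """Return the newest backup taken strictly before `deleted_at`, or None."""
    candidates = [taken_at for taken_at in backups if taken_at < deleted_at]
    return max(candidates) if candidates else None


# Hypothetical daily system state backups.
backups = [
    datetime(2024, 5, 1, 2, 0),
    datetime(2024, 5, 2, 2, 0),
    datetime(2024, 5, 3, 2, 0),
]
deleted_at = datetime(2024, 5, 2, 14, 30)
restore_from = pick_restore_point(backups, deleted_at)  # the 2 May 02:00 backup
```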
Cloud directory backup strategy:
Cloud IdPs typically do not support traditional backup/restore. Instead:
- Export configurations regularly — Use APIs to export conditional access policies, named locations, group configurations, app registrations, and service principals. Store these exports in version control.
- Recycle bin: Rely on the directory's soft-delete feature for quick recovery of deleted objects, and know its retention window (Entra ID keeps soft-deleted users for 30 days before permanent deletion).
- Configuration as code — Manage IdP configuration through Terraform, Pulumi, or the IdP's own IaC tooling. This gives you version history and the ability to redeploy configuration.
Step 3: Plan for IdP Failover Scenarios
Scenario 1: Cloud IdP outage
When your cloud IdP (Entra, Okta) experiences a global outage:
- Token caching — Applications that validate tokens locally (using cached signing keys) can continue operating for the lifetime of existing tokens. Configure token lifetimes strategically — longer lifetimes increase resilience but reduce security responsiveness.
- Cached credentials: Windows devices with cached logon credentials allow users to sign in and access local resources. Configure group policy to cache enough logons for the users who share each device (the default is the 10 most recent users).
- Break-glass procedures: Maintain local admin accounts on critical servers that do not depend on the IdP. Store their credentials offline, for example in sealed envelopes in a physical safe, or in a vault that stays reachable when the IdP is down.
- Secondary IdP: For organizations that cannot tolerate any identity downtime, maintain a secondary IdP (e.g., an on-premises AD FS instance that can be activated if Entra is unavailable). This only helps applications that are configured to trust the secondary IdP directly. It is expensive and complex but provides true failover capability.
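The token caching point above depends on applications keeping a usable copy of the IdP's signing keys when the IdP cannot be reached. The following sketch shows one possible "serve stale on error" cache. `SigningKeyCache` and the `fetch_keys` callable are hypothetical names; in a real service `fetch_keys` would download the IdP's published key set.

```python
import time


class SigningKeyCache:
    """Cache IdP signing keys and serve the last good copy if a refresh fails."""

    def __init__(self, fetch_keys, ttl_seconds=3600, clock=time.monotonic):
        self._fetch_keys = fetch_keys  # callable returning the current key set
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        is_fresh = self._fetched_at is not None and now - self._fetched_at < self._ttl
        if is_fresh:
            return self._keys
        try:
            self._keys = self._fetch_keys()
            self._fetched_at = now
        except Exception:
            # IdP unreachable: fall back to the stale copy if one exists.
            if self._keys is None:
                raise
        return self._keys
```

The trade-off mirrors the one for token lifetimes: serving stale keys keeps validation working during an outage, but a key that the IdP has rotated out stays trusted until the next successful refresh. A production version would likely cap how long stale keys may be served.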
Scenario 2: On-premises AD outage (all domain controllers down)
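- Restore from backup: If the AD database is corrupted, restore a domain controller from the most recent verified system state backup. Mark the restore as authoritative only for objects or subtrees that must overwrite the copies held by other domain controllers.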
- Forest recovery — If the entire forest is compromised, follow Microsoft's AD forest recovery procedure: isolate, restore one DC per domain, verify replication, and rebuild.
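- Cloud continuity: If you use hybrid identity with password hash synchronization, Entra ID holds a copy of the synced identities and their password hashes, so users can continue authenticating to cloud apps even if on-premises AD is completely down. With pass-through authentication or federation, cloud sign-in still depends on on-premises components, so plan a switch to password hash synchronization as the fallback.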
Scenario 3: MFA provider outage
- Multiple MFA methods — Require users to register at least two MFA methods. If push notifications fail, TOTP codes or FIDO2 keys still work.
- Temporary MFA bypass — Pre-configure emergency conditional access policies that relax MFA requirements during provider outages. These policies should require approval from two security team members to activate.
- SMS fallback — While SMS is the weakest MFA factor, it uses different infrastructure than app-based push. Consider allowing SMS as a last-resort fallback during outages.
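The "at least two MFA methods" requirement is easy to state and easy to let drift. A minimal audit sketch follows; it assumes you have already exported each user's registered methods into a dictionary, and the user names and method labels are invented.

```python
def users_needing_second_method(registrations, minimum=2):
    """Return sorted users with fewer than `minimum` distinct MFA methods."""
    return sorted(
        user
        for user, methods in registrations.items()
        if len(set(methods)) < minimum
    )


# Hypothetical export of registered MFA methods per user.
registrations = {
    "alice": ["push", "fido2"],
    "bob": ["push"],
    "carol": ["totp", "totp"],  # duplicate entries count once
    "dave": [],
}
at_risk = users_needing_second_method(registrations)
```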
Step 4: Implement Configuration Backup and Recovery
Identity configuration is as critical as identity data. A restored directory with missing conditional access policies is dangerously exposed.
What to back up:
- Conditional access policies (all policy definitions, named locations, terms of use)
- Application registrations and service principals
- Group definitions and membership rules (dynamic groups)
- Role assignments (PIM configurations)
- Authentication methods policies
- Cross-tenant access settings
- Identity Protection policies
- Custom security attributes
How to back up:
Use Microsoft Graph API or the IdP's equivalent API to export configurations as JSON:
# Example: Export conditional access policies.
# Assumes a prior `az login` with rights to read policies (for example Policy.Read.All).
az rest --method GET \
  --url "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies" \
  --output json > "ca-policies-backup-$(date +%Y%m%d).json"

# Example: Export named locations.
az rest --method GET \
  --url "https://graph.microsoft.com/v1.0/identity/conditionalAccess/namedLocations" \
  --output json > "named-locations-backup-$(date +%Y%m%d).json"
Automate this daily and store exports in a Git repository for version history.
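Daily exports are most useful when something compares them. The sketch below diffs two exports by policy `id`, assuming the Graph-style response shape with a top-level `value` array; the function names and the sample policies are illustrative.

```python
import json


def load_export(path):
    """Read one exported JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def index_by_id(export):
    """Map policy id -> policy from a {"value": [...]} export."""
    return {policy["id"]: policy for policy in export.get("value", [])}


def diff_exports(old, new):
    """Return ids of added, removed, and changed policies between two exports."""
    before, after = index_by_id(old), index_by_id(new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            key for key in before.keys() & after.keys() if before[key] != after[key]
        ),
    }


# Inline sample data; in practice these would come from load_export(...).
yesterday = {"value": [{"id": "a", "state": "enabled"}, {"id": "b", "state": "enabled"}]}
today = {"value": [{"id": "a", "state": "disabled"}, {"id": "c", "state": "enabled"}]}
drift = diff_exports(yesterday, today)
```

Any field difference counts as a change here, including timestamps, so a real comparison might ignore volatile fields before diffing.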
Step 5: Build and Document Recovery Procedures
Recovery runbook template:
For each DR scenario, document:
- Detection — How will we know this failure has occurred? (Monitoring alerts, user reports, vendor status page)
- Assessment — How do we determine the scope of impact? (Which services are affected, how many users)
- Communication — Who do we notify? (Incident commander, security team, leadership, affected users)
- Decision — Under what conditions do we activate DR? (Duration threshold, scope threshold)
- Execution — Step-by-step recovery actions with specific commands and expected outcomes.
- Validation — How do we verify identity services are functioning correctly? (Test authentications, token validation, provisioning)
- Return to normal — Steps to deactivate DR measures and return to primary infrastructure.
- Post-incident — Review, lessons learned, runbook updates.
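If runbooks are kept as structured documents, the template above can be enforced with a small completeness check. This sketch assumes each runbook is a dictionary keyed by section name; the key names are my own rendering of the eight headings.

```python
REQUIRED_SECTIONS = [
    "detection",
    "assessment",
    "communication",
    "decision",
    "execution",
    "validation",
    "return_to_normal",
    "post_incident",
]


def missing_sections(runbook):
    """Return required sections that are absent or empty in a runbook dict."""
    missing = []
    for section in REQUIRED_SECTIONS:
        content = runbook.get(section)
        if isinstance(content, str):
            content = content.strip()
        if not content:
            missing.append(section)
    return missing


# Hypothetical, partially written runbook for an MFA provider outage.
mfa_outage = {
    "detection": "Alert on push notification failure rate.",
    "assessment": "Count affected users per MFA method.",
    "communication": "   ",
    "execution": ["Enable emergency policy", "Notify help desk"],
}
gaps = missing_sections(mfa_outage)
```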
Step 6: Establish Emergency Access
Emergency access procedures ensure that critical administrative actions can be performed even during identity outages.
Break-glass accounts:
- Create at least two cloud-only Global Administrator accounts.
- Use long, complex passwords (20+ characters) stored in a physical safe.
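- Protect them with a factor that does not depend on your regular MFA provider, such as a FIDO2 key stored separately from the password. Password-only accounts may be blocked where the IdP enforces MFA on administrative portals.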
- Exclude from all conditional access policies.
- Configure Azure Monitor alerts on any sign-in to these accounts.
- Test quarterly — verify the accounts can sign in and perform administrative actions.
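Quarterly testing is only reliable if something tracks when each account was last exercised. A minimal sketch, assuming you record the date of the last successful test sign-in per account; the account names are hypothetical and 90 days stands in for "quarterly".

```python
from datetime import date, timedelta


def overdue_tests(last_tested, today, max_age=timedelta(days=90)):
    """Return sorted accounts never tested or last tested more than max_age ago."""
    return sorted(
        account
        for account, tested_on in last_tested.items()
        if tested_on is None or today - tested_on > max_age
    )


# Hypothetical test log for two break-glass accounts plus one never tested.
last_tested = {
    "breakglass-01": date(2024, 1, 10),
    "breakglass-02": date(2024, 4, 15),
    "breakglass-03": None,
}
overdue = overdue_tests(last_tested, today=date(2024, 6, 1))
```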
Local recovery accounts:
- Maintain local administrator accounts on critical servers.
- Store credentials in a Privileged Access Management (PAM) vault that does not depend on the IdP being recovered.
- Document which servers have local accounts and their purposes.
Best Practices
Test Quarterly, Not Annually
Annual DR tests are insufficient for identity services. The identity landscape changes too frequently — new applications, new conditional access policies, new federation trusts. Test quarterly with tabletop exercises and semi-annually with actual failover tests.
Separate Your Monitoring from Your Identity
If your monitoring system authenticates through the same IdP you are monitoring, you will not receive alerts when that IdP fails. Ensure your identity monitoring uses a separate authentication path — API keys, local service accounts, or a different IdP.
Document Your Dependencies Obsessively
Maintain a live dependency map that shows every application, API, and service that depends on each identity component. When an identity service fails, this map tells you exactly what is impacted and in what order to restore.
Automate Recovery Where Possible
Manual recovery procedures are slow and error-prone during high-stress incidents. Automate what you can: failover scripts, configuration redeployment, health checks, and notification workflows. But always maintain manual runbooks as a fallback.
Testing
Tabletop Exercise
Gather your identity team, security team, and key application owners. Present a scenario (e.g., "Entra ID has been experiencing intermittent authentication failures for 30 minutes, impacting all cloud applications") and walk through your response plan. Identify gaps and update procedures.
Controlled Failover Test
In a maintenance window:
- Simulate a component failure (shut down a domain controller, disable a federation trust, block network access to the IdP).
- Verify detection — did monitoring alerts fire?
- Execute recovery procedures — did they work as documented?
- Measure recovery time — did you meet your RTO?
- Validate functionality — can users authenticate and access applications?
- Restore normal operations and verify no data loss.
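Recording three timestamps during the test (failure injected, alert fired, service validated) is enough to score it. The sketch below derives time to detect and time to recover and compares the latter with the RTO; the timestamps and the 15-minute target are example values.

```python
from datetime import datetime, timedelta


def drill_metrics(failure_at, detected_at, recovered_at, rto):
    """Summarize a failover test from its three key timestamps."""
    time_to_recover = recovered_at - failure_at
    return {
        "time_to_detect": detected_at - failure_at,
        "time_to_recover": time_to_recover,
        "met_rto": time_to_recover <= rto,
    }


result = drill_metrics(
    failure_at=datetime(2024, 6, 1, 22, 0),
    detected_at=datetime(2024, 6, 1, 22, 3),
    recovered_at=datetime(2024, 6, 1, 22, 19),
    rto=timedelta(minutes=15),
)
```

In this example the alert fired after 3 minutes but recovery took 19 minutes, so the test would be recorded as missing a 15-minute RTO.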
Chaos Engineering for Identity
For mature organizations, introduce controlled identity failures, in production where redundancy has been proven and in a test environment for destructive changes:
- Briefly disable a single domain controller during business hours.
- Simulate MFA provider latency by introducing network delays.
- Revoke a federation signing certificate in a test environment to verify certificate rollover procedures.
Common Pitfalls
Circular Dependencies
The most dangerous DR pitfall is a circular dependency: you need System A to recover System B, but System A depends on System B for authentication. Common examples:
- PAM vault requires AD authentication, but AD recovery credentials are in the vault.
- Backup system requires SSO, but SSO is down.
- Communication tools require identity, so you cannot coordinate recovery.
Audit every step of your recovery procedures for circular dependencies and break them with local accounts, offline credential storage, or alternative communication channels.
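The audit can be partly automated if recovery dependencies are written down as a graph. The sketch below uses a depth-first search to report one cycle if any exists; the step names model the PAM vault example above and are illustrative.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in depends_on.get(node, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # Found a back edge: the cycle is the path from dep to here.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in depends_on:
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical recovery dependencies with a circular dependency.
recovery = {
    "ad_recovery": ["pam_vault"],
    "pam_vault": ["ad_authentication"],
    "ad_authentication": ["ad_recovery"],
}
cycle = find_cycle(recovery)

# Breaking the loop with an offline credential removes the cycle.
fixed = dict(recovery, pam_vault=["offline_local_account"])
no_cycle = find_cycle(fixed)
```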
Ignoring Certificate and Key Expiration
Federation trusts, SAML signing certificates, and OIDC signing keys expire. If a certificate expires during a disaster and you need to rebuild a federation trust, you need access to the certificate management system — which may itself depend on identity services. Maintain offline copies of all identity-related certificates and their renewal procedures.
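An inventory of certificate expiry dates, kept outside the identity system, lets you spot this risk ahead of time. A minimal sketch follows, assuming you maintain a mapping from certificate name to its expiry timestamp; the names, dates, and 60-day window are example values.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certificates, now, window=timedelta(days=60)):
    """Return (name, days_left) for certs expiring within `window`, soonest first.

    Certificates that have already expired appear with negative days_left.
    """
    soon = [
        (name, (not_after - now).days)
        for name, not_after in certificates.items()
        if not_after - now <= window
    ]
    return sorted(soon, key=lambda item: item[1])


# Hypothetical inventory of identity-related certificates.
certificates = {
    "saml_signing": datetime(2025, 1, 20, tzinfo=timezone.utc),
    "oidc_signing": datetime(2026, 1, 1, tzinfo=timezone.utc),
    "federation_trust": datetime(2024, 12, 25, tzinfo=timezone.utc),
}
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
at_risk = expiring_soon(certificates, now)
```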
Over-Relying on Vendor SLAs
Your cloud IdP's 99.99% SLA still permits about 52 minutes of downtime per year (0.01% of 525,600 minutes). That is 52 minutes where your entire organization cannot authenticate. SLAs provide financial credits, not business continuity. Plan for outages that exceed the SLA.
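The downtime figure follows directly from the SLA percentage, and the same arithmetic is useful for comparing vendors or measurement periods. A small sketch, assuming a 365-day year:

```python
def allowed_downtime_minutes(sla_percent, period_days=365):
    """Minutes of downtime permitted by an availability SLA over a period."""
    unavailable_fraction = 1 - sla_percent / 100
    return unavailable_fraction * period_days * 24 * 60


yearly_9999 = round(allowed_downtime_minutes(99.99), 2)       # about 52.56 minutes
yearly_999 = round(allowed_downtime_minutes(99.9), 1)         # about 525.6 minutes
monthly_9999 = round(allowed_downtime_minutes(99.99, 30), 2)  # about 4.32 minutes
```

Dropping from 99.99% to 99.9% multiplies the permitted downtime by ten, to nearly nine hours per year.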
Not Testing with Real Applications
Testing DR by verifying that users can sign in is insufficient. Test that end-to-end application workflows function — can users access data, can APIs authorize requests, can provisioning pipelines execute? The authentication may work, but downstream dependencies may have cached stale state.
Conclusion
Identity disaster recovery is unique because identity is the universal dependency. When identity fails, the blast radius is your entire organization. The investment in redundancy, backup, and tested recovery procedures is justified by the catastrophic cost of an extended identity outage.
Build your DR program in layers: high availability first, then backup and configuration management, then failover procedures, and finally regular testing. Document everything, test frequently, and ruthlessly eliminate circular dependencies. The organizations that recover fastest from identity disasters are those that practiced for them.
Frequently Asked Questions
Q: Should we maintain a secondary IdP for failover? A: It depends on your risk tolerance and budget. A secondary IdP (e.g., on-premises AD FS as backup for Entra ID) provides true failover capability but adds significant complexity and cost. Most organizations rely on their cloud IdP's built-in HA and focus DR efforts on cached credential strategies and break-glass procedures.
Q: How do we back up Entra ID / Okta configurations? A: Use the management APIs (Microsoft Graph, Okta Admin API) to export all configurations as JSON. Automate daily exports to a Git repository. Tools like Entra Exporter, AzureADConfigBackup, and Terraform can help. Treat identity configuration as code.
Q: What is an acceptable RTO for identity services? A: For authentication and authorization services, target 5-15 minutes. Most organizations cannot tolerate longer because every connected application is affected. For less critical identity functions, 1-4 hours (provisioning, self-service password reset) to 24 hours (reporting, analytics) is typically acceptable.
Q: How do we handle identity DR across multiple cloud providers? A: If you federate identity across AWS, Azure, and GCP, your IdP is the single point of failure for all three. Ensure your IdP's DR plan accounts for multi-cloud impact. Consider configuring each cloud provider's emergency access (AWS root account, GCP super admin) independently of your federated IdP.
Q: Should break-glass accounts have MFA? A: This is debated. MFA on break-glass accounts adds security but risks locking you out during an MFA provider outage — which is exactly when you might need break-glass access. The common compromise is to use a hardware FIDO2 key stored in a physical safe, separate from the password, providing MFA without depending on a cloud MFA service.