API Key Rotation Automation Guide
A practical guide to automating API key rotation with zero downtime, covering rotation strategies, secrets management integration, dual-key patterns, monitoring, and rollback procedures.
Static API keys are one of the most common security weaknesses in modern applications. They sit in configuration files, environment variables, CI/CD pipelines, and sometimes source code, unchanged for months or years. When a key is compromised (through a repository leak, a compromised developer machine, or a third-party breach), the exposure window is unbounded: the key has been valid since creation and stays valid until someone notices and revokes it.
Automated key rotation bounds this exposure window. A key that is rotated every 24 hours has a maximum exposure window of about 24 hours, plus the grace period before the old key is revoked. A key that is never rotated is exposed indefinitely. Yet most organizations resist rotation because they fear downtime: the application reads the key at startup, so replacing the key means restarting the application or, worse, risking a failed deployment.
This guide eliminates that fear. You will learn patterns for zero-downtime key rotation that work at any scale, with full automation and monitoring.
What You Will Learn
- API key rotation strategies and the dual-key pattern
- Integrating rotation with secrets management platforms
- Implementing zero-downtime rotation for different application architectures
- Monitoring key usage and detecting rotation failures
- Rollback procedures when rotation goes wrong
Prerequisites
- Secrets management platform — HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or Google Secret Manager.
- API keys to rotate — An inventory of all API keys your applications use, including the provider, the application that uses them, and how the application reads them.
- Provider API access — Admin access to the API providers whose keys you need to rotate (e.g., Stripe, Twilio, SendGrid, or internal APIs).
- Deployment pipeline — A CI/CD pipeline or configuration management system that can trigger key rotation and deploy updated configurations.
- Monitoring and alerting — The ability to detect when an API call fails due to an invalid key.
Architecture Overview
Automated key rotation involves four components working together:
- Secrets Vault: Stores the current and previous versions of each API key. Applications read from the vault rather than from local configuration.
- Rotation Orchestrator: A scheduled process (Lambda function, CronJob, Vault rotation policy) that triggers key rotation on a defined schedule.
- API Provider: The service that issues and revokes API keys (Stripe Dashboard API, AWS IAM, your internal API management platform).
- Application: The consumer of the API key. Must be configured to read keys dynamically from the vault rather than from static configuration.
The dual-key (overlap) pattern is the foundation of zero-downtime rotation:
- Generate a new key at the API provider. Both old and new keys are now valid.
- Store the new key in the vault as the primary version.
- Wait for all application instances to pick up the new key (grace period).
- Revoke the old key at the API provider.
During the grace period, both keys are valid, so any application instance — whether it has picked up the new key or is still using the old one — can authenticate successfully. There is no moment where neither key works.
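To make the overlap concrete, here is a minimal sketch that simulates the four steps against an in-memory stand-in for the provider. `FakeProvider`, `simulate_rotation`, and the key strings are hypothetical names used only for illustration; no real provider API is involved.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProvider:
    """Hypothetical provider that accepts any key in its valid set."""
    valid_keys: set = field(default_factory=set)

    def authenticate(self, key: str) -> bool:
        return key in self.valid_keys


def simulate_rotation(instance_count: int = 3, revoke_early: bool = False) -> list:
    """Run the dual-key steps and record, after each one, whether
    every application instance can still authenticate."""
    provider = FakeProvider(valid_keys={"old-key"})
    vault = {"current": "old-key"}
    instances = ["old-key"] * instance_count  # key cached by each instance
    history = []

    def all_ok() -> bool:
        return all(provider.authenticate(key) for key in instances)

    # 1. Generate a new key; both keys are now valid.
    provider.valid_keys.add("new-key")
    history.append(("create", all_ok()))

    # 2. Store the new key in the vault as the primary version.
    vault["current"] = "new-key"
    history.append(("store", all_ok()))

    if revoke_early:
        # Skipping the grace period breaks instances that still cache the old key.
        provider.valid_keys.discard("old-key")
        history.append(("revoke-early", all_ok()))
        return history

    # 3. Grace period: instances refresh their cached key one at a time.
    for i in range(instance_count):
        instances[i] = vault["current"]
        history.append((f"refresh-{i}", all_ok()))

    # 4. Revoke the old key once every instance has refreshed.
    provider.valid_keys.discard("old-key")
    history.append(("revoke", all_ok()))
    return history
```

Every entry returned by `simulate_rotation()` reports that all instances can authenticate, while `simulate_rotation(revoke_early=True)` ends with a failing entry, which is the outage the grace period exists to prevent.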
Step-by-Step Implementation
Step 1: Inventory and Classify Your API Keys
Create a comprehensive inventory:
api_keys:
  - name: "stripe-api-key"
    provider: "Stripe"
    environment: "production"
    used_by: ["payment-service", "billing-service"]
    vault_path: "secret/prod/stripe/api-key"
    rotation_frequency: "30d"
    supports_dual_key: true
    provider_api: "https://api.stripe.com/v1/api_keys"
    criticality: "high"
  - name: "sendgrid-api-key"
    provider: "SendGrid"
    environment: "production"
    used_by: ["notification-service"]
    vault_path: "secret/prod/sendgrid/api-key"
    rotation_frequency: "90d"
    supports_dual_key: true
    provider_api: "https://api.sendgrid.com/v3/api_keys"
    criticality: "medium"
  - name: "internal-service-api-key"
    provider: "Internal API Gateway"
    environment: "production"
    used_by: ["frontend-bff", "mobile-api"]
    vault_path: "secret/prod/internal/gateway-key"
    rotation_frequency: "7d"
    supports_dual_key: true
    provider_api: "https://gateway.internal/admin/keys"
    criticality: "high"
Classify each key by:
- Dual-key support: Can the provider have two valid keys simultaneously? Many can, but confirm this for each provider; some allow only one active key at a time.
- Rotation frequency: Based on risk, compliance requirements, and operational feasibility.
- Criticality: What happens if the key stops working? Use this to prioritize rotation automation.
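One way to turn this classification into a work order is sketched below. It assumes the inventory has already been loaded into a list of dictionaries with the same field names as the YAML above; `prioritize`, `parse_days`, and `needs_special_handling` are hypothetical helper names, and the ranking rule (criticality first, then shortest rotation interval) is only one reasonable choice.

```python
CRITICALITY_RANK = {"high": 0, "medium": 1, "low": 2}


def parse_days(frequency: str) -> int:
    """Convert a value such as "30d" into a number of days."""
    if not frequency.endswith("d"):
        raise ValueError(f"expected a value like '30d', got {frequency!r}")
    return int(frequency[:-1])


def prioritize(keys: list) -> list:
    """Order keys for automation: most critical first, then shortest interval."""
    return sorted(
        keys,
        key=lambda k: (CRITICALITY_RANK[k["criticality"]],
                       parse_days(k["rotation_frequency"])),
    )


def needs_special_handling(keys: list) -> list:
    """Names of keys whose provider cannot hold two valid keys at once."""
    return [k["name"] for k in keys if not k["supports_dual_key"]]


inventory = [
    {"name": "stripe-api-key", "criticality": "high",
     "rotation_frequency": "30d", "supports_dual_key": True},
    {"name": "sendgrid-api-key", "criticality": "medium",
     "rotation_frequency": "90d", "supports_dual_key": True},
    {"name": "internal-service-api-key", "criticality": "high",
     "rotation_frequency": "7d", "supports_dual_key": True},
]

for entry in prioritize(inventory):
    print(entry["name"])
```

Keys returned by `needs_special_handling` cannot use the overlap pattern and need a separate plan.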
Step 2: Configure Dynamic Secret Retrieval
Before implementing rotation, applications must read keys dynamically — not from static configuration files or environment variables set at deploy time.
Pattern 1: Vault Agent Sidecar (recommended for Kubernetes)
# Kubernetes pod with Vault Agent sidecar injection
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    # Vault Kubernetes auth role; the role name here is an assumption
    vault.hashicorp.com/role: "payment-service"
    # KV v2 paths include "data/" after the mount name
    vault.hashicorp.com/agent-inject-secret-stripe: "secret/data/prod/stripe/api-key"
    vault.hashicorp.com/agent-inject-template-stripe: |
      {{- with secret "secret/data/prod/stripe/api-key" -}}
      STRIPE_API_KEY={{ .Data.data.key }}
      {{- end }}
    # How often the agent re-renders static (KV) secrets
    vault.hashicorp.com/template-static-secret-render-interval: "5m"
spec:
  containers:
    - name: payment-service
      image: payment-service:latest
      # The injector mounts the rendered file at /vault/secrets/stripe,
      # so no explicit volumeMounts entry is needed here.
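With the sidecar pattern the application still has to notice when the rendered file changes. The sketch below shows one possible approach: re-read the file whenever its modification time changes. `FileSecret` is a hypothetical helper, and the `KEY=VALUE` format and file location are assumptions based on the template above.

```python
from pathlib import Path


class FileSecret:
    """Re-reads a KEY=VALUE file whenever its modification time changes."""

    def __init__(self, path: str, name: str):
        self._path = Path(path)
        self._name = name
        self._mtime_ns = None
        self._value = None

    def get(self) -> str:
        mtime_ns = self._path.stat().st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # File was rewritten (or this is the first read): parse it again.
            self._value = self._parse(self._path.read_text())
            self._mtime_ns = mtime_ns
        return self._value

    def _parse(self, text: str) -> str:
        for line in text.splitlines():
            key, sep, value = line.partition("=")
            if sep and key.strip() == self._name:
                return value.strip()
        raise KeyError(f"{self._name} not found in {self._path}")


# Usage (path assumed from the pod annotations):
# stripe_key = FileSecret("/vault/secrets/stripe", "STRIPE_API_KEY")
# headers = {"Authorization": f"Bearer {stripe_key.get()}"}
```

Calling `get()` on every request costs one `stat` call and avoids restarting the process when the key changes.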
Pattern 2: Application-level Vault client
import os
import time

import hvac
import requests


class SecretManager:
    """Reads KV v2 secrets from Vault and caches them for a short TTL."""

    def __init__(self, vault_url, vault_token, cache_ttl=300):
        self.client = hvac.Client(url=vault_url, token=vault_token)
        self._cache = {}
        self._cache_ttl = cache_ttl  # seconds (default: 5 minutes)

    def get_api_key(self, path):
        now = time.time()
        cached = self._cache.get(path)
        if cached and now - cached['timestamp'] < self._cache_ttl:
            return cached['value']
        # Path is relative to the KV v2 mount (hvac defaults to mount_point="secret")
        response = self.client.secrets.kv.v2.read_secret_version(path=path)
        key = response['data']['data']['key']
        self._cache[path] = {'value': key, 'timestamp': now}
        return key


# Usage
secrets = SecretManager(vault_url="https://vault.internal:8200", vault_token=os.environ['VAULT_TOKEN'])


# Every API call reads the current key (with 5-minute cache)
def call_stripe_api(charge_data):
    api_key = secrets.get_api_key("prod/stripe/api-key")
    response = requests.post(
        "https://api.stripe.com/v1/charges",
        headers={"Authorization": f"Bearer {api_key}"},
        data=charge_data,
        timeout=10,
    )
    return response
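The TTL cache above raises if Vault is unreachable at the moment an entry expires. A possible refinement, sketched here under stated assumptions, is to keep serving the last known key for a bounded time when a refresh fails. `StaleOnErrorCache` and its `fetch` callable are hypothetical; `max_stale_seconds` should stay below the rotation grace period, because a key older than that may already have been revoked.

```python
import logging
import time

logger = logging.getLogger("secret-cache")


class StaleOnErrorCache:
    """TTL cache that serves the last known value when a refresh fails."""

    def __init__(self, fetch, ttl_seconds=300, max_stale_seconds=3600,
                 clock=time.monotonic):
        self._fetch = fetch              # hypothetical callable: path -> key string
        self._ttl = ttl_seconds
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._entries = {}               # path -> (value, fetched_at)

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]              # fresh enough, no vault call
        try:
            value = self._fetch(path)
        except Exception:
            # No cached value, or cached value too old to trust: surface the error.
            if entry is None or now - entry[1] > self._max_stale:
                raise
            logger.warning("Vault read failed for %s; serving cached key", path)
            return entry[0]
        self._entries[path] = (value, now)
        return value
```

The injected `clock` exists only so the behavior can be tested without waiting for real time to pass.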
Pattern 3: AWS Secrets Manager with automatic caching
import json

from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

cache_config = SecretCacheConfig(
    max_cache_size=100,
    secret_refresh_interval=300  # Refresh every 5 minutes
)
cache = SecretCache(config=cache_config)


def get_api_key():
    # The secret is stored as a JSON document with a "key" field
    secret = cache.get_secret_string("prod/stripe-api-key")
    return json.loads(secret)["key"]
Step 3: Implement the Rotation Function
The rotation function executes the dual-key pattern:
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("key-rotation")

# `vault`, `task_scheduler`, `verify_new_key_usage`, and `alert_security_team`
# are placeholders for your own vault client, job scheduler, usage check, and
# alerting hook. `provider` is an adapter you write around the provider's API.


def rotate_api_key(secret_name, provider, grace_hours=1):
    """
    Zero-downtime API key rotation using dual-key pattern.
    """
    logger.info(f"Starting rotation for {secret_name}")
    now = datetime.now(timezone.utc)

    # Step 1: Read current key metadata
    current_secret = vault.read(secret_name)
    current_key_id = current_secret["data"]["key_id"]
    current_key = current_secret["data"]["key"]

    # Step 2: Create new key at the provider
    logger.info(f"Creating new key at {provider.name}")
    new_key = provider.create_api_key(
        name=f"{secret_name}-{now.strftime('%Y%m%d%H%M%S')}",
        permissions=provider.get_key_permissions(current_key_id)
    )
    logger.info(f"New key created: {new_key.id}")

    # Step 3: Store new key in vault (both old and new are now valid)
    vault.write(secret_name, {
        "key": new_key.value,
        "key_id": new_key.id,
        "previous_key": current_key,
        "previous_key_id": current_key_id,
        "rotated_at": now.isoformat(),
        "previous_key_revocation_at": (now + timedelta(hours=grace_hours)).isoformat()
    })
    logger.info(f"New key stored in vault for {secret_name}")

    # Step 4: Do not block for the grace period; a separate deferred job
    # revokes the old key once applications have picked up the new one
    schedule_revocation(secret_name, current_key_id, provider, delay_hours=grace_hours)
    logger.info(f"Scheduled revocation of old key {current_key_id} in {grace_hours} hour(s)")
    return {"status": "success", "new_key_id": new_key.id}


def schedule_revocation(secret_name, old_key_id, provider, delay_hours):
    """
    Deferred revocation: revoke the old key after the grace period.
    """
    # This could be a delayed message queue task, a scheduled Lambda, etc.
    task_scheduler.schedule(
        function=revoke_old_key,
        args=(secret_name, old_key_id, provider),
        delay=timedelta(hours=delay_hours)
    )


def revoke_old_key(secret_name, old_key_id, provider):
    """
    Revoke the old key. Verify that new key is being used first.
    """
    # Safety check: the vault must no longer point at the key we are about
    # to revoke (it would if the rotation failed or was rolled back)
    current_secret = vault.read(secret_name)
    if current_secret["data"]["key_id"] == old_key_id:
        logger.error(f"New key not yet active for {secret_name}. Aborting revocation.")
        alert_security_team(f"Rotation incomplete for {secret_name}")
        return

    # Verify at least one successful API call with new key
    if not verify_new_key_usage(secret_name):
        logger.warning(f"No confirmed usage of new key for {secret_name}. Delaying revocation.")
        schedule_revocation(secret_name, old_key_id, provider, delay_hours=1)
        return

    # Revoke old key
    provider.revoke_api_key(old_key_id)
    logger.info(f"Old key {old_key_id} revoked for {secret_name}")

    # Clean up vault metadata: drop the previous_key fields
    vault.write(secret_name, {
        "key": current_secret["data"]["key"],
        "key_id": current_secret["data"]["key_id"],
        "rotated_at": current_secret["data"]["rotated_at"]
    })
Step 4: Configure Rotation Schedules
# Rotation schedule configuration
rotation_schedules:
  - secret: "prod/stripe-api-key"
    provider: "stripe"
    schedule: "0 2 1 * *"  # Monthly at 2 AM on the 1st
    grace_period: "2h"
    alert_on_failure: true
    rollback_on_failure: true
  - secret: "prod/internal-gateway-key"
    provider: "internal_gateway"
    schedule: "0 3 * * 0"  # Weekly at 3 AM Sunday
    grace_period: "1h"
    alert_on_failure: true
    rollback_on_failure: true
  - secret: "prod/sendgrid-api-key"
    provider: "sendgrid"
    schedule: "0 2 1 */3 *"  # Quarterly
    grace_period: "4h"
    alert_on_failure: true
    rollback_on_failure: true
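The `grace_period` strings in this configuration need to become concrete durations before a revocation can be scheduled. A minimal sketch, assuming only the `m`, `h`, and `d` suffixes are used; `parse_duration` and `revocation_time` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse values such as "2h", "30m", or "7d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def revocation_time(rotated_at: datetime, grace_period: str) -> datetime:
    """Earliest moment at which the old key may be revoked."""
    return rotated_at + parse_duration(grace_period)


rotated = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
print(revocation_time(rotated, "2h").isoformat())
```

Rejecting unknown formats with an error, rather than guessing, keeps a typo in the configuration from silently shortening a grace period.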
For AWS Secrets Manager, rotation scheduling is built in; you supply a rotation Lambda function that talks to the key provider:
import boto3

client = boto3.client('secretsmanager')

# Enable automatic rotation
client.rotate_secret(
    SecretId='prod/stripe-api-key',
    # Placeholder ARN; AWS account IDs have 12 digits
    RotationLambdaARN='arn:aws:lambda:us-east-1:123456789012:function:rotate-stripe-key',
    RotationRules={
        # Specify either ScheduleExpression or AutomaticallyAfterDays, not both
        'ScheduleExpression': 'rate(30 days)',
        'Duration': '2h',  # Rotation window
    }
)
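Secrets Manager invokes the rotation Lambda once per step, passing `Step`, `SecretId`, and `ClientRequestToken` in the event. The sketch below shows only the dispatch skeleton; `make_rotation_handler` and the step callables are hypothetical, and the comments describe one plausible mapping of the dual-key pattern onto the four steps rather than a complete implementation.

```python
ROTATION_STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def make_rotation_handler(step_functions: dict):
    """Build a Lambda handler that routes each rotation step to a callable.

    Each callable takes (secret_id, token). A possible mapping:
      createSecret  - create the new key at the provider, store it as the pending version
      setSecret     - often a no-op for API keys (the provider already knows the key)
      testSecret    - make a harmless API call with the pending key
      finishSecret  - promote the pending version to current
    Revoking the old key would still be deferred until after the grace period.
    """
    missing = [step for step in ROTATION_STEPS if step not in step_functions]
    if missing:
        raise ValueError(f"missing step implementations: {missing}")

    def lambda_handler(event, context):
        step = event["Step"]
        if step not in ROTATION_STEPS:
            raise ValueError(f"unknown rotation step: {step}")
        return step_functions[step](event["SecretId"], event["ClientRequestToken"])

    return lambda_handler


# Example wiring with stand-in functions that only record what was called
calls = []
lambda_handler = make_rotation_handler(
    {step: (lambda sid, tok, s=step: calls.append((s, sid, tok)))
     for step in ROTATION_STEPS}
)
```

Failing fast on a missing step at import time surfaces wiring mistakes on deployment instead of in the middle of a rotation.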
Step 5: Implement Monitoring and Alerting
monitoring_rules:
  - name: "API key rotation failure"
    condition: "rotation_job_status == 'failed'"
    severity: "critical"
    action: "page_on_call_engineer"
    description: "Key rotation failed. Old key may not be revoked. Manual intervention required."
  - name: "API key overdue for rotation"
    condition: "key_age > rotation_frequency * 1.5"
    severity: "high"
    action: "alert_security_team"
    description: "Key has not been rotated within expected window."
  - name: "Old key still in use after grace period"
    condition: "old_key_usage_after_grace_period > 0"
    severity: "warning"
    action: "alert_application_team"
    description: "Application instances still using the old key. May indicate deployment issue."
  - name: "API authentication failures spike"
    condition: "api_auth_failure_rate > baseline * 3"
    severity: "critical"
    action: "page_on_call_engineer"
    description: "Spike in API auth failures. May indicate rotation issue or key compromise."
  - name: "Revoked key usage attempt"
    condition: "revoked_key_auth_attempt > 0"
    severity: "high"
    action: "alert_security_team"
    description: "Attempt to use a revoked key. May indicate compromise or misconfigured application."
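The key-age rule (`key_age > rotation_frequency * 1.5`) can be evaluated directly from the `rotated_at` timestamp stored in the vault. A minimal sketch, in which `is_overdue` and its `tolerance` parameter are hypothetical names:

```python
from datetime import datetime, timedelta, timezone


def is_overdue(rotated_at: datetime, rotation_frequency: timedelta,
               now: datetime, tolerance: float = 1.5) -> bool:
    """True when the key's age exceeds its rotation interval times the tolerance."""
    key_age = now - rotated_at
    return key_age > rotation_frequency * tolerance


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
rotated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)  # 60 days earlier
print(is_overdue(rotated_at, timedelta(days=30), now))   # 60 days > 45 days
```

Passing `now` in explicitly, instead of reading the clock inside the function, makes the rule easy to unit test.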
Step 6: Build Rollback Procedures
Rotation can fail. Have a plan:
from datetime import datetime, timezone


def rollback_rotation(secret_name, provider):
    """
    Emergency rollback: restore the previous key if the new key is not working.
    """
    logger.warning(f"Initiating rollback for {secret_name}")
    current_secret = vault.read(secret_name)
    previous_key = current_secret["data"].get("previous_key")
    previous_key_id = current_secret["data"].get("previous_key_id")
    if not previous_key:
        logger.error(f"No previous key available for rollback of {secret_name}")
        alert_security_team(f"Rollback failed for {secret_name}: no previous key")
        return False

    # Verify the previous key is still valid at the provider
    if not provider.validate_key(previous_key_id):
        logger.error(f"Previous key {previous_key_id} is no longer valid. Cannot rollback.")
        alert_security_team(f"Rollback failed for {secret_name}: previous key revoked")
        return False

    # Restore previous key as the active key. The pending deferred revocation
    # then aborts, because the vault points at the old key again.
    vault.write(secret_name, {
        "key": previous_key,
        "key_id": previous_key_id,
        "rolled_back_at": datetime.now(timezone.utc).isoformat(),
        "rollback_reason": "new_key_failure"
    })

    # Revoke the problematic new key. Instances that already cached it will
    # fail until their next cache refresh picks up the restored key.
    new_key_id = current_secret["data"]["key_id"]
    provider.revoke_api_key(new_key_id)
    logger.info(f"Rollback complete for {secret_name}. Restored key {previous_key_id}")
    return True
Configuration Best Practices
- Never hardcode API keys. Every key must be read from a secrets vault at runtime. Use environment variables only as a pointer to the vault path, not to store the key itself.
- Keep both keys valid during rotation. The dual-key overlap period is critical. Never revoke the old key until you have confirmed the new key is working in production.
- Use short grace periods. The grace period should be long enough for all application instances to refresh their cached key (typically 5-15 minutes for sidecar injection, up to 1 hour for manual deployment pipelines).
- Log every rotation event. Record when each key was created, activated, and revoked. This audit trail is essential for incident investigation.
- Test rotation in staging first. Every rotation procedure should be proven in a non-production environment before being applied to production keys.
- Separate rotation from deployment. Key rotation should be independent of application deployments. If you can only rotate keys during a deploy, your rotation frequency is limited to your deploy frequency.
Testing and Validation
- Happy-path rotation test: Trigger a manual rotation in staging. Verify the new key is issued, stored in the vault, picked up by the application, and the old key is revoked — all without downtime.
- Grace period test: Trigger rotation and immediately attempt API calls. Verify that calls succeed with both the old key (during grace period) and the new key.
- Rollback test: Trigger rotation, then simulate a new key failure. Trigger rollback and verify the application recovers using the previous key.
- Provider outage test: Simulate the API provider being unavailable during rotation. Verify the rotation job fails gracefully and the old key remains active.
- Vault outage test: Simulate a vault outage during normal operation. Verify applications use cached keys and continue to function.
- Concurrent rotation test: Trigger two rotations simultaneously for the same key. Verify the system handles the race condition gracefully.
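For the concurrent rotation case, one simple safeguard is to refuse a second rotation of the same secret while the first is still running. The sketch below works only within a single process; `RotationLocks` and `RotationInProgress` are hypothetical names, and an orchestrator running on several hosts would need an external lock instead.

```python
import threading
from contextlib import contextmanager


class RotationInProgress(RuntimeError):
    """Raised when a rotation for the same secret is already running."""


class RotationLocks:
    """Per-secret, non-blocking locks for rotation jobs in one process."""

    def __init__(self):
        self._guard = threading.Lock()
        self._active = set()

    @contextmanager
    def hold(self, secret_name: str):
        with self._guard:
            if secret_name in self._active:
                raise RotationInProgress(secret_name)
            self._active.add(secret_name)
        try:
            yield
        finally:
            # Always release, even if the rotation itself raised.
            with self._guard:
                self._active.discard(secret_name)


locks = RotationLocks()

# Usage:
# with locks.hold("prod/stripe-api-key"):
#     rotate_api_key("prod/stripe-api-key", provider)
```

Failing the second attempt immediately, rather than queueing it, avoids creating two new keys back to back and revoking one that instances have just picked up.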
Common Pitfalls and Troubleshooting
| Problem | Cause | Solution |
|---------|-------|----------|
| Application uses stale key after rotation | Key is read at startup and cached indefinitely | Implement TTL-based cache refresh (5-15 minutes) |
| Downtime during rotation | Old key revoked before all instances picked up new key | Increase grace period; verify new key usage before revoking old key |
| Rotation fails silently | No monitoring on rotation jobs | Add alerts for rotation failures and key age violations |
| Provider rate-limits key creation | Too many rotations or retries | Implement exponential backoff; rotate during off-peak hours |
| Multiple instances reading different key versions | Cache TTLs are not synchronized | Accept this temporarily (dual-key pattern handles it); ensure grace period > max cache TTL |
| Rollback fails because old key was already revoked | Revocation happened before rollback was needed | Extend the old key's validity window; do not revoke until new key is confirmed working |
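The exponential backoff suggested for provider rate limits can be sketched as follows. `RateLimited` is a hypothetical exception that your provider adapter would raise on a rate-limit response, and the delay values are illustrative defaults, not provider recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a provider adapter on a rate-limit response."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the rotation job fail and alert
            # Delay cap doubles each attempt; jitter spreads out retries
            # from concurrent jobs.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())


# Usage (provider adapter assumed):
# new_key = retry_with_backoff(lambda: provider.create_api_key(name="stripe-api-key"))
```

The `sleep` and `rng` parameters are injected so tests can run instantly and deterministically.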
Security Considerations
- Rotation is not a substitute for detection. Rotating a compromised key stops future abuse but does not undo past abuse. Pair rotation with monitoring to detect unauthorized key usage.
- Protect the rotation infrastructure. The rotation function has the ability to create and revoke API keys — it is highly privileged. Secure the rotation function's credentials, limit its network access, and audit its executions.
- Secrets in transit. When the vault sends a key to the application, that transmission must be encrypted (TLS) and authenticated (Vault token). Never transmit keys over unencrypted channels.
- Emergency rotation capability. Beyond scheduled rotation, you must be able to rotate a key immediately when a breach is detected. Build and test an emergency rotation procedure that can be triggered with a single command.
- Third-party provider limitations. Some API providers limit the number of active keys, restrict how quickly you can create and revoke keys, or do not support programmatic key management at all. Document these limitations for each provider.
Conclusion
Automated API key rotation transforms a major security liability into a managed process. The dual-key pattern ensures zero downtime, secrets management platforms centralize control, and monitoring catches failures before they impact production.
Start by migrating applications from static configuration to dynamic secret retrieval. Then implement rotation for your highest-risk keys first — those with the broadest access and the longest age. As you build confidence, extend rotation to all API keys and progressively shorten the rotation interval. The goal is a world where no API key is more than 30 days old, and compromised keys are replaced within minutes.
FAQs
Q: How frequently should I rotate API keys? A: As frequently as operationally feasible. 30 days is a good starting target for most keys. For high-sensitivity keys (production database credentials, payment processor keys), consider 7 days. For internal service-to-service keys, daily or even hourly rotation is achievable with dynamic secrets.
Q: What if the API provider does not support multiple active keys? A: Some providers allow only one key at a time. For these, you must accept a brief interruption or use an API gateway as a proxy that can be updated atomically. Alternatively, negotiate with the provider for multi-key support — it is a common and reasonable request.
Q: Should I rotate keys on a schedule or on-demand? A: Both. Scheduled rotation provides regular hygiene. On-demand rotation provides emergency response capability. Implement scheduled rotation first, then build the on-demand trigger.
Q: What about database credentials? A: Use dynamic database credentials (Vault database secrets engine, AWS RDS IAM authentication) instead of static passwords. Each application instance gets a unique, short-lived credential that is automatically revoked when it expires.
Q: How do I handle API keys in CI/CD pipelines? A: CI/CD pipelines should read keys from the vault at runtime, not from pipeline configuration. Use Vault's AppRole or Kubernetes auth method to authenticate the pipeline to the vault, then read the key for the duration of the pipeline run.