Cloud Security, Backup & Disaster Recovery
Assume failure and attack will occur. Design so the system degrades gracefully, recovers predictably, and leaves evidence. Optimism is not a strategy.
Why security and resilience fail in the cloud
Security prevents incidents. Resilience limits impact. Recovery restores trust. All three must exist by design — not as afterthoughts bolted on when something breaks.
Security controls applied inconsistently
Controls exist in policy documents but not in the platform. Some teams apply them; most do not. The result is a patchwork that provides the appearance of security without the substance.
Backups assumed to work but never tested
Backup jobs run nightly and reports say "success." Nobody has tried restoring. The first real test is during a data loss incident.
Disaster recovery plans exist only on paper
DR documents are written, reviewed annually, and never exercised. When a real failure occurs, the plan does not match the actual infrastructure.
Resilience confused with redundancy
Two instances running is not resilience. Resilience is the ability to degrade gracefully, recover predictably, and continue delivering core value under stress.
Shared responsibility misunderstood
Teams assume the cloud provider handles security. The provider secures the infrastructure; the customer secures everything running on it. Misunderstanding that split is behind many cloud security failures.
Incident response is improvised
When something goes wrong, teams improvise. No runbooks, no ownership, no escalation paths. Panic replaces procedure and the incident gets worse.
The consequences
- Prolonged outages that could have been contained or recovered from in minutes.
- Data loss that cannot be recovered because backups were never validated.
- Regulatory exposure when controls cannot be evidenced to auditors.
- Reputational damage from incidents that were preventable by design.
- Loss of leadership confidence in the engineering organisation.
Cloud security baseline
Security baselines are established at the platform level so teams cannot bypass them. Preventive controls replace after-the-fact security reviews.
Identity and access management
Centralised identity with role-based access, short-lived credentials, and service identities for workloads.
Least-privilege enforcement
Every human and system identity has the minimum permissions required. No shared credentials, no standing admin access.
Network segmentation and isolation
Workloads isolated by risk profile. Lateral movement restricted. Ingress and egress explicitly controlled.
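A minimal sketch of explicit ingress and egress control with default deny, assuming flows are described by direction, peer address, and port. The `AllowRule` type, the `is_allowed` function, and the example subnets are hypothetical and stand in for whatever rule format the platform's network layer uses.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    """One explicit allow-list entry. Anything not matched is denied."""
    direction: str  # "ingress" or "egress"
    cidr: str       # peer network, for example "10.20.0.0/16"
    port: int       # destination port


def is_allowed(rules: list[AllowRule], direction: str, peer_ip: str, port: int) -> bool:
    """Return True only if some rule explicitly allows the flow (default deny)."""
    peer = ip_address(peer_ip)
    return any(
        rule.direction == direction
        and rule.port == port
        and peer in ip_network(rule.cidr)
        for rule in rules
    )


# Hypothetical rules for a high-sensitivity segment.
PAYMENTS_RULES = [
    AllowRule("ingress", "10.20.0.0/16", 443),   # assumed API gateway subnet
    AllowRule("egress", "10.30.5.0/24", 5432),   # assumed database subnet
]
```

An empty rule list denies everything, which is the property that makes the default safe: access has to be added, never removed.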
Encryption in transit and at rest
All data encrypted in transit using modern TLS. Encryption at rest enforced for all storage resources.
Secrets management
No credentials in code, environment variables, or configuration files. All secrets retrieved from a managed store at runtime.
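One way to picture runtime retrieval is a small wrapper that asks the managed store for the secret on demand and holds it in memory only briefly. This is a sketch under assumptions: `RuntimeSecret` is a hypothetical helper, the `fetch` callable stands in for a real secret store client, and the five-minute TTL is an illustrative value.

```python
import time
from typing import Callable, Optional


class RuntimeSecret:
    """Fetches a secret on demand and caches it in memory for a short TTL."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # call into the managed store (assumed)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached value, refetching once the TTL has elapsed."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


# Stand-in for a managed store; a real client would call the provider's API.
_fake_store = {"db-password": "example-value"}
db_password = RuntimeSecret(lambda: _fake_store["db-password"], ttl_seconds=60)
```

Because the value is refetched after the TTL, a rotated secret is picked up without a redeploy, and nothing is written to code, environment variables, or configuration files.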
Secure defaults
New resources are locked down by default. Access is granted explicitly, never inherited from permissive defaults.
Identity as the primary control plane
Many breaches begin with a compromised or over-privileged identity. Shared credentials and standing admin access are prohibited without exception.
- Centralised identity provider — no local accounts on cloud resources.
- Role-based access control with minimum necessary permissions per role.
- Short-lived credentials — no long-lived keys or static API tokens.
- Service identities for workloads — no workload uses a human identity.
- Privileged access logging — all elevated access recorded with timestamps and justification.
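Two of the rules in the list above, short-lived credentials and minimum necessary permissions, lend themselves to automated checks. The sketch below assumes credentials carry an issue and expiry time and that permissions are plain strings; `Credential`, `credential_findings`, `policy_findings`, and the 12-hour maximum are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_LIFETIME = timedelta(hours=12)  # assumed policy value, not from any standard


@dataclass(frozen=True)
class Credential:
    identity: str
    issued_at: datetime
    expires_at: Optional[datetime]  # None means the credential never expires


def credential_findings(cred: Credential) -> list[str]:
    """Flag credentials that are long-lived or outlive the policy maximum."""
    if cred.expires_at is None:
        return ["long-lived credential: no expiry"]
    if cred.expires_at - cred.issued_at > MAX_LIFETIME:
        return ["lifetime exceeds policy maximum"]
    return []


def policy_findings(actions: list[str]) -> list[str]:
    """Flag wildcard permissions, which defeat least privilege."""
    return [f"wildcard permission: {action}" for action in actions if "*" in action]
```

Run against every identity on a schedule, checks like these turn the policy into something the platform can report on rather than something teams are trusted to remember.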
Network trust boundaries
Zero-trust assumptions applied within the platform, not just at the edge.
- Minimise attack surface — expose only what must be publicly accessible.
- Restrict lateral movement — internal traffic controlled at the network layer, not just the application layer.
- Isolate workloads by risk — high-sensitivity systems in dedicated segments.
- Control ingress and egress — explicit allow-lists, default deny.
Threat modelling
- Data exposure — how could data be accessed by unauthorised parties?
- Privilege escalation — how could a low-privilege identity gain elevated access?
- Dependency compromise — what happens if a third-party service or library is compromised?
- Misconfiguration — what configuration mistakes would expose data or allow access?
- Insider risk — what could an authorised user do that they should not be able to do?
Engineering backup, not hoping
Backups are meaningless if they cannot be restored. Every backup has a defined restore scenario.
- Backups are automated — no manual backup process for production data.
- Backup scope is explicit — what is backed up, what is not, and why.
- Retention aligns with risk and regulation — not set to a default and forgotten.
- Backups are encrypted — backup data has the same encryption requirements as live data.
- Access to backups is restricted — backups are not accessible to the same identities that can modify the source data.
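The five rules above can be expressed as a policy check that runs whenever a backup configuration changes. This is a sketch, assuming a backup policy can be described by the fields below; `BackupPolicy` and `violations` are hypothetical names, and the required retention is passed in because it depends on the workload's risk and regulatory context.

```python
from dataclasses import dataclass, field


@dataclass
class BackupPolicy:
    name: str
    automated: bool
    included: list[str]
    excluded: dict[str, str]  # excluded item -> documented reason
    retention_days: int
    encrypted: bool
    backup_admins: set[str] = field(default_factory=set)
    source_writers: set[str] = field(default_factory=set)


def violations(policy: BackupPolicy, required_retention_days: int) -> list[str]:
    """Return every rule the policy breaks. An empty list means compliant."""
    problems = []
    if not policy.automated:
        problems.append("backup is not automated")
    if not policy.included:
        problems.append("backup scope is empty")
    if any(not reason.strip() for reason in policy.excluded.values()):
        problems.append("exclusion without a documented reason")
    if policy.retention_days < required_retention_days:
        problems.append("retention is shorter than required")
    if not policy.encrypted:
        problems.append("backup is not encrypted")
    overlap = policy.backup_admins & policy.source_writers
    if overlap:
        problems.append(f"identities with access to both source and backups: {sorted(overlap)}")
    return problems
```

The overlap check is the one most often missed: if the identity that can corrupt or delete the source can also reach the backups, a single compromise takes out both.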
Restore testing requirements
- Periodic restore testing — at a defined frequency, backups are actually restored.
- Verification of data integrity — restored data is validated, not just assumed complete.
- Documented recovery time — restore tests record how long recovery actually takes against RTO targets.
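A restore test can record all three requirements in one pass: it performs the restore, validates what came back, and times the result against the target. The sketch below assumes the restored data can be compared to a known SHA-256 checksum, which is a simplification; a database restore would more likely run integrity queries. `run_restore_test` and `RestoreResult` are hypothetical names.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RestoreResult:
    integrity_ok: bool       # restored bytes match the expected checksum
    duration_seconds: float  # measured time taken by the restore call
    within_rto: bool         # measured time is within the RTO target


def run_restore_test(
    restore: Callable[[], bytes],
    expected_sha256: str,
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RestoreResult:
    """Run a restore, verify the result, and compare the duration to the RTO."""
    started = clock()
    restored = restore()
    duration = clock() - started
    digest = hashlib.sha256(restored).hexdigest()
    return RestoreResult(
        integrity_ok=(digest == expected_sha256),
        duration_seconds=duration,
        within_rto=(duration <= rto_seconds),
    )
```

Storing each `RestoreResult` alongside the date and the backup it was taken from gives a record of how long recovery actually takes, rather than how long the plan says it should.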
DR designed from business impact
Over-engineering DR wastes money. Under-engineering destroys trust. Strategy is matched to workload criticality.
Backups with cold restore
Lowest cost, highest RTO. Suitable for non-critical systems where hours of recovery are acceptable.
Warm standby
Infrastructure pre-provisioned but not serving traffic. Faster recovery than cold restore, lower cost than active-active.
Active-active architecture
Traffic distributed across multiple regions. Near-zero RTO. Highest cost. Required only for tier-0 systems.
Regional failover
Secondary region ready to receive traffic. Automated or manual promotion. Balances cost and recovery speed.
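Matching strategy to criticality can be written down as a simple decision rule. The sketch below is illustrative only: the RTO thresholds and the relative ordering of warm standby and regional failover are assumptions made for the example, and `choose_strategy` is a hypothetical function. Real thresholds come out of the business impact analysis for each workload.

```python
from enum import Enum


class Strategy(Enum):
    COLD_RESTORE = "backups with cold restore"
    WARM_STANDBY = "warm standby"
    REGIONAL_FAILOVER = "regional failover"
    ACTIVE_ACTIVE = "active-active"


def choose_strategy(tier: int, rto_minutes: float) -> Strategy:
    """Pick the cheapest strategy that meets the workload's recovery target."""
    if tier == 0:
        # Tier-0 systems are the only ones that justify active-active cost.
        return Strategy.ACTIVE_ACTIVE
    if rto_minutes <= 15:    # assumed threshold
        return Strategy.REGIONAL_FAILOVER
    if rto_minutes <= 240:   # assumed threshold
        return Strategy.WARM_STANDBY
    # Hours of recovery are acceptable, so the lowest-cost option is enough.
    return Strategy.COLD_RESTORE
```

Encoding the rule makes the trade-off reviewable: a workload that asks for a faster strategy than its tier and RTO justify has to argue for a different classification, not a bigger budget.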
Beyond disaster recovery
Resilience is not just about disasters. The system must continue delivering core value under stress.
- Partial failures — one component failing should not cascade to the whole system.
- Dependency outages — external dependencies can be unavailable without causing total failure.
- Traffic spikes — sudden load increases are absorbed, not passed directly to downstream systems.
- Degraded modes — the system continues delivering core value when non-essential features are unavailable.
- Graceful feature degradation — lower-priority features are disabled under load before core features are affected.
Resilience testing
- Controlled failure injection — deliberately terminate instances, block network paths, and exhaust resources to observe how the system responds.
- Dependency disruption tests — simulate third-party service failures to validate that fallbacks and timeouts work.
- Recovery validation — confirm that automated recovery mechanisms activate and that systems return to steady state without manual intervention.
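The fallback behaviour that a dependency disruption test is meant to validate is commonly built as a circuit breaker. The sketch below is one minimal form of that pattern, not a prescribed implementation: `CircuitBreaker`, its thresholds, and the injected `clock` are hypothetical, and a production version would also need timeouts and thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A disruption test then has something concrete to assert: while the dependency is down, callers receive the fallback and the dependency stops being called; once it returns, traffic resumes without manual intervention.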
Panic replaced with procedure
- Incidents are detectable, with alert coverage in place before an incident occurs.
- Alerts are actionable — every alert has a defined response procedure.
- Ownership is clear — every system has a named on-call owner.
- Escalation paths exist — when the first responder cannot resolve, escalation is automatic and documented.
- Response actions are documented — runbooks exist, are current, and are tested periodically.
Audit evidence
- Access logs — who accessed what, when, from where.
- Configuration history — what the configuration was at any point in time.
- Backup records — when backups ran, what was captured, and whether they succeeded.
- Restore evidence — when restores were tested, what was restored, and the outcome.
- Incident reports — what happened, what was done, and what changed as a result.
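Evidence is only useful if gaps in it are visible before an auditor finds them. As a sketch, restore evidence can be kept as structured records and checked for systems with no recent successful test. `RestoreEvidence`, `evidence_gaps`, and the 90-day window are hypothetical; the window should match the restore testing frequency the organisation has actually defined.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RestoreEvidence:
    system: str
    tested_on: date      # when the restore was tested
    restored_item: str   # what was restored
    succeeded: bool      # the outcome


def evidence_gaps(
    systems: list[str],
    evidence: list[RestoreEvidence],
    today: date,
    max_age_days: int = 90,  # assumed testing interval
) -> dict[str, str]:
    """Return systems lacking a recent successful restore test, with the reason."""
    gaps: dict[str, str] = {}
    for system in systems:
        passed = [e.tested_on for e in evidence if e.system == system and e.succeeded]
        if not passed:
            gaps[system] = "no successful restore test on record"
        elif today - max(passed) > timedelta(days=max_age_days):
            gaps[system] = f"last successful restore test was {max(passed).isoformat()}"
    return gaps
```

The same shape works for the other evidence types: a typed record per event, plus a check that reports which systems have nothing recent to show.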
Who owns what
Cloud provider
Physical infrastructure, hypervisor, network fabric, and the managed services layer. The provider does not secure your applications, data, or identities.
Platform team
Baseline security controls, network design, identity policies, encryption standards, monitoring infrastructure, and incident response tooling.
Product teams
Application-level security, input validation, data handling, dependency management, and correct use of platform-provided controls.
What creates fragile systems
Trusting provider defaults blindly
Provider defaults are often optimised for ease of use rather than security, and a default configuration can leave a resource reachable from the internet.
Assuming backups work
Backup jobs can complete with "success" status while capturing incomplete or corrupted data. An untested backup is an untrusted backup.
DR plans never tested
Plans are written against an earlier version of the infrastructure and tested for the first time during an actual incident. The result is improvised recovery with no baseline to measure against.
All systems treated equally
Applying the same DR investment to every system is wasteful for low-criticality workloads and dangerously insufficient for business-critical ones.
Security reviews without enforcement
Security reviews that produce recommendations without controls leave compliance to individual team discipline. Under delivery pressure, security work tends to be deferred.
Resilience added after incidents
Retrofitting resilience usually costs far more than designing for it. An outage that reveals a resilience gap points to a weakness that could have been found and addressed at design time.
What we produce
- Cloud security baseline architecture with preventive controls by default.
- Threat model and mitigation map for each workload tier.
- Backup and restore strategy with scope, retention, and encryption requirements.
- Disaster recovery architecture matched to RTO/RPO requirements per workload.
- Resilience patterns and guidelines for service and dependency design.
- Incident response playbooks with ownership, escalation, and runbook structure.
- Audit-ready security evidence framework covering access, configuration, and recovery.