Cloud, DevOps & Platform Engineering

Cloud Security, Backup & Disaster Recovery

Assume failure and attack will occur. Design so the system degrades gracefully, recovers predictably, and leaves evidence. Optimism is not a strategy.

The Problem

Why security and resilience fail in the cloud

Security prevents incidents. Resilience limits impact. Recovery restores trust. All three must exist by design — not as afterthoughts bolted on when something breaks.

Security controls applied inconsistently

Controls exist in policy documents but not in the platform. Some teams apply them; most do not. The result is a patchwork that provides the appearance of security without the substance.

Backups assumed to work but never tested

Backup jobs run nightly and reports say "success." Nobody has tried restoring. The first real test is during a data loss incident.

Disaster recovery plans exist only on paper

DR documents are written, reviewed annually, and never exercised. When a real failure occurs, the plan does not match the actual infrastructure.

Resilience confused with redundancy

Two instances running is not resilience. Resilience is the ability to degrade gracefully, recover predictably, and continue delivering core value under stress.

Shared responsibility misunderstood

Teams assume the cloud provider handles security. The provider secures the infrastructure; the customer secures everything running on it. Misunderstanding that split is behind many cloud security failures.

Incident response is improvised

When something goes wrong, teams improvise. No runbooks, no ownership, no escalation paths. Panic replaces procedure and the incident gets worse.

The consequences

Prolonged outages that could have been contained or recovered from in minutes.
Data loss that cannot be recovered because backups were never validated.
Regulatory exposure when controls cannot be evidenced to auditors.
Reputational damage from incidents that were preventable by design.
Loss of leadership confidence in the engineering organisation.

Security by Design

Cloud security baseline

Security baselines established at the platform level so teams cannot bypass them. Security reviews are replaced with preventive controls.

Identity and access management

Centralised identity with role-based access, short-lived credentials, and service identities for workloads.

Least-privilege enforcement

Every human and system identity has the minimum permissions required. No shared credentials, no standing admin access.

Network segmentation and isolation

Workloads isolated by risk profile. Lateral movement restricted. Ingress and egress explicitly controlled.

Encryption in transit and at rest

All data encrypted in transit using modern TLS. Encryption at rest enforced for all storage resources.

Secrets management

No credentials in code, environment variables, or configuration files. All secrets retrieved from a managed store at runtime.
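
One way to back this rule with a preventive control is a scan that rejects configuration containing literal secrets. The sketch below is illustrative only: the keyword list, the `${...}` placeholder convention for managed-store references, and the function name `find_hardcoded_secrets` are all assumptions, not part of any specific tool.

```python
import re

# Assumed keyword list; a real scanner would be tuned to the codebase.
SUSPECT = re.compile(
    r"""(?i)(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*['"]?([^\s'"]+)"""
)


def find_hardcoded_secrets(text):
    """Return (line_number, keyword) pairs for lines that look like literal secrets.

    Values starting with "${" are treated as references to a managed
    secret store (a hypothetical convention) and are not reported.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SUSPECT.search(line)
        if match and not match.group(2).startswith("${"):
            hits.append((lineno, match.group(1).lower()))
    return hits
```

Run in a pipeline, a non-empty result would fail the build, so the rule is enforced rather than requested.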

Secure defaults

New resources are locked down by default. Access is granted explicitly, never inherited from permissive defaults.

Identity

Identity as the primary control plane

Most breaches begin with identity. Shared credentials and standing admin access are prohibited without exception.

Centralised identity provider — no local accounts on cloud resources.
Role-based access control with minimum necessary permissions per role.
Short-lived credentials — no long-lived keys or static API tokens.
Service identities for workloads — no workload uses a human identity.
Privileged access logging — all elevated access recorded with timestamps and justification.
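
The short-lived credential rule can be checked mechanically. The following is a minimal sketch, assuming an inventory that maps each credential name to its issue time; the 12-hour limit and the names `MAX_CREDENTIAL_AGE` and `stale_credentials` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy value; the right limit depends on the platform.
MAX_CREDENTIAL_AGE = timedelta(hours=12)


def stale_credentials(credentials, now=None):
    """Return the sorted names of credentials older than MAX_CREDENTIAL_AGE.

    `credentials` maps a credential name to its timezone-aware issue time.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, issued_at in credentials.items()
        if now - issued_at > MAX_CREDENTIAL_AGE
    )
```

Anything this returns is a long-lived key that should be rotated or replaced with a workload identity.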

Network Security

Network trust boundaries

Zero-trust assumptions applied within the platform, not just at the edge.

Minimise attack surface — expose only what must be publicly accessible.
Restrict lateral movement — internal traffic controlled at the network layer, not just the application layer.
Isolate workloads by risk — high-sensitivity systems in dedicated segments.
Control ingress and egress — explicit allow-lists, default deny.
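
"Explicit allow-lists, default deny" reduces to a simple evaluation rule: traffic passes only if some rule names it. A provider-neutral sketch, in which `AllowRule` and `is_allowed` are illustrative names rather than any cloud API:

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    source: str  # CIDR block allowed to connect, e.g. "10.0.1.0/24"
    port: int    # destination port the rule opens


def is_allowed(rules, source_ip, port):
    """Default deny: permit only when an explicit rule matches source and port."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(rule.source) and port == rule.port
        for rule in rules
    )
```

An empty rule list denies everything, which is the secure default the section asks for.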

Threat Modelling

Data exposure — how could data be accessed by unauthorised parties?
Privilege escalation — how could a low-privilege identity gain elevated access?
Dependency compromise — what happens if a third-party service or library is compromised?
Misconfiguration — what configuration mistakes would expose data or allow access?
Insider risk — what could an authorised user do that they should not be able to do?

Backup Strategy

Engineering backup, not hoping

Backups are meaningless if they cannot be restored. Every backup has a defined restore scenario.

Backups are automated — no manual backup process for production data.
Backup scope is explicit — what is backed up, what is not, and why.
Retention aligns with risk and regulation — not set to a default and forgotten.
Backups are encrypted — backup data has the same encryption requirements as live data.
Access to backups is restricted — backups are not accessible to the same identities that can modify the source data.

Restore testing requirements

Periodic restore testing — at a defined frequency, backups are actually restored.
Verification of data integrity — restored data is validated, not just assumed complete.
Documented recovery time — restore tests record how long recovery actually takes against RTO targets.
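
The three requirements above can be captured in one test harness. This is a sketch under stated assumptions: `restore_fn` stands in for whatever actually performs the restore, and the expected checksum is assumed to have been recorded when the backup was taken.

```python
import hashlib
import time


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest used as the integrity fingerprint."""
    return hashlib.sha256(data).hexdigest()


def run_restore_test(restore_fn, expected_sha256, rto_seconds):
    """Run a restore, verify integrity, and compare elapsed time with the RTO."""
    started = time.monotonic()
    restored = restore_fn()  # hypothetical callable returning the restored bytes
    elapsed = time.monotonic() - started
    return {
        "integrity_ok": sha256_of(restored) == expected_sha256,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

Storing each result alongside the date of the test also provides the restore evidence that auditors ask for.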

Disaster Recovery

DR designed from business impact

Over-engineering DR wastes money. Under-engineering destroys trust. Strategy is matched to workload criticality.

Recovery Time Objective (RTO) — how long can the system be unavailable?

Recovery Point Objective (RPO) — how much data loss is acceptable?

Workload criticality — does this workload block revenue, operations, or compliance?

Data sensitivity — does recovery require special handling or access controls?

Cost tolerance — what is the business willing to spend to achieve the RTO/RPO?
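
RPO translates directly into arithmetic on the backup schedule. As a simplified sketch, assuming periodic backups that capture state at the moment they start and become usable only when they finish, the worst-case loss window is one interval plus one backup duration. The function names are illustrative.

```python
def worst_case_data_loss_minutes(backup_interval_minutes, backup_duration_minutes=0):
    """Worst-case loss: failure just before the next backup becomes usable."""
    return backup_interval_minutes + backup_duration_minutes


def meets_rpo(backup_interval_minutes, rpo_minutes, backup_duration_minutes=0):
    """True when the schedule's worst-case loss fits inside the RPO."""
    worst = worst_case_data_loss_minutes(backup_interval_minutes, backup_duration_minutes)
    return worst <= rpo_minutes
```

For example, a nightly backup (1440 minutes) cannot satisfy a 60-minute RPO, however reliably the job runs.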

Backups with cold restore

Lowest cost, highest RTO. Suitable for non-critical systems where hours of recovery are acceptable.

Warm standby

Infrastructure pre-provisioned but not serving traffic. Faster recovery than cold restore, lower cost than active-active.

Active-active architecture

Traffic distributed across multiple regions. Near-zero RTO. Highest cost. Required only for tier-0 systems.

Regional failover

Secondary region ready to receive traffic. Automated or manual promotion. Balances cost and recovery speed.
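
Matching strategy to criticality can be written down as an explicit rule so the choice is reviewable. The sketch below is purely illustrative: the thresholds are hypothetical, and placing regional failover ahead of warm standby in recovery speed is an assumption that depends on how each is implemented.

```python
def choose_dr_strategy(rto_minutes: float, tier: int) -> str:
    """Pick the cheapest strategy assumed to meet the RTO (thresholds are examples)."""
    if tier == 0 and rto_minutes <= 1:
        return "active-active"          # reserved for tier-0 systems
    if rto_minutes <= 15:
        return "regional failover"
    if rto_minutes < 120:
        return "warm standby"
    return "backups with cold restore"  # hours of recovery are acceptable
```

The value of encoding the rule is less the specific numbers than the fact that every workload gets a deliberate, documented answer.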

Resilience Engineering

Beyond disaster recovery

Resilience is not just about disasters. The system must continue delivering core value under stress.

Partial failures — one component failing should not cascade to the whole system.
Dependency outages — external dependencies can be unavailable without causing total failure.
Traffic spikes — sudden load increases are absorbed, not passed directly to downstream systems.
Degraded modes — the system continues delivering core value when non-essential features are unavailable.
Graceful feature degradation — lower-priority features are disabled under load before core features are affected.
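
Graceful feature degradation can be expressed as a shedding table: each feature has a load level above which it is switched off, and core features have none. A minimal sketch with hypothetical feature names and thresholds:

```python
import math

# Maximum load ratio at which each feature stays enabled (assumed values).
# Core features use infinity so they are never shed by this rule.
SHED_THRESHOLDS = {
    "checkout": math.inf,
    "search": 0.9,
    "recommendations": 0.7,
}


def enabled_features(load_ratio, thresholds=SHED_THRESHOLDS):
    """Return the sorted features that remain enabled at the given load ratio."""
    return sorted(
        name for name, limit in thresholds.items() if load_ratio <= limit
    )
```

As load rises, recommendations are dropped first, then search, while checkout keeps working.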

Chaos testing

Controlled failure injection — deliberately terminate instances, block network paths, and exhaust resources to observe how the system responds.
Dependency disruption tests — simulate third-party service failures to validate that fallbacks and timeouts work.
Recovery validation — confirm that automated recovery mechanisms activate and that systems return to steady state without manual intervention.
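
A dependency disruption test can start very small: inject a failure into a stand-in for the dependency, confirm the fallback is used, then confirm normal behaviour returns once the fault is removed. In this sketch `FlakyDependency` and `recommendations_or_default` are hypothetical names, not a real client library.

```python
class FlakyDependency:
    """Stand-in for a third-party client; set `fail` to simulate an outage."""

    def __init__(self):
        self.fail = False

    def fetch_recommendations(self, user_id):
        if self.fail:
            raise ConnectionError("injected outage")
        return ["item-1", "item-2"]


def recommendations_or_default(dependency, user_id):
    """Serve an empty list (degraded mode) when the dependency is unavailable."""
    try:
        return dependency.fetch_recommendations(user_id)
    except ConnectionError:
        return []


def run_disruption_test():
    """Inject a failure, check the fallback, then check recovery to steady state."""
    dep = FlakyDependency()
    baseline = recommendations_or_default(dep, "u1")
    dep.fail = True
    degraded = recommendations_or_default(dep, "u1")
    dep.fail = False
    recovered = recommendations_or_default(dep, "u1")
    return baseline, degraded, recovered
```

Real chaos experiments inject faults at the infrastructure level, but the same three checks apply: steady state, degraded behaviour, and recovery without manual intervention.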

Incident Response

Panic replaced with procedure

Incidents are detectable: alert coverage exists before an incident occurs.
Alerts are actionable — every alert has a defined response procedure.
Ownership is clear — every system has a named on-call owner.
Escalation paths exist — when the first responder cannot resolve, escalation is automatic and documented.
Response actions are documented — runbooks exist, are current, and are tested periodically.
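
Several of these properties can be linted rather than trusted. The sketch below assumes alert definitions are available as dictionaries with `owner`, `runbook`, and `escalation` fields; that schema and the name `alert_gaps` are hypothetical.

```python
REQUIRED_FIELDS = ("owner", "runbook", "escalation")


def alert_gaps(alerts):
    """Return (alert_name, missing_fields) for alerts that are not actionable."""
    gaps = []
    for alert in alerts:
        # A field counts as missing if it is absent, None, or empty.
        missing = [field for field in REQUIRED_FIELDS if not alert.get(field)]
        if missing:
            gaps.append((alert["name"], missing))
    return gaps
```

Running a check like this whenever alert definitions change keeps ownership and runbook coverage from drifting.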

Evidence and compliance

Access logs — who accessed what, when, from where.
Configuration history — what the configuration was at any point in time.
Backup records — when backups ran, what was captured, and whether they succeeded.
Restore evidence — when restores were tested, what was restored, and the outcome.
Incident reports — what happened, what was done, and what changed as a result.

Shared Responsibility

Who owns what

Cloud provider

Physical infrastructure, hypervisor, network fabric, and the managed services layer. The provider does not secure your applications, data, or identities.

Platform team

Baseline security controls, network design, identity policies, encryption standards, monitoring infrastructure, and incident response tooling.

Product teams

Application-level security, input validation, data handling, dependency management, and correct use of platform-provided controls.

Anti-Patterns

What creates fragile systems

Trusting provider defaults blindly

Cloud providers optimise defaults for ease of use, not security. Default configurations regularly expose resources to the internet.

Assuming backups work

Backup jobs can complete with "success" status while capturing incomplete or corrupted data. An untested backup is an untrusted backup.

DR plans never tested

Plans written against an earlier version of the infrastructure. Tested during an actual incident for the first time. The result is improvised recovery with no baseline to measure against.

All systems treated equally

Applying the same DR investment to every system is wasteful for low-criticality workloads and dangerously insufficient for business-critical ones.

Security reviews without enforcement

Security reviews that produce recommendations without controls leave compliance to individual team discipline. Under delivery pressure, security is always deferred.

Resilience added after incidents

Retrofitting resilience is far more expensive than designing for it. An outage that reveals a resilience gap usually traces back to a design decision that could have been made differently.

Deliverables

What we produce

Cloud security baseline architecture with preventive controls by default.
Threat model and mitigation map for each workload tier.
Backup and restore strategy with scope, retention, and encryption requirements.
Disaster recovery architecture matched to RTO/RPO requirements per workload.
Resilience patterns and guidelines for service and dependency design.
Incident response playbooks with ownership, escalation, and runbook structure.
Audit-ready security evidence framework covering access, configuration, and recovery.

Related Services

Connected disciplines

Infrastructure as Code

SRE & Reliability Engineering

Platform Engineering & CI/CD

Containerization & Kubernetes

Start a Conversation

Engineer security and resilience into your platform

We design cloud security baselines, backup strategies, and disaster recovery architectures that withstand failure, attack, and human error — and recover predictably when they occur.