Container security gets treated like a checklist. Run a scanner. Enable some admission policies. Call it done. But when you're actually responsible for the environments — when you're the one getting paged at 2am — you start to realize that the checklist approach misses the point entirely.
This post is about the architectural decisions that matter when you're building container security into real infrastructure. Not the theoretical ideals, but the trade-offs you actually face when working across ECS Fargate in a large enterprise, Kubernetes in a training environment, and standalone Docker in a resource-constrained healthcare context.
Start with the threat model, not the tool
The first mistake I see teams make is reaching for tooling before they understand what they're actually protecting against. Container security means very different things depending on your environment:
- In a managed environment like ECS Fargate, you don't control the underlying host — your threat surface shifts almost entirely to the image and the runtime configuration.
- In Kubernetes, you have significantly more attack surface: the control plane, etcd, node security, inter-pod network policies, RBAC — the list is long.
- In standalone Docker, you're typically dealing with smaller teams, less enforcement tooling, and a higher likelihood that security is entirely manual.
Knowing which environment you're in shapes every decision downstream. A control that's essential in Kubernetes may not even apply in Fargate, and vice versa. Spend the time to model your actual threat surface before you start building controls.
Key Principle
The container runtime isn't your biggest risk. Your biggest risk is what's inside the image you're running — and the process that put it there.
Image vulnerabilities: the supply chain problem nobody takes seriously enough
Of all the container security problems I've worked through, image vulnerabilities and supply chain integrity are consistently the most underestimated. Not because people don't know they're important — everyone knows they're important — but because the organizational friction of actually enforcing controls here is significant.
The issue isn't scanning. Scanning is easy. The issue is what you do with scan results, who owns the remediation, and whether you have the organizational leverage to actually block deployments on critical findings.
The three supply chain failures I see most often
1. Base image sprawl with no ownership. Teams pull whatever base image is convenient — ubuntu:latest, python:3.9, node:18 — with no consideration for who's responsible for keeping those images updated. Six months later, you have 40 different base images across your environment, many of which haven't been rebuilt since the original deployment. Every one of them is accumulating CVEs.
2. Scanning that runs but doesn't gate. CI/CD pipelines get a scanner bolted on, findings get reported to a dashboard somewhere, and then nothing happens. The scanner is running, the findings are visible, but there's no enforcement gate that actually prevents a critically vulnerable image from reaching production. This gives the impression of security without any of the actual protection.
3. Assuming the registry is the trust boundary. "We only pull from our internal registry" sounds like a good control until you realize the internal registry is a mirror of Docker Hub images that were pulled, scanned once at import time, and never rescanned. The registry isn't a trust boundary — it's a storage location. The trust boundary is the current vulnerability state of what's in it.
Common Mistake
Scanning at build time without rescanning in the registry creates a false sense of security. A clean image today will have new CVEs in 60 days. Your scanning strategy needs to account for images aging in the registry.
Building an actual supply chain control architecture
Here's the architecture pattern I've landed on after working through these problems in different environments. It's not revolutionary — the components are well-known — but the key is the enforcement chain, not any individual piece.
Layer 1: Curated, owned base images
Define a small set of approved base images that your organization owns and maintains. This means a dedicated pipeline that rebuilds these images on a schedule — not just when someone remembers — and pushes updated versions to your internal registry automatically. Teams inherit from these base images, not from public registries directly.
# DON'T: Pull directly from public registry
FROM python:3.11-slim
# DO: Inherit from your org's maintained base image
FROM registry.internal.yourorg.com/base/python:3.11-slim-hardened
# Base image handles: OS patching, user config,
# non-root defaults, removed unnecessary packages
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
USER nonroot
Layer 2: Scan at build, block on critical
Integrate scanning into your CI/CD pipeline with a hard gate on critical and high severity findings. The key word is "block" — not "report." If a build produces an image with an unacknowledged critical CVE, the pipeline fails. No exceptions without a documented, time-bounded waiver.
The tooling here matters less than the enforcement. Trivy, Grype, Snyk, AWS ECR scanning — pick one your team will actually use and wire it to your pipeline gate. In ECS Fargate environments, ECR's native scanning integrates cleanly. In more heterogeneous environments, a tool like Trivy that runs anywhere is more practical.
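The gating logic itself is simple; what follows is a minimal sketch assuming Trivy's JSON report format (`Results` → `Vulnerabilities` → `Severity`/`VulnerabilityID`). The waiver list here is a hypothetical stand-in for your documented, time-bounded waiver process:

```python
import json
import sys

# Severities that fail the build outright.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}

def blocking_findings(report: dict, waived_cves: set) -> list:
    """Return CVE IDs that should fail the pipeline.

    `report` is a parsed Trivy JSON report; `waived_cves` holds CVE IDs
    with an approved, time-bounded waiver.
    """
    blocking = []
    for result in report.get("Results", []):
        # Trivy emits null instead of [] when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            cve = vuln.get("VulnerabilityID", "")
            if vuln.get("Severity") in BLOCKING_SEVERITIES and cve not in waived_cves:
                blocking.append(cve)
    return blocking

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: gate.py report.json [waived CVE IDs...]
    # where report.json came from: trivy image --format json -o report.json <image>
    with open(sys.argv[1]) as f:
        report = json.load(f)
    found = blocking_findings(report, set(sys.argv[2:]))
    if found:
        print("Blocking findings: " + ", ".join(found))
        sys.exit(1)  # hard gate: the build fails
```

The point of structuring it this way is that the waiver set is an explicit input — exceptions are visible in the pipeline invocation, not buried in scanner configuration.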
Layer 3: Continuous registry scanning
Build-time scanning is table stakes. The harder problem is images that pass a clean scan at build time but accumulate CVEs as they sit in the registry waiting for deployment — or worse, after they're already running in production.
Your registry needs continuous or scheduled rescanning of stored images. ECR Enhanced Scanning does this natively. For other registries, a scheduled scan job against the full registry catalog isn't glamorous, but it closes the gap.
# Simplified pattern for a scheduled registry audit.
# Run daily against all image tags in production repos.
# get_production_images, trivy_scan, alert_security_team,
# create_ticket, and flag_for_rebuild are placeholders for your
# registry API, scanner wrapper, and ticketing integrations.
for image in get_production_images():
    results = trivy_scan(image, severity=["CRITICAL", "HIGH"])
    if results.critical_count > 0:
        # Alert and create a remediation ticket with an SLA
        alert_security_team(image, results)
        create_ticket(image, results, sla_days=7)
    if results.image_age_days > 90:
        # Flag for rebuild regardless of CVE state
        flag_for_rebuild(image)
Layer 4: Runtime admission control
In Kubernetes environments, admission controllers like OPA/Gatekeeper or Kyverno let you enforce image policies at the point of deployment — blocking pods that reference images without a valid scan attestation, or that don't come from your approved registry. This is a powerful enforcement layer that catches anything that slips through earlier gates.
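As an illustration of the approved-registry control, a Kyverno policy along these lines rejects any pod whose image comes from somewhere other than your internal registry (the hostname matches the hypothetical one used earlier; adapt the pattern to your own naming):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce   # block, don't just audit
  rules:
    - name: approved-registry-only
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from the approved internal registry."
        pattern:
          spec:
            containers:
              - image: "registry.internal.yourorg.com/*"
```

Note `validationFailureAction: Enforce` — the same "block, not report" principle as the build gate. Running the policy in audit mode first is a reasonable rollout step, but leaving it there permanently recreates the dashboard problem.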
In ECS Fargate, you don't have the same admission controller model, but you can enforce similar controls through IAM policies on task execution roles and ECR repository policies that restrict which images can be pulled.
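A sketch of what that looks like as an IAM policy on the task execution role — the region, account ID, and `approved/` repository prefix are placeholders for your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PullFromApprovedReposOnly",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/approved/*"
    },
    {
      "Sid": "AuthTokenRequiresWildcard",
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}
```

Because there is no allow statement for repositories outside the `approved/` prefix, a task definition referencing any other image fails at pull time. It's coarser than a Kubernetes admission controller, but it is a hard enforcement point rather than an advisory one.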
Standalone Docker environments are the hardest to enforce programmatically — you're typically relying more heavily on process controls, approved image lists, and manual review. This is a real limitation that's worth acknowledging rather than pretending you have the same control surface as a managed platform.
The organizational problem is harder than the technical one
I've spent more of this post on technical controls than on organizational dynamics, but honestly, the organizational side is where container supply chain security usually breaks down in practice.
The specific failure mode is ownership ambiguity. Security teams can build all the scanning infrastructure they want, but if there's no clear answer to "who remediates this finding in this service?" — nothing gets fixed. Findings accumulate in dashboards, technical debt compounds, and the next time you're asked to demonstrate your security posture, you're looking at hundreds of unacknowledged CVEs across dozens of images.
- Define clear ownership for every image in your registry — a team, not just a repo.
- Establish SLAs for remediation by severity and enforce them the same way you'd enforce any other security finding.
- Make the remediation path as frictionless as possible — if rebuilding an image requires five manual steps, it won't happen promptly.
- Separate the "we found a vulnerability" conversation from the "here's your approved waiver process" conversation. Both need to exist.
TL;DR
The architecture that actually works
Container supply chain security isn't a single control — it's a chain: owned base images → build-time scanning with hard gates → continuous registry rescanning → runtime admission enforcement → clear remediation ownership. Any gap in that chain is where your actual risk lives. The specific tooling matters less than ensuring no step in the chain is missing or purely advisory. And whichever platform you're on — Fargate, Kubernetes, standalone Docker — be honest about what your control surface actually allows rather than architecting for an environment you don't have.