DevOps ShmevOps

Lessons from Software Engineering in Multi-Tenant Infrastructure

Why ITIL Means 'Idle', SRE Isn’t Enough, and DevOps Doesn’t Scale Across the Zoo

Let’s Get Real.

  • ITIL gave us structure
  • Agile gave us cadence
  • DevOps changed how we ship software
  • SRE made reliability a product metric

But in multi-tenant infrastructure engineering — especially across clients in different industry verticals, countless endpoints, and wildly divergent architectures — none of them go far enough.

They weren’t designed for:

  • Stacked and conflicting compliance layers
  • Systems that span generations of tooling and vendors
  • Vertical-specific constraints that can’t be unified into a single pipeline
  • Shared services where one bad ticket can impact ten environments
  • Teams juggling business objectives, infrastructure, identity, support, security, and uptime with a staff of four

This is not an app pipeline.

This is a living system — messy, entangled across physical and digital layers, SLA-bound, and barely held together by the burnout cycle of one overextended unicorn engineer… handing off just enough notes for the next unicorn to pick up the pieces.

We don’t grow engineers to sustain the system — we hunt unicorns to survive it.

And as long as we build systems that depend on the rarity of engineering talent instead of resilience embedded in the architecture, we’ll always be short-staffed, over-leveraged, and one alert notification away from collapse.

ITIL: Idle Time in Layers

There’s something ironic about a framework built to “standardize service delivery” being so... abstract that it only works if everything is predictable, repeatable, and documented — which is to say: never in the actual environments it claims to govern.

ITIL Assumes:

  • You have a CMDB that reflects reality
  • Incident categories are clear before triage
  • Change windows are sacred and followed
  • All actions are approved, ticketed, and categorized before they happen

Now Compare to Reality:

  • “What’s this server?”

    “No idea. But unplug it and three clients go offline.”

  • “Why is this backup writing to Bob’s desktop?”

    “He retired in 2019. But his profile still runs as a scheduled task.”

ITIL was designed for a system you don’t run.

And more than that — for a system that doesn’t evolve.

Agile and DevOps: Built for Builders — Not Operators of Chaos

DevOps was built for:

  • Unified, app-centric teams
  • Infrastructure defined as code
  • Environments you control
  • Pipelines you own
  • Product roadmaps and sprint boards

But in an MSP, or a cross-tenant ops team, you don’t get to pick the stack.

You inherit it. You glue it together. You support it — even when it makes no architectural sense.

You can’t CI/CD your way out of:

  • SonicWall configs from 2013
  • Shadow SaaS systems
  • On-prem Exchange still handling business-critical mail
  • Hybrid Azure AD + local GPOs + 3rd-party MFA

DevOps tells you to “shift left.”

But if you don’t own the architecture, the pipeline, or the culture?

There is no left to shift to.

SRE Helps — If You Own the System

SRE gave us:

  • SLIs, SLOs, and error budgets
  • Golden signals
  • Incident response rituals
  • Reliability as a product spec

But SRE assumes standardization and system ownership.

In a zoo of fragmented infra:

  • You don’t control telemetry end-to-end
  • You might not even have full access to the systems
  • SLAs exist, but SLOs are… aspirational
  • Monitoring tools vary per client — if they exist at all

You can’t reliability-engineer what you can’t observe, can’t access, and don’t fully own.

Welcome to the Zoo

Forget “pets vs. cattle.” That metaphor died with Windows 2008.

Here’s what you're actually running:

  • Legacy pets – fragile, undocumented, critical
  • Sacred cows – architecturally absurd but politically untouchable
  • Ghosts – no known owner, definitely in production
  • Possums – look dead, reboot them and the lights go out
  • Cattle – finally, something you can rebuild
  • Zebras – look compliant, hide deep weirdness
  • Chimeras – half-cloud, half-VM, all nightmare
  • Invasive species – installed by Bob in Accounting, running with root

This isn’t a platform.

It’s a zoo — chaotic, multi-tenant, partially observable, duct-taped together — and you’re still on the hook for uptime.

What You Actually Do: Engineer Interfaces, Not Just Systems

In this world, discipline has to look different.

And software engineering has patterns we can steal.

1. Everything Should Be an Interface

You can’t fix what you can’t isolate.

  • Define interfaces, even if the implementation is messy
  • Encapsulate chaos — contain it, describe it, version it
  • Use contracts as boundaries: VPNs as APIs, client provisioning as request/response
  • Even if the backend relies on folklore logic, the front end should behave predictably

It’s not about purity — it’s about containment.
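
Here is a minimal sketch of what "client provisioning as request/response" can look like in practice. All the names (ProvisionRequest, provision_client, the legacy helper stubs) are hypothetical; the point is that callers see one stable contract while the folklore stays behind it.

from dataclasses import dataclass

@dataclass
class ProvisionRequest:
    client_id: str
    tier: str  # e.g. "standard" or "hipaa"

@dataclass
class ProvisionResult:
    ok: bool
    detail: str

def _run_legacy_ad_script(client_id: str) -> None:
    # Stand-in for the undocumented AD provisioning script.
    pass

def _request_vendor_vpn(client_id: str, tier: str) -> None:
    # Stand-in for the vendor portal step that has no real API.
    pass

def provision_client(req: ProvisionRequest) -> ProvisionResult:
    # The contract is stable; everything behind it can stay messy for now.
    try:
        _run_legacy_ad_script(req.client_id)
        _request_vendor_vpn(req.client_id, req.tier)
        return ProvisionResult(ok=True, detail="provisioned")
    except Exception as exc:
        # Chaos is contained: callers always get the same failure shape.
        return ProvisionResult(ok=False, detail=str(exc))

print(provision_client(ProvisionRequest(client_id="acme_corp", tier="standard")))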

2. GitOps for Infra, Not Just Code

Apply GitOps to:

  • DNS zones
  • Firewall rules
  • Patch flows
  • Client onboarding
  • Compliance checklists
  • Escalation workflows

If it’s procedural, version it.

If it’s versioned, you can refactor it.
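
As a sketch of what "version it" can mean, assume a hypothetical repo file like clients/acme_corp/firewall.yaml and a small CI gate that refuses to merge malformed rules. The layout and field names are illustrative, not a prescribed schema.

# ci_check_firewall.py: hypothetical CI gate for a versioned rules file,
# e.g. clients/acme_corp/firewall.yaml:
#   rules:
#     - {name: vpn-allow, port: 443, action: allow}
#     - {name: rdp-block, port: 3389, action: deny}
import sys
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "port", "action"}

def validate(path):
    errors = []
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}
    for rule in doc.get("rules", []):
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            errors.append(f"{rule.get('name', '?')}: missing {sorted(missing)}")
        elif rule["action"] not in ("allow", "deny"):
            errors.append(f"{rule['name']}: unknown action {rule['action']!r}")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print("FAIL:", p)
    sys.exit(1 if problems else 0)

Once the rules live in Git and the check runs on every merge, a firewall change becomes a reviewable diff instead of a mystery someone performed in a vendor console.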

3. Design Observability in Layers

You don’t own every telemetry pipeline.

So you build around that:

  • Sidecar collectors
  • Exporter proxies
  • Canonical internal formats
  • Semantic definitions for latency, throughput, error rate — across tools

Normalize the signal, not the tool.

Observability abstraction isn’t a luxury — it’s survival in a zoo.
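
A minimal sketch of normalizing the signal rather than the tool: the payload shapes below are made-up stand-ins for whatever each client's monitoring product emits, and every adapter reduces them to one canonical internal record.

from dataclasses import dataclass

@dataclass
class Signal:
    client: str
    metric: str   # canonical name: "latency_ms", "error_rate", ...
    value: float

# Each adapter knows one tool's quirks; everything downstream sees Signal.
def from_tool_a(client: str, raw: dict) -> Signal:
    # Hypothetical payload: {"sensor": "http", "lastvalue_raw": 412}
    return Signal(client, "latency_ms", float(raw["lastvalue_raw"]))

def from_tool_b(client: str, raw: dict) -> Signal:
    # Hypothetical payload: {"metric": "http.latency", "points": [[0, 0.412]]}
    return Signal(client, "latency_ms", raw["points"][-1][1] * 1000)

signals = [
    from_tool_a("acme_corp", {"sensor": "http", "lastvalue_raw": 412}),
    from_tool_b("biggov", {"metric": "http.latency", "points": [[0, 0.412]]}),
]
# One threshold definition, every client, whatever they happen to run.
print([s.client for s in signals if s.value > 300])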

4. Infrastructure is a Product. Build Like It.

If your internal tooling isn’t versioned, it’s implicit and passed on by water-cooler folklore.

If your procedures aren’t modular, you’re hand-coding ops.

Build:

  • Shared services: VPN-as-a-Service, onboarding templates
  • Training designed like a UX flow, not just forwarded docs
  • Reusable automation flows

Think like a platform team — even if your platform is really a heterogeneous collection of many.
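
One way to read "shared services" and "reusable automation flows", sketched with hypothetical names: onboarding becomes a versioned, parameterized template you apply per client, rather than a checklist someone retypes.

from dataclasses import dataclass, field

@dataclass
class OnboardingTemplate:
    version: str
    steps: list = field(default_factory=list)

# One versioned template instead of a per-client checklist.
STANDARD_V2 = OnboardingTemplate(
    version="2.1.0",
    steps=["create_ou", "apply_baseline_gpo", "enroll_mfa", "enable_log_forwarding"],
)

def onboard(client_id, template):
    # Each step name maps to an automation flow elsewhere; here we only record the plan.
    return {"client": client_id, "template": template.version, "plan": list(template.steps)}

print(onboard("acme_corp", STANDARD_V2))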

5. Runbooks Are Living Code

  • Version-controlled
  • Linked to telemetry
  • Reviewed and rehearsed
  • Tied to actual system state
  • Executable, testable, auditable

A runbook that’s not testable is just a wiki page getting old.
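
A minimal sketch of "executable, testable, auditable", assuming a Linux host with systemd and a made-up service name: each runbook step is code that checks real system state and records the result, so a stale runbook fails loudly instead of lying quietly.

import datetime
import subprocess

def step(description, command, expect_in_output):
    # One runbook step: run it, verify it, record when it was checked.
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "step": description,
        "passed": expect_in_output in result.stdout,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# A rehearsable "verify backups" fragment; the service name is hypothetical.
audit_log = [
    step("backup service is active", ["systemctl", "is-active", "backup.service"], "active"),
]
print(audit_log)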

Billing Risk, SalesOps Gravity, and the Contractual Edge

You’re not just running infrastructure — you’re fulfilling contracts.

  • SLAs signed by sales
  • Custom deliverables for key accounts
  • Penalty clauses and billing triggers
  • Legacy exceptions you can’t rewrite

Every contract creates runtime boundaries:

  • Billing models tied to runtime behavior
  • Patch windows encoded in legal PDFs
  • Custom uptime guarantees
  • Support tiers bypassing queue logic

Engineering for Contractual Gravity

  • Version Your Commitments

    Expose them as API-like endpoints:

GET /client/acme_corp/sla → 99.99 / 5m
GET /client/acme_corp/billing_model → per-seat
  • Instrument What You Promise

    If it matters to a contract, it has to be in telemetry.

  • Invert Tooling Assumptions

    Make tools reflect contractual logic, not the other way around.

  • Policy Engines for Exceptions

client: BigGov
sla:
  availability: 99.99
billing_model: static_mrr_plus_usage
patch_window:
  days: [Mon, Wed]
  hours: [02:00–04:00 UTC]
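
A sketch of how a policy like the one above might actually be consumed; the parsing and field names are illustrative. The contractual patch window becomes a runtime check instead of a clause someone has to remember.

import datetime

# The BigGov policy above, parsed into plain data (hours are UTC).
POLICY = {
    "client": "BigGov",
    "patch_window": {"days": ["Mon", "Wed"], "hours": (2, 4)},
}

def patching_allowed(policy, now):
    window = policy["patch_window"]
    in_day = now.strftime("%a") in window["days"]
    start, end = window["hours"]
    return in_day and start <= now.hour < end

now = datetime.datetime.now(datetime.timezone.utc)
if not patching_allowed(POLICY, now):
    print(f"Blocked: outside {POLICY['client']} contractual patch window")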

Story Cards as Interface Contracts

In the zoo, a story card isn’t a task — it’s a boundary object.

Done right, it becomes a:

  • Negotiated contract
  • Shared context unit
  • Traceable spec for infra change
  • Trigger for ops and finance alignment

A Real Story in the Wild

Story: Provision compliance-aligned identity boundary for Org X

Context:

Org X signed a HIPAA-aligned MSA. This triggers Tier 2 controls by Q2.

Scope:

  • OU structure, GPOs
  • MFA policies
  • Log forwarding
  • SaaS federation

Constraints:

  • 30m downtime max
  • Sandbox test required
  • Patch window: Wed 01:00–03:00 UTC

Acceptance Criteria:

  • Controls in place
  • Logs flowing
  • Cutover <30m
  • Federation verified

Link to Contract: