DevOps ShmevOps

Lessons from Software Engineering in Multi-Tenant Infrastructure

Why ITIL Means 'Idle', SRE Isn’t Enough, and DevOps Doesn’t Scale Across the Zoo

Let’s Get Real.

  • ITIL gave us structure
  • Agile gave us cadence
  • DevOps changed how we ship software
  • SRE made reliability a product metric

But in multi-tenant infrastructure engineering — especially across clients in different industry verticals, countless endpoints, and wildly divergent architectures — none of them go far enough.

They weren’t designed for:

  • Stacked and conflicting compliance layers
  • Systems that span generations of tooling and vendors
  • Vertical-specific constraints that can’t be unified into a single pipeline
  • Shared services where one bad ticket can impact ten environments
  • Teams juggling business objectives, infrastructure, identity, support, security, and uptime with a staff of four

This is not an app pipeline.

This is a living system — messy, entangled across physical and digital layers, SLA-bound, and barely held together by the burnout cycle of one overextended unicorn engineer… handing off just enough notes for the next unicorn to pick up the pieces.

We don’t grow engineers to sustain the system — we hunt unicorns to survive it.

And as long as we build systems that depend on the rarity of engineering talent instead of resilience embedded in the architecture, we’ll always be short-staffed, over-leveraged, and one alert notification away from collapse.

ITIL: Idle Time in Layers

There’s something ironic about a framework built to “standardize service delivery” being so... abstract that it only works if everything is predictable, repeatable, and documented — which is to say: never in the actual environments it claims to govern.

ITIL Assumes:

  • You have a CMDB that reflects reality
  • Incident categories are clear before triage
  • Change windows are sacred and followed
  • All actions are approved, ticketed, and categorized before they happen

Now Compare to Reality:

  • “What’s this server?”

    “No idea. But unplug it and three clients go offline.”

  • “Why is this backup writing to Bob’s desktop?”

    “He retired in 2019. But his profile still runs as a scheduled task.”

ITIL was designed for a system you don’t run.

And more than that — for a system that doesn’t evolve.

Agile and DevOps: Built for Builders — Not Operators of Chaos

DevOps was built for:

  • Unified, app-centric teams
  • Infrastructure defined as code
  • Environments you control
  • Pipelines you own
  • Product roadmaps and sprint boards

But in an MSP, or a cross-tenant ops team, you don’t get to pick the stack.

You inherit it. You glue it together. You support it — even when it makes no architectural sense.

You can’t CI/CD your way out of:

  • SonicWall configs from 2013
  • Shadow SaaS systems
  • On-prem Exchange still handling business-critical mail
  • Hybrid Azure AD + local GPOs + 3rd-party MFA

DevOps tells you to “shift left.”

But if you don’t own the architecture, the pipeline, or the culture?

There is no left to shift to.

SRE Helps — If You Own the System

SRE gave us:

  • SLIs, SLOs, and error budgets
  • Golden signals
  • Incident response rituals
  • Reliability as a product spec

But SRE assumes standardization and system ownership.

In a zoo of fragmented infra:

  • You don’t control telemetry end-to-end
  • You might not even have full access to the systems
  • SLAs exist, but SLOs are… aspirational
  • Monitoring tools vary per client — if they exist at all

You can’t reliability-engineer what you can’t observe, can’t access, and don’t fully own.

Welcome to the Zoo

Forget “pets vs. cattle.” That metaphor died with Windows 2008.

Here’s what you're actually running:

  • Legacy pets – fragile, undocumented, critical
  • Sacred cows – architecturally absurd but politically untouchable
  • Ghosts – no known owner, definitely in production
  • Possums – look dead, reboot them and the lights go out
  • Cattle – finally, something you can rebuild
  • Zebras – look compliant, hide deep weirdness
  • Chimeras – half-cloud, half-VM, all nightmare
  • Invasive species – installed by Bob in Accounting, running with root

This isn’t a platform.

It’s a zoo — chaotic, multi-tenant, partially observable, duct-taped together — and you’re still on the hook for uptime.

What You Actually Do: Engineer Interfaces, Not Just Systems

In this world, discipline has to look different.

And software engineering has patterns we can steal.

1. Everything Should Be an Interface

You can’t fix what you can’t isolate.

  • Define interfaces, even if the implementation is messy
  • Encapsulate chaos — contain it, describe it, version it
  • Use contracts as boundaries: VPNs as APIs, client provisioning as request/response
  • Even if the backend relies on folklore logic, the front end should behave predictably

It’s not about purity — it’s about containment.
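
Here is a minimal sketch of what "client provisioning as request/response" can look like in practice. All the names (ProvisionRequest, provision_client, the legacy helper stubs) are hypothetical; the point is that callers see one stable contract while the folklore stays behind it.

from dataclasses import dataclass

@dataclass
class ProvisionRequest:
    client_id: str
    tier: str  # e.g. "standard" or "hipaa"

@dataclass
class ProvisionResult:
    ok: bool
    detail: str

def _run_legacy_ad_script(client_id: str) -> None:
    # Stand-in for the undocumented AD provisioning script.
    pass

def _request_vendor_vpn(client_id: str, tier: str) -> None:
    # Stand-in for the vendor portal step that has no real API.
    pass

def provision_client(req: ProvisionRequest) -> ProvisionResult:
    # The contract is stable; everything behind it can stay messy for now.
    try:
        _run_legacy_ad_script(req.client_id)
        _request_vendor_vpn(req.client_id, req.tier)
        return ProvisionResult(ok=True, detail="provisioned")
    except Exception as exc:
        # Chaos is contained: callers always get the same failure shape.
        return ProvisionResult(ok=False, detail=str(exc))

print(provision_client(ProvisionRequest(client_id="acme_corp", tier="standard")))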

2. GitOps for Infra, Not Just Code

Apply GitOps to:

  • DNS zones
  • Firewall rules
  • Patch flows
  • Client onboarding
  • Compliance checklists
  • Escalation workflows

If it’s procedural, version it.

If it’s versioned, you can refactor it.
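
As a sketch of what "version it" can mean, assume a hypothetical repo file like clients/acme_corp/firewall.yaml and a small CI gate that refuses to merge malformed rules. The layout and field names are illustrative, not a prescribed schema.

# ci_check_firewall.py: hypothetical CI gate for a versioned rules file,
# e.g. clients/acme_corp/firewall.yaml:
#   rules:
#     - {name: vpn-allow, port: 443, action: allow}
#     - {name: rdp-block, port: 3389, action: deny}
import sys
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "port", "action"}

def validate(path):
    errors = []
    with open(path) as fh:
        doc = yaml.safe_load(fh) or {}
    for rule in doc.get("rules", []):
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            errors.append(f"{rule.get('name', '?')}: missing {sorted(missing)}")
        elif rule["action"] not in ("allow", "deny"):
            errors.append(f"{rule['name']}: unknown action {rule['action']!r}")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print("FAIL:", p)
    sys.exit(1 if problems else 0)

Once the rules live in Git and the check runs on every merge, a firewall change becomes a reviewable diff instead of a mystery someone performed in a vendor console.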

3. Design Observability in Layers

You don’t own every telemetry pipeline.

So you build around that:

  • Sidecar collectors
  • Exporter proxies
  • Canonical internal formats
  • Semantic definitions for latency, throughput, error rate — across tools

Normalize the signal, not the tool.

Observability abstraction isn’t a luxury — it’s survival in a zoo.
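
A minimal sketch of normalizing the signal rather than the tool: the payload shapes below are made-up stand-ins for whatever each client's monitoring product emits, and every adapter reduces them to one canonical internal record.

from dataclasses import dataclass

@dataclass
class Signal:
    client: str
    metric: str   # canonical name: "latency_ms", "error_rate", ...
    value: float

# Each adapter knows one tool's quirks; everything downstream sees Signal.
def from_tool_a(client: str, raw: dict) -> Signal:
    # Hypothetical payload: {"sensor": "http", "lastvalue_raw": 412}
    return Signal(client, "latency_ms", float(raw["lastvalue_raw"]))

def from_tool_b(client: str, raw: dict) -> Signal:
    # Hypothetical payload: {"metric": "http.latency", "points": [[0, 0.412]]}
    return Signal(client, "latency_ms", raw["points"][-1][1] * 1000)

signals = [
    from_tool_a("acme_corp", {"sensor": "http", "lastvalue_raw": 412}),
    from_tool_b("biggov", {"metric": "http.latency", "points": [[0, 0.412]]}),
]
# One threshold definition, every client, whatever they happen to run.
print([s.client for s in signals if s.value > 300])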

4. Infrastructure is a Product. Build Like It.

If your internal tooling isn’t versioned, it’s implicit and passed on by water-cooler folklore.

If your procedures aren’t modular, you’re hand-coding ops.

Build:

  • Shared services: VPN-as-a-Service, onboarding templates
  • Training designed like a UX flow, not just forwarded docs
  • Reusable automation flows

Think like a platform team — even if your platform is really a heterogeneous collection of many.
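
One way to read "shared services" and "reusable automation flows", sketched with hypothetical names: onboarding becomes a versioned, parameterized template you apply per client, rather than a checklist someone retypes.

from dataclasses import dataclass, field

@dataclass
class OnboardingTemplate:
    version: str
    steps: list = field(default_factory=list)

# One versioned template instead of a per-client checklist.
STANDARD_V2 = OnboardingTemplate(
    version="2.1.0",
    steps=["create_ou", "apply_baseline_gpo", "enroll_mfa", "enable_log_forwarding"],
)

def onboard(client_id, template):
    # Each step name maps to an automation flow elsewhere; here we only record the plan.
    return {"client": client_id, "template": template.version, "plan": list(template.steps)}

print(onboard("acme_corp", STANDARD_V2))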

5. Runbooks Are Living Code

  • Version-controlled
  • Linked to telemetry
  • Reviewed and rehearsed
  • Tied to actual system state
  • Executable, testable, auditable

A runbook that’s not testable is just a wiki page getting old.
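
A minimal sketch of "executable, testable, auditable", assuming a Linux host with systemd and a made-up service name: each runbook step is code that checks real system state and records the result, so a stale runbook fails loudly instead of lying quietly.

import datetime
import subprocess

def step(description, command, expect_in_output):
    # One runbook step: run it, verify it, record when it was checked.
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "step": description,
        "passed": expect_in_output in result.stdout,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# A rehearsable "verify backups" fragment; the service name is hypothetical.
audit_log = [
    step("backup service is active", ["systemctl", "is-active", "backup.service"], "active"),
]
print(audit_log)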

Billing Risk, SalesOps Gravity, and the Contractual Edge

You’re not just running infrastructure — you’re fulfilling contracts.

  • SLAs signed by sales
  • Custom deliverables for key accounts
  • Penalty clauses and billing triggers
  • Legacy exceptions you can’t rewrite

Every contract creates runtime boundaries:

  • Billing models tied to runtime behavior
  • Patch windows encoded in legal PDFs
  • Custom uptime guarantees
  • Support tiers bypassing queue logic

Engineering for Contractual Gravity

  • Version Your Commitments

    Expose them as API-like endpoints:

GET /client/acme_corp/sla → 99.99 / 5m
GET /client/acme_corp/billing_model → per-seat
  • Instrument What You Promise

    If it matters to a contract, it has to be in telemetry.

  • Invert Tooling Assumptions

    Make tools reflect contractual logic, not the other way around.

  • Policy Engines for Exceptions

client: BigGov
sla:
  availability: 99.99
billing_model: static_mrr_plus_usage
patch_window:
  days: [Mon, Wed]
  hours: [02:00–04:00 UTC]
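
A sketch of how a policy like the one above might actually be consumed; the parsing and field names are illustrative. The contractual patch window becomes a runtime check instead of a clause someone has to remember.

import datetime

# The BigGov policy above, parsed into plain data (hours are UTC).
POLICY = {
    "client": "BigGov",
    "patch_window": {"days": ["Mon", "Wed"], "hours": (2, 4)},
}

def patching_allowed(policy, now):
    window = policy["patch_window"]
    in_day = now.strftime("%a") in window["days"]
    start, end = window["hours"]
    return in_day and start <= now.hour < end

now = datetime.datetime.now(datetime.timezone.utc)
if not patching_allowed(POLICY, now):
    print(f"Blocked: outside {POLICY['client']} contractual patch window")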

Story Cards as Interface Contracts

In the zoo, a story card isn’t a task — it’s a boundary object.

Done right, it becomes a:

  • Negotiated contract
  • Shared context unit
  • Traceable spec for infra change
  • Trigger for ops and finance alignment

A Real Story in the Wild

Story: Provision compliance-aligned identity boundary for Org X

Context:

Org X signed a HIPAA-aligned MSA. This triggers Tier 2 controls by Q2.

Scope:

  • OU structure, GPOs
  • MFA policies
  • Log forwarding
  • SaaS federation

Constraints:

  • 30m downtime max
  • Sandbox test required
  • Patch window: Wed 01:00–03:00 UTC

Acceptance Criteria:

  • Controls in place
  • Logs flowing
  • Cutover <30m
  • Federation verified

Link to Contract: