You Should Break Prod

Recently, I’ve seen people push back on something I strongly believe in:

You only truly understand systems when you break them and fix them.

Some folks argue that production should be sacred, that failure is bad, and that “breaking things” is reckless. And yes, breaking actual production is never the goal. But if you’re serious about learning how systems behave under pressure (how they fail, recover, or silently corrupt data), then breaking things in a safe environment is one of the best teachers you’ll ever have.

Failure > Theory

When everything works, your mental model doesn’t get tested. You build a service, deploy it, and it runs. Cool. But what happens when:

  • a disk fills up?
  • DNS breaks?
  • a container gets OOM-killed?
  • your message queue reorders events?

If you haven’t seen these failures before, the first time will be painful. But if you’ve deliberately broken things before (if you’ve "explored modes of failure"), you’ll be calm, fast, and effective.
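
You don’t have to wait for these to happen organically, either. As a rough sketch (in Python, assuming a small throwaway volume that you are allowed to fill completely, never your main disk), forcing a disk-full error looks something like this:

```python
import errno
import os

# Assumption: SCRATCH lives on a small, disposable test volume that you are
# allowed to fill completely. Do NOT point this at your main disk.
SCRATCH = "/mnt/chaos-scratch"
os.makedirs(SCRATCH, exist_ok=True)

chunk = b"\0" * (64 * 1024 * 1024)  # 64 MB per filler file
count = 0
try:
    while True:
        with open(os.path.join(SCRATCH, f"filler-{count}.bin"), "wb") as f:
            f.write(chunk)
        count += 1
except OSError as e:
    if e.errno == errno.ENOSPC:
        print(f"volume full after {count} files - now exercise your app's write path")
    else:
        raise
```

Point your service’s data directory at that volume, watch what it logs (or fails to log) once its own writes start returning ENOSPC, and clean up the filler files when you’re done.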

Exploring Modes of Failure

I don’t know where I first heard the phrase “explore modes of failure,” but it stuck with me.

It perfectly captures the mindset. Don’t just build and test for the happy path. Go out of your way to understand how things behave when they don’t work as intended.

You don’t need a chaos engineering team like Netflix. You don’t need fancy fault injection tools. You just need curiosity and a willingness to get your hands dirty.

Try:

  • Randomly killing containers during load
  • Simulating packet loss or high latency
  • Expiring TLS certificates and watching clients fail
  • Turning off network interfaces
  • Locking a DB table and observing retries or timeouts
  • Deleting your .env file and restarting your app

You’ll be surprised what breaks. And what doesn’t.
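
Take the table-locking item as an example: you don’t need a production-grade database to feel it. The sketch below uses Python’s built-in sqlite3 purely as a stand-in for whatever database you actually run, and shows the difference between a write that fails loudly after a bounded timeout and one that would otherwise just hang:

```python
import sqlite3
import threading
import time

DB = "/tmp/chaos-demo.db"  # any throwaway path works

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, note TEXT)")
setup.commit()
setup.close()

def hold_lock(seconds: float) -> None:
    """Grab an exclusive lock and sit on it, like a stuck migration or transaction."""
    conn = sqlite3.connect(DB, isolation_level=None)  # autocommit; we manage the txn
    conn.execute("BEGIN EXCLUSIVE")
    time.sleep(seconds)
    conn.rollback()
    conn.close()

def try_write() -> None:
    """What your app does meanwhile: give up after 1 second instead of hanging forever."""
    conn = sqlite3.connect(DB, timeout=1.0)
    try:
        conn.execute("INSERT INTO orders (note) VALUES ('hello')")
        conn.commit()
        print("write succeeded")
    except sqlite3.OperationalError as e:
        print(f"write failed while the table was locked: {e}")
    finally:
        conn.close()

locker = threading.Thread(target=hold_lock, args=(5.0,))
locker.start()
time.sleep(0.5)  # give the background thread time to take the lock
try_write()
locker.join()
```

Swap the sqlite3 calls for your real database driver and the shape stays the same: the interesting question is whether your code times out, retries, or silently hangs.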

Big Companies Do This, But You Can Too

At scale, companies like Netflix, Google, and Shopify run structured failure simulations. They simulate region outages, API timeouts, cascading crashes. They do it with dashboards, playbooks, and postmortems.

You don’t need all that.

You can play chaos monkey solo, in dev or staging.

  • Have a “break day” where you experiment with failure
  • Write a small script that deletes a random container every hour (see the sketch below)
  • Kill your database mid-request and see what logs pop up
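
Here is a minimal version of that container-deleting script as a Python sketch (assuming Docker is installed and the docker CLI is on your PATH); point it at a dev or staging host, never at production:

```python
import random
import subprocess
import time

def delete_random_container() -> None:
    # Ask the docker CLI for the IDs of all running containers.
    result = subprocess.run(["docker", "ps", "-q"], capture_output=True, text=True, check=True)
    ids = result.stdout.split()
    if not ids:
        print("no running containers - nothing to break")
        return
    victim = random.choice(ids)
    print(f"force-removing container {victim}")
    # `docker rm -f` kills and removes the container in one step.
    subprocess.run(["docker", "rm", "-f", victim], check=True)

if __name__ == "__main__":
    while True:
        delete_random_container()
        time.sleep(60 * 60)  # once an hour
```

A cron job or systemd timer works just as well as the built-in loop; the point is that the outage arrives on a schedule you didn’t plan around.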

For example, at sliplane.io we regularly break non-critical parts on purpose to make sure they don’t take down critical ones. We once found a serious case of the thundering herd problem that could have escalated badly if we hadn’t caught it!
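
For context, a thundering herd is what happens when many clients notice a failure at the same moment and then all retry at the same moment, hammering whatever just came back up. The usual way to defuse it is backoff with jitter; the sketch below is the generic textbook version of that idea, not our exact fix:

```python
import random
import time

def call_with_backoff(operation, attempts: int = 5, base: float = 0.5, cap: float = 30.0):
    """Retry `operation` with exponential backoff plus full jitter, so a crowd of
    recovering clients spreads its retries out instead of stampeding in lockstep."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only the errors you expect
            if attempt == attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)
```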

It’s fun. And educational. But it’s also important to remember:

Know Where to Stop

The law of diminishing returns applies.

Once you’ve seen what a disk full error looks like, you don’t need to simulate it weekly.

And there’s a difference between useful failure and pure fantasy.

Sure, Netflix might simulate a whole AWS region going dark, but for most of us, simulating a host failure or a network partition is more realistic and relevant.

Focus on what’s likely, not what’s cinematic.

Pressure Builds Skill

Here’s the thing about real outages.

They’re not just technical. They’re psychological.

Suddenly your Slack is blowing up, Grafana is red, and your brain is running at 120 percent. Everyone’s watching. Every second matters. And if you’ve never experienced that kind of pressure before, it’s easy to freeze, panic, or start guessing.

But if you’ve seen worse (even in controlled environments), you stay cool. You follow the logs. You trust your instincts. You debug instead of flailing.

And most importantly, you fix it faster.

That experience also changes how you build. You write code assuming:

  • the network is unreliable
  • the database might go away
  • retries will happen
  • users will send malformed input
  • things fail and recover

You move from “it works on my machine” to “it fails gracefully on every machine.”
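
To make that concrete, here is a small, entirely hypothetical sketch (the preferences service, its protocol, and the defaults are made up) of what “fails gracefully” tends to look like in code: bounded timeouts, tolerance for malformed input, and a safe fallback instead of a crash:

```python
import json
import socket

DEFAULT_PREFS = {"theme": "light", "notifications": True}

def parse_prefs(raw: bytes) -> dict:
    """Users will send malformed input: treat bad payloads as 'use the defaults'."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        return {**DEFAULT_PREFS, **data}
    except (ValueError, UnicodeDecodeError):
        return dict(DEFAULT_PREFS)

def fetch_prefs(host: str, port: int, user_id: str) -> dict:
    """The network is unreliable and the backend might go away: bound every call
    with a timeout and degrade to defaults instead of hanging or crashing."""
    try:
        with socket.create_connection((host, port), timeout=2.0) as sock:
            sock.sendall(f"GET {user_id}\n".encode())
            return parse_prefs(sock.recv(4096))
    except OSError:
        # Connection refused, DNS failure, timeout: degrade, don't die.
        return dict(DEFAULT_PREFS)
```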

No, You Are Not Too Good for That

If you’re thinking “this doesn’t happen to me, I’m the best programmer!”, sorry, but no. Incidents are normal and happen to the best. Mistakes will happen; make sure you are prepared to handle them. Don’t let your ego get in your way!

Final Thoughts

You don’t need to break production.

But you should break something.

Test environments, staging setups, even your own laptop are fair game. Break things. Learn how they fail. Fix them. You’ll become a better engineer, more confident, more resilient, and more empathetic toward your infrastructure.

Don’t fear failure.

Explore it.

Cheers,

Jonas
Co-Founder, sliplane.io