VOL.08 — FIELD NOTES · ENTRY 20/23

Engineering · November 18, 2023 · 7 min read

The Deployment That Broke Everything and What I Did Next

I want to tell you about the deployment that broke everything, because the stories engineers tell are usually the clean ones, and the messy ones are where the real lessons live. This one is messy. It was a Friday. I deployed at 4pm. If you are an experienced engineer, you already know I had made my first mistake before anything went wrong.

The 4pm Friday deploy

There is a reason "no deploys on Friday" is a cliche. It is a cliche because it is true and we all ignore it anyway. I had a change that was ready, the pressure to ship was real, and 4pm felt fine. The change was not even that big. Small changes are how they get you. You let your guard down precisely when you should not.

The size of the change has nothing to do with the size of the explosion. Small deploys often start the biggest fires because nobody watches them closely.

The Slack message

Twenty minutes later, the Slack message. Something is broken. Then another. Then the tone shifted from "is something wrong?" to "everything is down." My stomach did the thing stomachs do. I switched into emergency mode, which is a specific state where your hands move faster than your fear, barely.

The rollback that did not work

Here is where it got bad. I went to roll back, which is supposed to be the safety net, the whole reason you can deploy with any confidence. And the rollback did not work cleanly. The new code had touched something that did not simply reverse: a data shape, a piece of state, the kind of change that a code rollback alone does not undo. So now the system was not just broken, it was broken in a state that my safety net could not catch.

That is the moment the real panic arrives. Not when it breaks. When the thing you were counting on to save you also fails. For a few minutes I genuinely did not know how I was going to get out of it.

Three hours

What followed was three hours of the most focused, terrified work I have done. I stopped flailing and forced myself to be methodical:

  • First, stop the bleeding. Get the system into any stable state, even a degraded one, so users are not staring at errors.
  • Second, understand exactly what the deploy changed, including the side effects the rollback could not reverse.
  • Third, fix forward where rolling back was impossible. Sometimes the only way out is through.
  • Fourth, verify carefully before declaring it over, because a second false "it is fixed" would have destroyed what trust I had left.
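The fourth step is the easiest one to make mechanical. Here is a minimal sketch in Python, assuming a hypothetical `check` callable that returns True when the system looks healthy (for example, a health probe you write yourself, not shown here). The idea is to declare the incident over only after several consecutive passes, so one lucky response does not count as "fixed". The function name, parameters, and thresholds are illustrative assumptions.

```python
import time
from typing import Callable


def verified_stable(
    check: Callable[[], bool],
    required: int = 5,
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only after `required` consecutive healthy checks.

    `check` is a hypothetical probe supplied by the caller. Any failure
    resets the streak, so a flapping system never counts as stable.
    """
    streak = 0
    for _ in range(max_attempts):
        if check():
            streak += 1
            if streak >= required:
                return True
        else:
            streak = 0  # one bad result means start counting again
        sleep(interval_s)
    return False
```

A system that alternates between healthy and unhealthy never builds a streak, which is the point: it stays "not fixed" until it holds steady.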

Three hours later it was genuinely fixed and verified. I was drained in a way that is hard to describe. Not just tired. Hollowed out by sustained adrenaline.

What I learned about shipping

A few things, and they stuck permanently. Never deploy when you cannot babysit it, which means never at 4pm on a Friday. Your rollback is not a real safety net unless you have tested that it actually rolls back, including state and data, not just code. And a calm, methodical approach during a crisis beats frantic action every time, even though every nerve in your body is screaming to just do something.
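To make "tested that it actually rolls back" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is hypothetical: the `users` table, the `migrate_up` and `migrate_down` functions, and the sample rows are stand-ins, not the change from this story. It snapshots the schema and data, applies a migration, reverses it, and checks that the database is back where it started.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, full_name) VALUES (?, ?)",
        [(1, "Ada Lovelace"), (2, "Grace Hopper")],
    )
    conn.commit()
    return conn


def migrate_up(conn: sqlite3.Connection) -> None:
    """Hypothetical migration: split full_name into two columns."""
    conn.execute("ALTER TABLE users RENAME TO users_legacy")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute(
        "INSERT INTO users "
        "SELECT id, substr(full_name, 1, instr(full_name, ' ') - 1), "
        "substr(full_name, instr(full_name, ' ') + 1) FROM users_legacy"
    )
    conn.execute("DROP TABLE users_legacy")
    conn.commit()


def migrate_down(conn: sqlite3.Connection) -> None:
    """Reverse the migration, restoring both the schema and the data."""
    conn.execute("ALTER TABLE users RENAME TO users_split")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute(
        "INSERT INTO users SELECT id, first_name || ' ' || last_name FROM users_split"
    )
    conn.execute("DROP TABLE users_split")
    conn.commit()


def snapshot(conn: sqlite3.Connection):
    """Capture column definitions and rows for comparison."""
    columns = conn.execute("PRAGMA table_info(users)").fetchall()
    rows = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
    return columns, rows


def rollback_is_clean(conn, up, down) -> bool:
    """Apply `up` then `down` and report whether the database is unchanged."""
    before = snapshot(conn)
    up(conn)
    down(conn)
    return snapshot(conn) == before
```

Calling `rollback_is_clean(make_db(), migrate_up, migrate_down)` passes for this pair. Swap in a `down` that does nothing, which is what a code-only rollback amounts to, and the check fails because the columns no longer match. The sample names all contain a space; a name without one would be mangled by this particular split, which is exactly the kind of lossy change such a test is meant to surface when run against realistic data.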

Why I tell this story

Because I see junior engineers terrified of breaking production, as if breaking it makes them frauds. It does not. Every engineer I respect has a story like this. The difference between a junior and a senior is not that the senior never breaks things. It is that the senior has broken things, survived, and built the habits that come only from that survival. I broke everything one Friday. I am a better engineer for it. You will be too, when it is your turn, and it will be your turn.

Saroj Prasad Mainali

Full-Stack Engineer · Kathmandu
