  • When Safeguards Fail: Optimize for recovery

    Meet Kevin.

    She (yes, Kevin, a story for another day) loves spending time on our third-floor balcony. While I trust her inherent cat skills, my wife is less convinced. Her doubts prompted me to install lattice-style garden netting around the balcony. I spent arguably too much time making sure there were no holes or weaknesses in this safeguard. Surely this would protect her!

    Despite my meticulous efforts, one day I spotted Kevin on the other side of the fence. She calmly attempted to return to safety, while I hastily dismantled parts of the fence to help her get back to the other side. Unfortunately, she either didn’t trust or understand what I was doing, as if not knowing she was inches away from a three-story fall…

    As I reached over the railing, Kevin playfully swatted at my hands, lost her footing, and fell like a flying squirrel to the grassy lawn below. Thankfully, she landed unscathed but shaken.

    This incident made me wonder: Did the fence, acting as a safeguard, help or hinder the situation? The fence (arguably) reduced exposure to potential accidents. However, it also got in the way of a swift and safe resolution once the safeguard was breached. Would she have fallen at all if it wasn’t there?

    Optimize for Recovery

    In software development, we employ numerous safeguards, like code reviews, CI tests, and staging or canary deploys, to prevent errors from reaching production. These measures help keep code reliable, reduce the risks of deploying to production, and ultimately protect our users.

    As we know, bugs are inevitable. No amount of safeguards can guarantee flawless software. When issues do arise, it’s essential to swiftly address them and learn from the experience.

    Kevin’s balcony situation surfaced a few takeaways that apply when setting up safeguards in software development:

    • Expect the Unexpected: Be prepared for the occasional breach of safeguards, and have a plan in place to manage such incidents quickly.
    • Optimize for Recovery: Safeguards shouldn’t impede quick resolution or reversion. Prioritize incident response time.
    • Strike a Balance: Find the right equilibrium between adding safety checks and keeping your team able to respond to issues quickly.

    Consider these additional tips to improve response time when the safeguards fail:

    • Have a clear “In Case of Emergency” document. Ensure everyone knows where to find it and checks it by reflex. Put the link to it in the alerting or monitoring itself.
    • Equip your team with revert commands that can bypass required CI checks or time-consuming builds, enabling a faster resolution.
    • Don’t let a single individual or team become a bottleneck in the incident response process. Ensure that anyone with deploy permissions can revert a bad commit, increasing your team’s overall agility.
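
    To make these tips a little more concrete, here is a minimal Python sketch of what they might look like in practice. Everything specific in it is an assumption for illustration: the runbook URL, the `./deploy.sh` script and its `--skip-slow-checks` flag, and the "deploy" permission name are all hypothetical stand-ins for whatever your own tooling provides. Only the `git` commands are real. The sketch builds the list of commands for an emergency revert rather than running them, so the plan can be reviewed or printed first.

```python
import shlex
from dataclasses import dataclass

# Hypothetical location of the "In Case of Emergency" document.
RUNBOOK_URL = "https://example.com/runbooks/in-case-of-emergency"


@dataclass(frozen=True)
class Alert:
    """An alert that always carries the runbook link with it."""
    name: str
    summary: str
    runbook_url: str = RUNBOOK_URL

    def render(self) -> str:
        # The link travels with the alert, so nobody has to hunt for it.
        return f"[{self.name}] {self.summary}\nIn case of emergency: {self.runbook_url}"


def can_revert(permissions: set[str]) -> bool:
    """Anyone who can deploy can also revert (assumed permission name)."""
    return "deploy" in permissions


def build_revert_plan(bad_commit: str, branch: str = "main",
                      skip_slow_checks: bool = True) -> list[list[str]]:
    """Return the commands for an emergency revert without executing them."""
    plan = [
        ["git", "fetch", "origin", branch],
        ["git", "checkout", branch],
        ["git", "revert", "--no-edit", bad_commit],
        ["git", "push", "origin", branch],
    ]
    # './deploy.sh' and '--skip-slow-checks' are hypothetical: substitute
    # whatever fast path your own deploy tooling offers.
    deploy = ["./deploy.sh", "--ref", branch]
    if skip_slow_checks:
        deploy.append("--skip-slow-checks")
    plan.append(deploy)
    return plan


if __name__ == "__main__":
    print(Alert("HighErrorRate", "5xx rate above threshold").render())
    if can_revert({"deploy"}):
        for command in build_revert_plan("abc123"):
            print(shlex.join(command))
```

    Returning the plan as data instead of executing it is a deliberate choice in this sketch: during an incident, a dry run that prints exactly what will happen is itself a small safeguard that does not slow the recovery down.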

    Kevin’s balcony adventure served as a reminder of the delicate balance between implementing safeguards and trusting our abilities, whether we’re dealing with cats and fences, or developers and code.