As more and more reports are released confirming initial suspicions that the recent Facebook, WhatsApp, and Instagram outage was the result of poor configuration management, I am loath to think how many teams may be encouraged to look at the problem only literally.
Having co-authored Configuration Management Best Practices: Practical Methods that Work in the Real World, I am far from skeptical of the importance of strong change control procedures and triple-testing software. However, as a seasoned psychology professional with expertise in the IT world, I know that there’s often more going on below the surface-level discovery. While people are often quick to turn the conversation to the more easily quantifiable variables that play a role in instigating or averting disaster, what appears to be solely the result of a small but critical error is often also a natural outgrowth of individual and group behaviors.
Sometimes when Aiello Consulting is called in to assess a company’s DevOps culture, we find that there simply aren’t well-defined Lean IT control procedures (such as rapid and reliable change control reviews, with all of the right stakeholders involved). Other times, the procedures may be understood by everyone, and yet not followed by anyone. In both cases, but especially the latter, it is rarely a lack of computer science knowledge that is cause for concern. Rather, it is the lack of collaboration between employees that is troublesome. Developers working on the latest messaging feature may not regularly interact with DNS experts, and vice versa. Most of the time such siloed thinking won’t disrupt anything, but occasionally it can mean that changes are not properly reviewed and approved by the relevant parties.
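To make the idea of involving "all of the right stakeholders" concrete, here is a minimal sketch of a change control check that flags which teams have not yet signed off on a change. Everything in it is an assumption for illustration: the `ChangeRequest` model, the area names, and the team names in `REQUIRED_REVIEWERS` are hypothetical and do not describe any real company's process or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed change and the system areas it touches (hypothetical model)."""
    title: str
    touched_areas: set[str]
    approved_by: set[str] = field(default_factory=set)


# Hypothetical mapping from a system area to the team that must review it.
REQUIRED_REVIEWERS = {
    "messaging": "messaging-dev",
    "dns": "dns-ops",
    "network": "network-ops",
}


def missing_approvals(change: ChangeRequest) -> set[str]:
    """Return the teams that still need to approve this change."""
    required = {
        REQUIRED_REVIEWERS[area]
        for area in change.touched_areas
        if area in REQUIRED_REVIEWERS
    }
    return required - change.approved_by


# A feature change that also touches DNS, approved only by its own team.
change = ChangeRequest(
    title="Roll out new messaging feature",
    touched_areas={"messaging", "dns"},
    approved_by={"messaging-dev"},
)
print(missing_approvals(change))  # {'dns-ops'}
```

A check like this cannot create collaboration on its own, but it can make a missing conversation visible before a release rather than after it.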
We cannot only talk to our own team members; we must establish and maintain cross-functional communication.
Almost too poetically, Facebook’s infrastructure, much like many of the team members we interview in our assessments, appears to be so self-reliant that it is not always reliable. Some articles even allege that the damage extended to company badges, which were temporarily unable to grant access to the corporate offices. Good, honest DevOps culture requires that we come to terms with the fact that incidents will occur. Rather than funnel all of our resources into denying their existence, we have a responsibility to our stakeholders to work together and interface with other products and services.
We cannot only use our own servers; we must appreciate other platforms for their utility, even if only as a Plan B.
Still, responsibly hoping for the best while preparing for the worst requires more than a backup plan. It’s crucial to always weigh the costs and benefits of making any change to our systems. While no release comes without a risk, neither does doing nothing.
Technology experts can be quick to dismiss the idea of “risk”, regularly insisting that they’ve eradicated all possibility of failure.
As any psychologist can tell you, denial is a defense mechanism. While denying possible dangers to our system may temporarily save us from discomfort, routinely doing so can be infinitely more painful in the long run.
Leaders who want to discourage siloed thinking and denial, and instead encourage people to collaborate and confront the reality of risk, can do so by training their team members in industry best practices. Effective training can promote a sense of confidence in the face of uncertainty, because people feel empowered to tackle any hiccups that may arise.
Written by Leslie Sachs