CrowdStrike and the risks of the silent update.
There are many hot takes on social media about the CrowdStrike disaster, along with criticism of local governments that relied on CrowdStrike extensively. Yes, a software monoculture offers less resilience to unpredictable disasters than having a variety of solutions in place to support operations. It’s smart to run more than one type of advanced threat detection (ATD) solution in your environment so that one can catch threats another might miss. But the notion that every endpoint should have had two ATDs running reflects a misunderstanding of what happened here and of what the potential solutions might be. CrowdStrike operates at the kernel level of the operating system; the botched update put systems into a reboot loop, so there was no opportunity to fail over to another ATD. And today, running two ATDs on a single device creates conflicts and can significantly degrade performance.
There’s no single solution here. To start, enterprise-level SaaS providers (not just ATD vendors) need to examine their update release protocols. There’s a trade-off between the intended effortlessness for the customer and the negative consequences when an update is poorly tested and rushed. Many of us have seen an update to a SaaS product blow up configurations and disrupt workflows. Now we have an example where entire industries and nations were shut down. New rule #1: The rollout of an update, of any magnitude, needs to be limited in size and geographic scope, with ample notification to customers. No matter the urgency of the update, notification should be standard procedure.
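To make that rule concrete, here is a minimal sketch, in Python, of what a staged, ring-based rollout gate with a notification step could look like on the vendor side. The ring names, sizes, bake times, and crash threshold are illustrative assumptions, not CrowdStrike’s (or any vendor’s) actual release process.

```python
# Minimal sketch of a ring-based (staged) rollout with a notification gate.
# Ring names, sizes, bake times, and the crash-rate threshold are hypothetical
# illustrations, not any vendor's actual release process.

import random
from dataclasses import dataclass, field


@dataclass
class Ring:
    name: str
    fraction: float   # share of the total fleet in this ring
    bake_hours: int   # minimum observation window before promoting further


@dataclass
class RolloutPlan:
    rings: list[Ring] = field(default_factory=lambda: [
        Ring("internal-canary", 0.01, 24),  # vendor's own machines first
        Ring("early-adopter", 0.10, 48),    # opted-in customer cohort
        Ring("general", 1.00, 0),           # everyone else, last
    ])
    max_crash_rate: float = 0.001           # halt if >0.1% of a ring falls over


def rollout(plan, fleet, notify, deploy, crash_rate) -> bool:
    """Deploy ring by ring; stop and report if any ring blows its crash budget."""
    remaining = list(fleet)
    random.shuffle(remaining)
    for ring in plan.rings:
        cohort_size = max(1, int(len(fleet) * ring.fraction))
        cohort, remaining = remaining[:cohort_size], remaining[cohort_size:]
        notify(ring.name, cohort)           # customers hear about it before it lands
        deploy(cohort)
        # A real pipeline would wait out ring.bake_hours and re-check telemetry;
        # this sketch just evaluates the crash rate once.
        if crash_rate(cohort) > plan.max_crash_rate:
            print(f"Halting rollout: crash budget exceeded in ring '{ring.name}'")
            return False
        print(f"Ring '{ring.name}' healthy after {ring.bake_hours}h bake; promoting")
    return True


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(10_000)]
    ok = rollout(
        RolloutPlan(),
        fleet,
        notify=lambda ring, hosts: print(f"Notify: ring '{ring}', {len(hosts)} hosts"),
        deploy=lambda hosts: None,          # stand-in for the actual push
        crash_rate=lambda hosts: 0.0,       # pretend telemetry shows no crashes
    )
    print("Rollout completed" if ok else "Rollout halted")
```

The specific numbers matter less than the gates themselves: each ring gets advance notice, a limited blast radius, and a health check before anything promotes to the next ring.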
This fiasco calls into question the value of the “silent” update and whether we should continue to accept it as the status quo. In my role as a big-city CIO, I demanded that local reps of key enterprise providers brief my Ops teams on the nature and timing of updates, so we could prepare ourselves and our user base to mitigate and respond to any downsides. Not all were willing to do so, but those who viewed the city as a partner complied. Collectively, governments can exert their purchasing power to demand these changes from SaaS providers, potentially deterring such widespread calamities in the future.
Secondly, this situation calls into question how large organizations should conduct a risk assessment of their IT operating environment. The all-eggs-in-one-basket scenario has its downsides. One might easily say, “Well, have two ATD systems running — staggered.” Or “Have two different operating systems or business productivity platforms running — staggered across your organization.” The cost and complexity of support, along with the strain on teams’ ability to work cross-functionally, make this option fairly untenable, but some versions of this approach are not without precedent. In the most effective E-911 systems, there are redundant radio and cellular communications for first responders to fall back on. Responders are trained for these scenarios, and equipment is often designed for such contingencies. In fact, Continuity of Operations (COOP) planning calls for organizations to envision, test, and implement procedures and (software/hardware) solutions for disaster recovery and continued services.
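For organizations that do want a staggered posture, the underlying idea can be stated simply: partition endpoints by critical role so that no single vendor’s agent, and therefore no single vendor’s update, covers every system performing that role. A minimal sketch, with hypothetical host names, roles, and vendors:

```python
# Sketch of staggering endpoint-protection vendors across critical roles so that
# one vendor's bad update cannot take out every system performing a given function.
# Host names, roles, vendors, and the even split are illustrative assumptions.

from collections import defaultdict
from itertools import cycle


def stagger_assignments(endpoints: dict[str, str], vendors: list[str]) -> dict[str, str]:
    """Rotate vendors within each role so every role keeps partial coverage
    if any one vendor's agent fails."""
    by_role: dict[str, list[str]] = defaultdict(list)
    for host, role in endpoints.items():
        by_role[role].append(host)

    assignment: dict[str, str] = {}
    for role, hosts in by_role.items():
        rotation = cycle(vendors)
        for host in sorted(hosts):
            assignment[host] = next(rotation)
    return assignment


if __name__ == "__main__":
    endpoints = {
        "dispatch-01": "e911", "dispatch-02": "e911",
        "records-01": "records", "records-02": "records",
        "finance-01": "finance", "finance-02": "finance",
    }
    for host, vendor in sorted(stagger_assignments(endpoints, ["VendorA", "VendorB"]).items()):
        print(f"{host:12s} -> {vendor}")
```

The invariant, not the exact split, is the point: no critical function should be fully exposed to a single vendor’s update pipeline.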
However, organizations will have to balance the costs and risks of such decisions. Buying CrowdStrike was one such risk calculation: it sits at the top of its quadrant for performance, and the threat risk of malware is extremely high. Now, so is the risk of a single SaaS update taking down a city or an entire industry.