Friday, November 22, 2024

Fast and automated: Global tech outage shows hazards of cloud software updates

“Faulty patches and updates happen all the time. What’s different now is that the scale of these cloud services are so massive,” said Lee McKnight, an associate professor in the School of Information Studies at Syracuse University.

Friday’s CrowdStrike update, designed to refresh code in its threat-detection software, contained flaws that prevented Microsoft Windows-based systems from starting. The resulting outages disrupted businesses across the world, including airlines, hospitals, stock exchanges, banks and media companies, as well as some U.S. government agencies.

CrowdStrike said it was working with customers and issued steps to fix crashed systems. “We understand the gravity of the situation and are deeply sorry for the inconvenience and disruption,” the company said on its website. CrowdStrike says it has 29,000 customers, including nearly 300 members of the Fortune 500.

A Microsoft spokesman said the company was assisting customers with recovery efforts and that the CrowdStrike issue is unrelated to an earlier outage of its Azure platform, which has been resolved.

Recovery will likely take several days at large organizations with hundreds or even thousands of computers down, said Mike Walters, president of Action1, which sells patch-management software. The steps CrowdStrike advised customers to take are manual, requiring 15 to 30 minutes for each machine, he said.
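To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. Only the 15-to-30-minute range per machine comes from Walters' estimate; the fleet size and number of technicians are hypothetical, and the calculation assumes technicians work in parallel at a steady pace.

```python
def recovery_hours(machines, minutes_per_machine, technicians):
    """Hours of continuous work to fix every machine by hand.

    Assumes technicians work in parallel at a constant rate
    (a simplification; real recoveries involve travel, shifts and retries).
    """
    total_minutes = machines * minutes_per_machine
    return total_minutes / technicians / 60


# Hypothetical organization: 5,000 machines down, 20 technicians available.
low = recovery_hours(5000, 15, 20)   # 15 minutes per machine
high = recovery_hours(5000, 30, 20)  # 30 minutes per machine
print(f"{low:.1f} to {high:.1f} hours")  # prints "62.5 to 125.0 hours"
```

Under these assumed numbers, the manual fix alone would occupy the team for roughly 60 to 125 working hours, which is consistent with a recovery measured in days rather than hours.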

Concentration of risk

CrowdStrike held about 15% of the security software market in 2023 based on revenue, second to Microsoft’s roughly 40%, according to research firm Gartner.

“There’s always issues with concentration risk,” said Neil MacDonald, a Gartner vice president. “The vendor providing the capability has a responsibility to deliver service that’s resilient.”

For security and IT leaders, that means putting extra scrutiny on software updates, even if they come from “trusted” vendors, instead of turning off automatic updates altogether.

Many major vendors, including CrowdStrike, use a process known as continuous integration and continuous delivery, a form of automation designed to cut the time it takes to integrate code changes into products. Updates, even if they contain faulty code, are then sent out via cloud-based systems to thousands of customers at once.

Many customers, in turn, allow automatic updates, having neither resources nor time to properly check new patches and versions from every provider.

Friday’s outage raises questions about that process.

Global economies rely on a handful of tech providers to test updates while under pressure to send them quickly, said Glenn Gerstell, senior adviser to the Center for Strategic and International Studies, a think tank.

“We’ve created a system that relies on third parties to provide updates you can’t see for yourself,” said Gerstell, a former general counsel for the National Security Agency.

Working within the system

Chief information officers and chief information security officers “need to assess where manual intervention makes sense as a layer on top of auto-updates,” said Andy Sharma, CISO of Redwood Software.

But turning off automatic updates could also leave companies more susceptible to cyberattacks, said Chirag Mehta, a cybersecurity analyst at Constellation Research. Instead, companies should invest in better software testing and work more closely with their key suppliers.
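One way to picture manual intervention as a layer on top of auto-updates is a policy that lets most devices update automatically but holds updates for a few designated critical groups until a person signs off. The sketch below is purely illustrative: the group names, the `Update` type and the `decide` function are hypothetical and do not reflect any vendor's actual product or API.

```python
from dataclasses import dataclass

# Hypothetical device groups where an administrator wants human sign-off.
CRITICAL_GROUPS = {"operating-room", "trading-floor"}


@dataclass(frozen=True)  # frozen makes instances hashable, so they fit in a set
class Update:
    vendor: str
    version: str


def decide(update, device_group, approved):
    """Return 'apply' or 'hold' for one update on one device group.

    `approved` is the set of Update objects a person has signed off on.
    """
    if device_group not in CRITICAL_GROUPS:
        return "apply"  # non-critical devices keep automatic updates
    # Critical devices wait until this exact update has been approved.
    return "apply" if update in approved else "hold"


update = Update(vendor="ExampleVendor", version="1.2.3")
print(decide(update, "office-laptops", set()))     # apply
print(decide(update, "operating-room", set()))     # hold
print(decide(update, "operating-room", {update}))  # apply
```

The trade-off Mehta describes still applies: every held update leaves the critical group unpatched for as long as approval takes.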

Daniel Barchi, chief information officer of the health system CommonSpirit Health, said his team paused the update for its more than 150,000 devices early Friday, about one hour after the issue was detected.

While dispatching staff to manually fix computers, CommonSpirit Health directed employees to use pen and paper to place orders and take notes. Core systems, including lab, imaging and electronic medical records, weren’t affected, Barchi said, but the health system couldn’t use them on computers running CrowdStrike.

Though it canceled some appointments and an elective surgery, CommonSpirit Health has restored operations to enough devices, and all of its hospitals and clinics are online, Barchi said.

“In some cases, we benefit by having vendor partners who are proactive in updating and upgrading,” he said. “Sometimes, in this case, it goes differently. And so the question is, what’s the trade off?”

Kim S. Nash contributed to this article.

Write to James Rundle at james.rundle@wsj.com and Belle Lin at belle.lin@wsj.com
