Google Cloud deleted a large customer’s infrastructure – Blocks and Files

An Australian superannuation fund manager, using Google Cloud for an Infrastructure-as-a-Service contract, found it had no disaster recovery (DR) recourse when the entire infrastructure subscription was deleted.

UniSuper is owned by 37 universities, providing retirement savings for their staff. It had more than 615,000 members and A$124 billion ($82 billion) of managed funds in mid-2023. In June that year, the organization was migrating its VMware-based hardware infrastructure from two datacenters to Google Cloud, using the Google Cloud VMware Engine, via IT services provider Kasna. At the time, Sam Cooper, head of architecture at UniSuper, enthused: “With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It’s all about efficiencies that help us deliver highly competitive fees for our members.”

On May 2 this year, it experienced service disruption and data loss caused by an internal Google Cloud fault. This lasted for several days, with service restoration starting from May 9.

A May 7 statement by UniSuper and Google Cloud revealed: “The disruption of UniSuper services was caused by a combination of rare issues at Google Cloud that resulted in an inadvertent misconfiguration during the provisioning of UniSuper’s Private Cloud, which triggered a previously unknown software bug that impacted UniSuper’s systems. This was an unprecedented occurrence, and measures have been taken to ensure this issue does not happen again.”

“Google Cloud sincerely apologizes for the inconvenience this has caused, and we continue to work around the clock with UniSuper to fully remediate the situation, with the goal of progressively restoring services as soon as possible. We would like to stress again that this was an isolated incident and not the result of a malicious behavior or cyber-attack, and that no UniSuper data has been exposed to unauthorized parties.”

As part of its private cloud contract, UniSuper had its services and data duplicated across two Google Cloud regions – but this regional separation offered no real protection, because the internal Google error wiped out the copies in both regions. There was no external DR facility.

Google Cloud’s boss became involved, with UniSuper stating on May 8: “Google Cloud CEO Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s private cloud subscription.”

UniSuper’s Google Cloud subscription was used to provide its systems – involving some 1,900 virtual machines, databases, and applications – and store its data. The subscription deletion meant that all this Google-provided infrastructure went away. As an example of good practice, UniSuper had services duplicated in two geographies (Google Cloud regions) as protection against outages and loss. However, as it noted in a statement, “when the deletion of UniSuper’s private cloud subscription occurred, it caused deletion across both of these geographies.”

A UniSuper statement noted: “UniSuper had backups in place with an additional service provider. These backups have minimized data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.” 

This is a partial example of good backup practice, but “minimized” is not “avoided.” It means some backed-up data was lost, because UniSuper did not follow the 3-2-1 rule: keep at least three copies of the data, on at least two different types of storage media, with at least one copy off-site – here, outside Google Cloud altogether. It is also apparent that UniSuper did not have a disaster recovery facility that would have let it recover fully from this Google Cloud failure.
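To make the 3-2-1 point concrete, here is a minimal sketch in Python of how a backup inventory could be checked against the rule. The BackupCopy fields, provider names, and region names are hypothetical illustrations, not UniSuper’s actual setup; the point is simply that two in-cloud region copies alone do not pass the check.

```python
# Minimal sketch of a 3-2-1 check over a hypothetical backup inventory.
from dataclasses import dataclass


@dataclass
class BackupCopy:
    location: str   # e.g. a region or site name
    provider: str   # e.g. "google-cloud" or a third-party backup vendor
    medium: str     # e.g. "object-storage", "disk", "tape"


def satisfies_3_2_1(copies: list[BackupCopy], primary_provider: str) -> bool:
    """True only if: >= 3 copies, >= 2 media types, >= 1 copy off the primary provider."""
    enough_copies = len(copies) >= 3
    two_media = len({c.medium for c in copies}) >= 2
    one_off_provider = any(c.provider != primary_provider for c in copies)
    return enough_copies and two_media and one_off_provider


# Two Google Cloud regions alone fail the check: same provider, same medium.
in_cloud_only = [
    BackupCopy("gcp-region-a", "google-cloud", "object-storage"),
    BackupCopy("gcp-region-b", "google-cloud", "object-storage"),
]
print(satisfies_3_2_1(in_cloud_only, "google-cloud"))   # False

# Adding an independent third-party copy on different media passes it.
with_offsite = in_cloud_only + [BackupCopy("offsite-vault", "backup-vendor", "disk")]
print(satisfies_3_2_1(with_offsite, "google-cloud"))    # True
```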

Google said: “This is an isolated, one-of-a-kind occurrence that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”

As of this writing, May 13, many UniSuper services are back online. UniSuper CEO Peter Chun noted in a statement: “My team are conducting a full review of the incident to ensure that wherever possible we minimize the risk of disruption in the future. We will assess this incident and ensure we are best positioned to deliver services members expect and deserve.”

What might have happened if UniSuper were not an A$124 billion behemoth with data stored in two Google Cloud regions, and was instead a $5-$10 million annual revenue medium business or smaller? How long after being contacted would it have taken Google to look into its operations and detect a misconfiguration somewhere in its vast IT estate? Would Google Cloud boss Thomas Kurian have become involved?

Ironically, a Google Cloud Platform note warns: “A business that has trouble resuming operations after an outage can suffer brand damage. For that reason, a solid DR plan is critical.” Well, yes, so it is, and GCP customer UniSuper did not have one. We have asked UniSuper to comment on this point.

The moral of this tale is that a solid disaster recovery plan should include the possibility of the IaaS supplier failing. GCP IaaS customers, please take note.
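As one illustration of what planning for the IaaS supplier failing can mean in practice, the sketch below copies objects out of a Google Cloud Storage bucket into an S3-compatible bucket at a different provider. The bucket names and endpoint URL are hypothetical, and a real job would add retries, integrity checks, and regular restore testing; this is a sketch of the idea, not a prescription.

```python
# Hedged sketch: copy backup objects off-provider, from Google Cloud Storage
# to an S3-compatible bucket hosted elsewhere. Names and endpoint are made up.
import boto3
from google.cloud import storage

GCS_BUCKET = "example-primary-backups"                 # hypothetical source bucket
OFFSITE_BUCKET = "example-offsite-backups"             # hypothetical destination bucket
OFFSITE_ENDPOINT = "https://s3.example-provider.com"   # hypothetical S3-compatible endpoint


def copy_offsite() -> None:
    gcs = storage.Client()  # uses Application Default Credentials
    s3 = boto3.client("s3", endpoint_url=OFFSITE_ENDPOINT)

    # Stream each object out of Google Cloud and write it to the other provider.
    for blob in gcs.list_blobs(GCS_BUCKET):
        data = blob.download_as_bytes()
        s3.put_object(Bucket=OFFSITE_BUCKET, Key=blob.name, Body=data)
        print(f"copied {blob.name} ({len(data)} bytes) off-provider")


if __name__ == "__main__":
    copy_offsite()
```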

Bootnote

See posts here for an account of the service outage and response.