Saturday, November 2, 2024

Google Cloud explains how it accidentally deleted a customer account


Earlier this month, Google Cloud experienced one of its biggest blunders ever when UniSuper, a $135 billion Australian pension fund, had its Google Cloud account wiped out due to some kind of mistake on Google’s end. At the time, UniSuper indicated it had lost everything it had stored with Google, even its backups, and that caused two weeks of downtime for its 647,000 members. There were joint statements from the Google Cloud CEO and UniSuper CEO on the matter, a lot of apologies, and presumably a lot of worried customers who wondered if their retirement fund had disappeared.

In the immediate aftermath, the explanation we got was that “the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.” Two weeks later, Google Cloud’s internal review of the problem is finished, and the company has a blog post up detailing what happened.

Google has a “TL;DR” at the top of the post, and it sounds like a Google employee got an input wrong.

During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again.
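To make the failure mode concrete, here is a minimal sketch of how a blank parameter can turn into a scheduled deletion. It is not Google's internal tool, whose code is not public. The names `PrivateCloud`, `provision_risky`, `provision_safe`, and the 365-day `DEFAULT_TERM_DAYS` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical default term; the real value is not stated in Google's summary.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloud:
    customer: str
    delete_on: Optional[date]  # None means no deletion is scheduled


def provision_risky(customer: str, term_days: Optional[int], today: date) -> PrivateCloud:
    """A blank term silently falls back to a fixed term with auto-deletion."""
    days = term_days if term_days is not None else DEFAULT_TERM_DAYS
    return PrivateCloud(customer, today + timedelta(days=days))


def provision_safe(customer: str, term_days: Optional[int], today: date) -> PrivateCloud:
    """A blank term means 'no expiry'; deletion is scheduled only when asked for."""
    if term_days is None:
        return PrivateCloud(customer, None)
    if term_days <= 0:
        raise ValueError("term_days must be positive when provided")
    return PrivateCloud(customer, today + timedelta(days=term_days))


if __name__ == "__main__":
    today = date(2024, 1, 1)
    print(provision_risky("example-customer", None, today))
    print(provision_safe("example-customer", None, today))
```

The difference is only in what "blank" is taken to mean: in the first function an omitted value quietly schedules a destructive action, while in the second the destructive path has to be requested explicitly.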

The most shocking thing about Google’s blunder was the sudden and irreversible deletion of a customer account. Shouldn’t there be protections, notifications, and confirmations in place to never accidentally delete something? Google says there are, but those warnings are for a “customer-initiated deletion” and didn’t work when using the admin tool. Google says, “No customer notification was sent because the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool, and not due to a customer deletion request. Any customer-initiated deletion would have been preceded by a notification to the customer.”

During its many downtime updates, UniSuper indicated it did not have access to Google Cloud backups and had to dig into a third-party (presumably less up-to-date) store to get back up and running. In the frenzy of the recovery period, UniSuper said that “UniSuper had duplication in two geographies as a protection against outages and loss. However, when the deletion of UniSuper’s Private Cloud subscription occurred, it caused deletion across both of these geographies… UniSuper had backups in place with an additional service provider. These backups have minimized data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.”

In its post-mortem, Google now says, “Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third-party backup software, were instrumental in aiding the rapid restoration.” It’s hard to square these two statements, especially with the two-week recovery period. A backup exists to be restored quickly, so either UniSuper’s backups survived the deletion but weren’t effective, leading to two weeks of downtime, or they would have been effective had they not been partially or completely wiped out.

Google stressed many times in the post that this issue affected a single customer, has never happened before, should never happen again, and is not a systemic problem with Google Cloud. Here’s the entire “remediation” section of the blog post:

Google Cloud has since taken several actions to ensure that this incident does not and can not occur again, including:

  1. We deprecated the internal tool that triggered this sequence of events. This aspect is now fully automated and controlled by customers via the user interface, even when specific capacity management is required.
  2. We scrubbed the system database and manually reviewed all GCVE Private Clouds to ensure that no other GCVE deployments are at risk.
  3. We corrected the system behavior that sets GCVE Private Clouds for deletion for such deployment workflows.

Google says Google Cloud still has “safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate,” and it confirmed these safeguards all still work.
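For readers unfamiliar with those three terms, the sketch below shows roughly how they can fit together. It is an illustration under stated assumptions, not a description of Google Cloud's implementation: `Resource`, `DeletionService`, and the 30-day `RETENTION` window are hypothetical names and values. Note that the notification here is sent no matter who initiates the deletion, which is the gap Google's post describes for the internal tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical grace period before a soft-deleted resource may be purged.
RETENTION = timedelta(days=30)


@dataclass
class Resource:
    name: str
    owner_email: str
    pending_delete_at: Optional[datetime] = None  # set while soft-deleted


@dataclass
class DeletionService:
    sent_notifications: List[str] = field(default_factory=list)

    def request_delete(self, res: Resource, initiator: str, now: datetime) -> None:
        """Soft delete: mark the resource and notify the owner, whoever initiated it."""
        res.pending_delete_at = now + RETENTION
        self.sent_notifications.append(
            f"to {res.owner_email}: {res.name} scheduled for deletion by {initiator}"
        )

    def restore(self, res: Resource) -> None:
        """Undo a soft delete during the retention window."""
        res.pending_delete_at = None

    def purge(self, res: Resource, now: datetime, approved_by: Optional[str]) -> bool:
        """Permanent removal: needs an elapsed window and a human approver."""
        if res.pending_delete_at is None or now < res.pending_delete_at:
            return False
        if not approved_by:
            return False
        return True  # a real system would destroy the data here


if __name__ == "__main__":
    svc = DeletionService()
    res = Resource("private-cloud-1", "owner@example.com")
    start = datetime(2024, 5, 1)
    svc.request_delete(res, initiator="internal-tool", now=start)
    print(svc.sent_notifications)
    print(svc.purge(res, now=start + timedelta(days=1), approved_by="operator"))
```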
