Google has announced that a data center hosting part of its London cloud region suffered a “simultaneous failure of multiple, redundant cooling systems” during the recent record heatwave in the UK.
As temperatures soared to a record 40C/104F, organizations including Google, Oracle, and London-based Guy’s and St Thomas’ NHS Foundation Trust experienced outages.
According to Google’s incident report, on Tuesday, 19 July 2022, at 06:33 US/Pacific, a simultaneous failure of multiple redundant cooling systems in one of the London data centers that hosts the zone Europe-west2-a impacted multiple Google Cloud services. As a result, some customers experienced service unavailability for the impacted products.
Google apologized to the customers whose businesses were impacted during the outage. The company said this is not the level of quality and reliability it strives to offer, and that it is taking immediate steps, detailed in the report’s Remediation & Prevention section, to improve the resilience of the region.
Google said that one of the data centers that host the Europe-west2-a zone was unable to maintain a safe operating temperature because of the cooling failure combined with the extreme temperatures outside. It therefore shut down the facility to prevent further damage.
The company did not disclose the nature of the failure, but said its engineers are analyzing the cooling system failure that triggered the incident and will audit cooling system equipment and standards across the data centers that host Google Cloud globally.
Google powered down this part of the zone to prevent an even longer outage or damage to machines. This caused a partial failure of capacity in that zone, leading to instance terminations, service degradation, and networking issues for a subset of customers.
A large number of regional Google Cloud services were also impacted during the incident because Google’s team “inadvertently modified traffic routing” for internal services so that it avoided all three zones in the Europe-west2 region, rather than only the affected zone. As a result, the impact was not limited to the Europe-west2-a zone.
Regional storage services, such as GCS and BigQuery, replicate customer data across multiple zones. Because of the change to regional traffic routing, these services were unable to reach any replica for a number of storage objects, which prevented customers from reading those objects while the routing error was in place.
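To illustrate why a regional routing mistake can defeat multi-zone replication, here is a minimal sketch in Python. It is not Google's implementation: the zone names other than Europe-west2-a, the replica placement, and the helper functions are all assumptions made for illustration only.

```python
# Hypothetical model of zone-aware reads. Zone names "-b" and "-c" and the
# replica placement are assumptions, not details from the incident report.
REGION_ZONES = ("europe-west2-a", "europe-west2-b", "europe-west2-c")


def readable_replicas(replica_zones, avoided_zones):
    """Return the zones that hold a replica and are still routed to."""
    return [zone for zone in replica_zones if zone not in avoided_zones]


def can_read(replica_zones, avoided_zones):
    """An object is readable if at least one replica is reachable."""
    return len(readable_replicas(replica_zones, avoided_zones)) > 0


# An object replicated in two of the three zones (assumed placement).
object_replicas = ("europe-west2-a", "europe-west2-b")

# Intended change: steer traffic away from the overheated zone only.
intended_avoid = {"europe-west2-a"}
# Erroneous change described in the article: avoid every zone in the region.
erroneous_avoid = set(REGION_ZONES)

print(can_read(object_replicas, intended_avoid))   # True
print(can_read(object_replicas, erroneous_avoid))  # False
```

Under these assumptions, the replica in the healthy zone keeps the object readable when only the failed zone is avoided, while avoiding all three zones leaves no reachable replica regardless of how many copies exist.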
Google said it will repair and carefully re-test its failover automation.
The company will also investigate and develop more advanced methods to progressively decrease the thermal load within a single data center space, reducing the probability that a complete shutdown is required.
Beverley Bryant, chief digital information officer at Guy’s and St Thomas’ NHS Foundation Trust, explained in an internal video call seen by the BBC that the hospitals’ IT systems were taken down by “ludicrous heat,” which led to a failure of the data center’s air-conditioning. “The servers are not capable enough to handle the heat and they collapsed in an unmanaged and uncoordinated way,” she said.