EWR1 Partial Power Outage
Incident Report for Equinix Metal
Postmortem

Reason for Outage

Start Date: 18:14, 12/17/2020, EST

End Date: 22:52, 12/17/2020, EST

Problem Location: Parsippany, NJ (EWR1)

Problem Description: Partial power outage

Outage Details

Root Cause

On December 17, 2020, Equinix Metal experienced a partial power outage in our EWR1 facility affecting multiple racks of customer servers as well as the site controller infrastructure supporting the facility.

During the outage, affected network and server hardware lost power. Additionally, some servers experienced extended disruption as network and application services recovered from the outage.

After the restoration of network connectivity, the incident response team investigated and resolved issues with individual servers. 

Our data center provider, Cologix, has provided the root cause of this incident. During scheduled maintenance to replace a fan on a UPS, several UPS fuses failed, resulting in a power disruption impacting the feed from power room E. Inspection showed that an internal fan had short-circuited, causing the unit to shut down.

Customer Impact 

Customer servers in scope of this outage were unable to access the internet and experienced significant performance degradation. 

Customers not directly affected by this outage may have experienced provisioning failures or delays in provisioning hardware in the EWR1 facility.

Future Mitigation

Equinix Metal has conducted a postmortem and confirmed there are two legacy rows in our EWR1 facility which have dual phase power fed from a single room. We will work with the vendor to schedule maintenance to transition these two rows to receive power from two rooms (B and E), which is how all other rows in our EWR1 facility receive power.

Equinix Metal will also conduct an audit of all other facilities to ensure all rows are receiving redundant power. 
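
As an illustration only, the check behind such an audit could be sketched as follows: given an inventory of which power room supplies each feed to a row, flag any row whose feeds all come from a single room. The inventory format, the row names, and the function name below are hypothetical and are not taken from Equinix Metal tooling.

```python
from collections import defaultdict

# Hypothetical inventory: one (row, power_room) pair per power feed.
# Row names are invented for illustration; rooms B and E come from the report.
FEEDS = [
    ("row-01", "B"), ("row-01", "E"),  # fed from two rooms
    ("row-02", "E"), ("row-02", "E"),  # both feeds from a single room
]


def rows_without_room_redundancy(feeds, min_rooms=2):
    """Return the rows fed from fewer than `min_rooms` distinct power rooms."""
    rooms_by_row = defaultdict(set)
    for row, room in feeds:
        rooms_by_row[row].add(room)
    return sorted(
        row for row, rooms in rooms_by_row.items() if len(rooms) < min_rooms
    )


if __name__ == "__main__":
    print(rows_without_room_redundancy(FEEDS))
```

A row with two feeds from the same room is still flagged, which is the condition described for the two legacy rows above.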

Timeline

All times are in EST.

18:13: Engineers receive alerts from monitoring showing that multiple services in the EWR1 facility are impacted

18:32: Engineers report power is being restored to the facility

19:22: Engineers confirm power is fully restored

20:24: Engineers restore customer VPN connectivity

21:50: Engineers confirm all Equinix Metal services have recovered and provisioning capability is restored

22:36: Engineers reload a single switch (esr1.b02.ewr1) which did not recover properly from the power disruption

22:43: Engineers return esr1.b02.ewr1 to service, restoring IPv4 and IPv6 connectivity

22:52: The incident team transitions the incident to a resolved state. Issues with individual servers are resolved outside of the incident
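
The timeline above is in EST, while the status updates in this report are stamped in UTC. For readers cross-referencing the two, a minimal conversion sketch follows. It assumes a fixed UTC-5 offset, which holds for EST in December, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for EST; daylight saving time does not apply in December.
EST = timezone(timedelta(hours=-5), "EST")


def est_to_utc(date_str, time_str):
    """Convert an EST date ("MM/DD/YYYY") and time ("HH:MM") to an aware UTC datetime."""
    local = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


if __name__ == "__main__":
    start = est_to_utc("12/17/2020", "18:14")
    end = est_to_utc("12/17/2020", "22:52")
    print(start.isoformat())  # 2020-12-17T23:14:00+00:00
    print(end.isoformat())    # 2020-12-18T03:52:00+00:00
    print(end - start)        # 4:38:00
```

The converted end time, 03:52 UTC on Dec 18, matches the timestamp of the "Resolved" update.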

If you have any questions, please reach out to support@equinixmetal.com for further assistance.

Posted Dec 18, 2020 - 17:31 UTC

Resolved
This incident has been resolved.
Posted Dec 18, 2020 - 03:52 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 18, 2020 - 02:52 UTC
Update
We are working to restore provisioning services in our EWR1 Facility.
Posted Dec 18, 2020 - 01:27 UTC
Update
All power has returned. Services are coming online.
Posted Dec 18, 2020 - 00:23 UTC
Identified
Services are starting to recover. We are still working with our facility managers to obtain the root cause of the partial power outage and to determine which specific racks are impacted.
Posted Dec 17, 2020 - 23:54 UTC
Investigating
We are currently investigating a partial power outage in our EWR1 Facility. Provisioning in this facility may fail or time out while we resolve the problem. This issue also appears to affect network connectivity at this site.


Please visit status.equinixmetal.com for further updates or reach out to support@equinixmetal.com for further assistance.
Posted Dec 17, 2020 - 23:33 UTC
This incident affected: North America (EWR1 Provisioning).