Start Date: 18:14, 12/17/2020, EST
End Date: 22:52, 12/17/2020, EST
Problem Location: Parsippany, NJ (EWR1)
Problem Description: Partial power outage
On December 17, 2020, Equinix Metal experienced a partial power outage in our EWR1 facility affecting multiple racks of customer servers as well as site controller infrastructure supporting the facility.
During the outage, affected network and server hardware lost power. Additionally, some servers experienced extended disruption as network and application services recovered from the outage.
After the restoration of network connectivity, the incident response team investigated and resolved issues with individual servers.
Our data center provider, Cologix, has provided the root cause of this incident. During scheduled maintenance to replace a fan on a UPS, several UPS fuses failed, resulting in a power disruption impacting the feed from power room E. Inspection showed that an internal fan short-circuited, causing a unit shutdown.
Customer servers in scope of this outage were unable to access the internet and experienced significant performance degradation.
Customers not directly affected by this outage may have experienced provisioning failures or delays in provisioning hardware in the EWR1 facility.
Equinix Metal has conducted a post mortem and confirmed there are two legacy rows in our EWR1 facility which have dual phase power fed from a single room. We will work with the vendor to conduct a future maintenance to transition these two rows to receive power from two rooms (B and E), which is how all other rows in our EWR1 facility receive power.
Equinix Metal will also conduct an audit of all other facilities to ensure all rows are receiving redundant power.
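As an illustration only, the kind of check such an audit involves can be sketched as below: given an inventory of which power rooms feed each row, flag any row whose feeds all come from a single room. The inventory format, the row names, and the function name are hypothetical and do not describe Equinix Metal's actual tooling or data.

```python
def rows_without_redundant_power(row_feeds):
    """Return the rows that are fed from fewer than two distinct power rooms.

    row_feeds maps a row name to the list of power rooms feeding it
    (hypothetical inventory format, assumed for this sketch).
    """
    return sorted(
        row for row, rooms in row_feeds.items() if len(set(rooms)) < 2
    )


if __name__ == "__main__":
    # Hypothetical example data: one row fed from rooms B and E,
    # one row with both feeds coming from room E.
    example = {
        "row-1": ["B", "E"],
        "row-2": ["E", "E"],
    }
    print(rows_without_redundant_power(example))  # ['row-2']
```

Counting distinct rooms rather than feeds is the point of the check: a row with two feeds from the same room still loses power when that room does.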
All times are in EST.
18:13: Engineers receive alerts from monitoring showing that multiple services in the EWR1 facility are impacted
18:32: Engineers report power is being restored to the facility
19:22: Engineers confirm power is fully restored
20:24: Engineers restore customer VPN connectivity
21:50: Engineers confirm all Equinix Metal services have recovered and provisioning capability is restored
22:36: Engineers reload a single switch (esr1.b02.ewr1) which did not recover properly from the power disruption
22:43: Engineers return esr1.b02.ewr1 to service, restoring IPv4 and IPv6 connectivity
22:52: The incident team transitions the incident to a resolved state. Issues with individual servers are resolved outside of the incident
If you have any questions, please reach out to firstname.lastname@example.org for further assistance.