Service Disruption: Gradient-Core NY2
Incident Report for Paperspace
Postmortem

Paperspace NY2 Network Outage Postmortem

Incident Summary

On June 16th, 2024, a significant number of virtual machines in the NY2 region lost network connectivity, became unavailable, and went into Read-Only (RO) mode. Network connectivity to the affected virtual machines was restored by 18:20 UTC the same day. However, the virtual machines remained unavailable because they were still in RO mode. They were returned to Read-Write (RW) mode between 05:58 and 12:23 UTC on June 17th, and by 12:24 UTC all affected virtual machines were available and customer access was restored.
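The durations implied by this timeline can be worked out with a short sketch. It assumes the approximate 10:30 UTC start time given in the Impact section; the variable names and the `hours_minutes` helper are illustrative only.

```python
from datetime import datetime, timezone

# Timestamps taken from this report. The start time is approximate
# ("around 10:30 UTC"), so the durations below are approximate too.
outage_start = datetime(2024, 6, 16, 10, 30, tzinfo=timezone.utc)
network_restored = datetime(2024, 6, 16, 18, 20, tzinfo=timezone.utc)
vms_available = datetime(2024, 6, 17, 12, 24, tzinfo=timezone.utc)


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


network_outage = hours_minutes(network_restored - outage_start)
total_outage = hours_minutes(vms_available - outage_start)

print("Network connectivity lost for about %dh %02dm" % network_outage)
print("Virtual machines unavailable for about %dh %02dm" % total_outage)
```

Under these assumptions, network connectivity was lost for roughly 7 hours 50 minutes, and the last affected virtual machines were unavailable for roughly 25 hours 54 minutes.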

Incident Details

Root Cause

The root cause of this service unavailability was a core switch failure in NY2, which caused a large number of machines to go into RO mode.

Impact

The failure of the core switch caused a spike in traffic that adversely impacted network performance. The virtual machines lost network access to the Network File System on which they resided and, as a result, went into RO mode. While in RO mode, the virtual machines could not accept write operations from customers, resulting in service disruptions beginning around 10:30 UTC on June 16th, 2024.
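For customers who want to check whether a Linux virtual machine has a filesystem mounted read-only, the following is a minimal sketch, not Paperspace tooling. It assumes a Linux guest and the standard `/proc/mounts` layout (device, mount point, filesystem type, options). The function name and the sample data are hypothetical.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/vda1 / ext4 ro,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
"""

print(read_only_mounts(SAMPLE))
```

On a running Linux machine, the same function can be called with the contents of `/proc/mounts`, for example `read_only_mounts(open("/proc/mounts").read())`. Some mounts are read-only by design, so the result needs to be interpreted against what is expected to be writable.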

Remediation Actions

Several efforts are underway to prevent these types of failures from recurring, including a network redesign and the installation of new equipment.

On behalf of Paperspace, we apologize for the disruption to your services and appreciate your understanding.  

If you have any questions or concerns, please open a ticket with our Customer Support team.

Posted Jun 21, 2024 - 14:47 EDT

Resolved
Our engineers have resolved the issue. If you continue to experience issues, please contact our Support Team.
Posted Jun 17, 2024 - 02:05 EDT
Monitoring
We've implemented a series of fixes and are monitoring the results. While Core is already fixed, the Gradient platform will continue to have degraded performance since it is still taking time to recover. We will continue working on fixes for it.
Posted Jun 16, 2024 - 20:11 EDT
Identified
The issue has been identified; it appears to be a problem with our network in NY2. A fix is being implemented.
Posted Jun 16, 2024 - 17:55 EDT
Update
We are continuing to investigate this issue.
Posted Jun 16, 2024 - 15:03 EDT
Update
We are continuing to investigate this issue.
Posted Jun 16, 2024 - 11:23 EDT
Update
We are continuing to investigate this issue.
Posted Jun 16, 2024 - 08:22 EDT
Update
We are continuing to investigate this issue.
Posted Jun 16, 2024 - 08:13 EDT
Investigating
We are currently investigating a network issue that is preventing users from interacting with Gradient notebooks and Core VMs based in the NY2 region.
Posted Jun 16, 2024 - 07:16 EDT
This incident affected: US (NY2) and Gradient.