
Soul Stone Crafting Post-Mortem

Incident Summary

On May 6th, 2023, during a scheduled event at 8 AM (PT), our engineering team identified an issue with our system. It was first flagged by our internal monitoring systems, which registered an unexpected spike in a key performance metric.

Simultaneously, our user community began reporting errors across multiple communication channels, including Discord and various social media platforms. Within minutes of the issue arising, our team pinpointed the production database as the culprit: it was operating at full capacity, indicating a potential bottleneck.

The common issues observed were:

  1. Our production database was experiencing a high connection count, long query times, 100% CPU utilization on the writer node, and frequent SQL deadlocks (a hedged diagnostic sketch follows this list).
  2. An essential feature was not working, leading to a frustrating user experience.
  3. API response times were long.
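
For illustration only, the sketch below shows how symptoms like these could be confirmed against a MySQL-compatible writer node. The engine choice, host, credentials, and thresholds are assumptions made for the example, not details of our production setup.

```python
# Hedged diagnostic sketch: poll a MySQL-compatible writer node for the
# symptoms listed above (connection count, slow queries, recent deadlocks).
# Host, credentials, and the 5-second threshold are placeholders.
import pymysql

conn = pymysql.connect(host="writer.example.internal", user="ops",
                       password="***", database="mysql")
with conn.cursor() as cur:
    # Current connection count versus the configured ceiling.
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    _, connected = cur.fetchone()
    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    _, max_conn = cur.fetchone()
    print(f"connections: {connected}/{max_conn}")

    # Queries that have been running for more than 5 seconds.
    cur.execute(
        "SELECT id, time, LEFT(info, 80) FROM information_schema.processlist "
        "WHERE command != 'Sleep' AND time > 5 ORDER BY time DESC"
    )
    for thread_id, seconds, query in cur.fetchall():
        print(f"slow query {thread_id}: {seconds}s  {query}")

    # The InnoDB status text contains a 'LATEST DETECTED DEADLOCK' section
    # when a deadlock has occurred recently.
    cur.execute("SHOW ENGINE INNODB STATUS")
    status_text = cur.fetchone()[2]
    if "LATEST DETECTED DEADLOCK" in status_text:
        print("recent deadlock recorded; inspect SHOW ENGINE INNODB STATUS output")
conn.close()
```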


Despite immediate efforts to address the issue, we encountered additional performance impacts on May 7th. Our system was partially operational but was experiencing significant performance degradation and a high error rate, because it could not handle the increased load effectively. We were unable to pull together proper load testing on such short notice, which prevented us from fully resolving the issues encountered on May 6th.

The common issues observed on May 7th were:

  1. Some queries were processing, but with a noticeable performance impact.
  2. Our production database continued to experience a high connection count and prolonged query times.
  3. The writer node saw spikes of 100% CPU utilization.

These incidents underscored the need for a more robust and realistic testing environment capable of simulating high-traffic conditions, and for a review of our system's ability to scale and handle increased load effectively.
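
As one illustration of the kind of high-traffic simulation we have in mind, the sketch below drives a burst of concurrent requests at a staging endpoint and reports the error rate and latency percentiles. The URL, payload, and concurrency figures are placeholders for explanation, not a description of our actual test harness.

```python
# Hedged load-test sketch: send many concurrent requests to a staging
# endpoint and report error rate and latency percentiles. The URL, payload,
# and concurrency numbers are illustrative placeholders.
import asyncio
import statistics
import time

import aiohttp

STAGING_URL = "https://staging.example.internal/api/craft"  # placeholder
CONCURRENCY = 200       # simultaneous in-flight requests
TOTAL_REQUESTS = 5_000  # total requests for the run


async def one_request(session, semaphore, latencies, errors):
    async with semaphore:
        started = time.monotonic()
        try:
            async with session.post(STAGING_URL, json={"recipe_id": "demo"}) as resp:
                await resp.read()
                if resp.status >= 500:
                    errors.append(resp.status)
        except aiohttp.ClientError as exc:
            errors.append(str(exc))
        finally:
            latencies.append(time.monotonic() - started)


async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY)
    latencies, errors = [], []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            one_request(session, semaphore, latencies, errors)
            for _ in range(TOTAL_REQUESTS)
        ))
    latencies.sort()
    print(f"errors: {len(errors)}/{TOTAL_REQUESTS}")
    print(f"p50 latency: {statistics.median(latencies):.3f}s")
    print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Reporting error rate and tail latency together matters here: as the May 7th symptoms showed, a system can remain partially operational while degraded query processing and elevated error rates tell the real story.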

Investigation and Mitigation

Upon noticing the issue on May 6th, our engineering team quickly began their investigation to identify the cause. While we were able to mitigate some of the issues, the system still suffered from performance degradation. On May 7th, we experienced further performance impacts, with some queries processing slowly and a high error rate.

The team continued their intensive investigation and made further enhancements to tackle the problem. They focused on the SQL queries, which were showing signs of unnecessary complexity and placing a high load on the system. This was particularly challenging because the testing environment did not reproduce the issues that were arising in production.

Eventually, our team discovered that a randomizing function in our code was causing a significant increase in computing load. This function had unintentionally been off-loaded to a system not optimized for true randomness, and under high traffic it became a bottleneck at peak usage times.

Upon identifying the issue, our team worked diligently to retarget this function to a system optimized for efficient true randomness, which significantly reduced the system load and improved query processing times. Following these adjustments, our systems returned to normal operations.
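
To make the general pattern concrete (and only the pattern: the query, table, and weights below are hypothetical, not our actual crafting code), a random roll that is pushed onto the database, for example through an ORDER BY RAND() style query, can be retargeted to the application layer, which can draw operating-system-backed randomness cheaply:

```python
# Hedged sketch of retargeting a randomizing function. The "before" pattern
# and all names below are illustrative assumptions, not VeVe's actual code.
import secrets

# Before (illustrative): pushing the roll onto the database forces it to scan
# and sort candidate rows on every craft, which is expensive under load:
#   SELECT id FROM craft_outcomes WHERE recipe_id = %s ORDER BY RAND() LIMIT 1
#
# After (illustrative): fetch the small, cacheable outcome table once, then
# do the roll in application code with the OS-provided randomness source.

_rng = secrets.SystemRandom()  # non-blocking, OS-backed randomness


def pick_outcome(outcomes):
    """Pick one outcome from [(outcome_id, weight), ...] by weighted roll."""
    ids = [outcome_id for outcome_id, _ in outcomes]
    weights = [weight for _, weight in outcomes]
    return _rng.choices(ids, weights=weights, k=1)[0]


# Example: three hypothetical outcomes with relative weights.
print(pick_outcome([("common", 80), ("rare", 18), ("legendary", 2)]))
```

The design point of the sketch is simply that the roll no longer competes with crafting queries for database CPU during peak traffic.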

These incidents have highlighted the need for us to enhance our load testing environment and methods. We are committed to improving our systems and processes to ensure that they can handle increased load effectively when new features are introduced.

Root Cause and Lessons Learned

The root cause of the incident was identified as a combination of overly complex SQL queries and the inability of our database to scale under high-load conditions. The unnecessary complexity in these queries placed an excessive load on the system that had not been anticipated during the testing phase.
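
As a purely illustrative example of how query patterns translate into writer load (the table, columns, and functions below are hypothetical, and this is only one common shape of unnecessary complexity, not necessarily the queries involved here), issuing one query per item on a hot path multiplies round trips, connections, and lock time, while a single batched query keeps the work bounded:

```python
# Purely illustrative sketch: table and column names are hypothetical.
# The point is the shape of the load, not the actual queries from the incident.

def fetch_inventory_per_item(cursor, user_id, item_ids):
    """One round trip and one lookup per item: N queries for N items,
    each holding connection time on the writer while it runs."""
    rows = []
    for item_id in item_ids:
        cursor.execute(
            "SELECT item_id, quantity FROM inventory "
            "WHERE user_id = %s AND item_id = %s",
            (user_id, item_id),
        )
        rows.append(cursor.fetchone())
    return rows


def fetch_inventory_batched(cursor, user_id, item_ids):
    """A single round trip for the whole batch: the same data with a fraction
    of the connections, query time, and lock contention."""
    placeholders = ", ".join(["%s"] * len(item_ids))
    cursor.execute(
        "SELECT item_id, quantity FROM inventory "
        f"WHERE user_id = %s AND item_id IN ({placeholders})",
        (user_id, *item_ids),
    )
    return cursor.fetchall()
```

Under normal traffic both shapes look harmless; under a high-traffic event the per-item version multiplies connection count, query time, and opportunities for contention, which lines up with the symptoms described above.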

We learned valuable lessons from this incident. Although we had load testing measures in place, the incident highlighted areas for improvement in our testing capability, so that we can better anticipate and manage system load during high-traffic events. In particular, it brought to light the need to account for the impact of complex queries on system performance.

This incident also presents a valuable opportunity for us to undertake a comprehensive review of our system architecture. This is not a new initiative but rather a continuation of efforts that began several weeks ago. The goal is to ensure further resilience and extensibility of our platform, making it more adaptable to high-traffic situations and future requirements.

We see this incident as an important lesson in our ongoing journey towards system improvement and optimization. We are dedicated to learning from these experiences and continuously enhancing our systems and processes.

Going Forward

In response to the incident and as part of our commitment to continuous improvement, we are taking the following steps:

  • Crafting System Enhancement: We are working on enhancements to the crafting system, focusing on its performance under high-traffic conditions. Our goal is to ensure the system can efficiently handle large volumes of traffic without compromising on user experience or functionality.
  • Improved Load Testing Environment: While we have rigorous code and SQL reviews in place, this incident highlighted the need for an improved testing environment. We will enhance our testing processes to better emulate high-traffic conditions and thereby identify potential issues before they impact our production environment.
  • System Architecture Review: This incident has underlined the importance of our ongoing system architecture review efforts. We are dedicated to ensuring the resilience and extensibility of our platform to adapt to high-traffic situations and future requirements.


We deeply regret the inconvenience this incident caused our users. Our commitment to improving our systems and processes remains stronger than ever. The steps we have taken in response to this incident, and the lessons we have learned, will significantly contribute to enhancing the reliability and performance of our platform.

Our engineering team will continue to monitor the situation closely and ensure all systems are functioning as expected. We appreciate your understanding and patience during this time.

Thank you for your continued support. We are determined to learn from these experiences and continuously enhance our systems and processes to provide you with the best possible service.

May 17, 2023