Elevated number of errors in Storefront

Incident Report for Shoplazza

Postmortem

Shoplazza incident on September 19, 2023

In the early hours of September 19, 2023 (CST), the Shoplazza platform experienced an unexpected service interruption.

The issue stemmed from a prolonged network device malfunction affecting multiple availability zones within the Amazon Web Services (AWS) Oregon and Virginia regions. Because the Shoplazza platform is primarily deployed in the AWS Oregon region, we were directly affected.

Incident Timeline:

At around 02:00 CST on September 19, the Shoplazza technical team received the first system alert and observed unusual service interruptions and high latency. Our on-duty engineers immediately took emergency measures, including increasing capacity and restarting affected services. At the same time, we closely monitored AWS's progress in addressing the issue.

At 02:43, AWS issued a statement indicating that network disruptions had occurred in its Oregon (US West) and Virginia (US East) regions, and that multiple availability zones were impacted simultaneously.

At 03:33, AWS provided an update clarifying that there were network outages in several availability zones (AZs) within the Oregon region. These included AZa and AZb, where our production and backup servers were located.

By 04:23, our engineers had manually failed over the affected database instances to AZc. This marked the beginning of the gradual system recovery.

By 05:40, our engineers had reduced crawler requests in order to give higher priority to normal customer access. As a result, the platform's functionality gradually improved, and order volume recovered to 50%.

By 07:00, our engineers had continued to manually fail over critical services to AZc. Meanwhile, AWS made progress in repairing the network failure. The platform's order volume recovered to 80%.

By 10:00, our engineers had implemented multiple fixes while keeping a close eye on any remaining issues. This resulted in a 90% platform recovery, with most stores fully restored and able to accept orders without any issues.

At 11:30, our engineers reduced the load on certain high-demand services, such as multi-language functions. The platform achieved 100% recovery. All page access returned to normal, and all functions were again fully operational.

At 14:08, AWS provided another update stating that the network failure had been successfully resolved. Our on-duty team promptly verified the system's status and confirmed that the platform had completely returned to its normal state.
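For readers who want a concrete picture of the 05:40 step, the idea of deprioritizing crawler traffic in favor of normal customer access can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mechanism Shoplazza actually used: the `Request` type, the `CRAWLER_MARKERS` list, the `admit` function, and the `crawler_cutoff` value are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical substrings used to recognize crawler traffic by User-Agent.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


@dataclass
class Request:
    path: str
    user_agent: str


def is_crawler(request: Request) -> bool:
    """Classify a request as crawler traffic from its User-Agent string."""
    ua = request.user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def admit(request: Request, load: float, crawler_cutoff: float = 0.5) -> bool:
    """Decide whether to serve a request.

    `load` is the fraction of capacity currently in use (0.0 to 1.0).
    Customer requests are always admitted; crawler requests are admitted
    only while load stays below the (assumed) cutoff.
    """
    if is_crawler(request):
        return load < crawler_cutoff
    return True
```

Under degraded capacity, a policy of this shape sheds crawler requests first, so that shoppers browsing and placing orders keep access to the remaining capacity.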

Although Shoplazza has invested significantly in its infrastructure to improve service availability, we clearly fell short of our customers' expectations with this incident. We understand that this service disruption caused inconvenience and impacted many of our valued merchants and partners. In response, we are committed to implementing a cross-region high-availability solution. This involves establishing backup server facilities in regions outside US West and enabling multi-region capabilities to prevent similar incidents in the future.

At Shoplazza, the security and stability of our systems are paramount to our service. We are dedicated to learning from this incident and continuously improving the stability of our services to maintain the trust of our customers.

Again, we want to apologize to all the affected merchants and partners for the disruption they experienced. Your satisfaction remains our highest priority, and we are taking all necessary steps to ensure incidents like this are minimized in the future.

Shoplazza

September 27, 2023

Posted Sep 28, 2023 - 21:43 CST

Resolved

Between 02:00 and 12:30 CST, customers reaching Storefront would have experienced an elevated number of errors due to a VPC network issue in the AWS Oregon region.

Posted Sep 19, 2023 - 18:00 CST