Last week, on November 25th, cloud giant AWS experienced a major outage in its US-EAST-1 region, taking thousands of other online services offline for several hours. AWS has since explained the cause of the outage in a blog post. As expected, the problem originated in Amazon Kinesis.
Small addition of capacity to Amazon Kinesis
AWS stated that a small addition of capacity triggered the problem. The capacity was added to the front-end fleet of Kinesis servers. These servers maintain a cache of information called a shard-map, which includes membership details and shard ownership for the back-end clusters. When capacity is added, the servers already in the fleet learn of the new servers joining and establish the appropriate threads; it can take up to an hour for them to learn of new participants.
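To make the shard-map idea concrete, here is a minimal sketch of such a cache, assuming it holds nothing more than a set of fleet members and a mapping from shard to back-end cluster. The class name, method names, and identifiers such as `frontend-1` and `backend-cluster-a` are hypothetical and are not taken from AWS's description; the real structure is certainly more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ShardMap:
    """Toy front-end cache: fleet membership plus shard ownership (hypothetical)."""

    members: Set[str] = field(default_factory=set)             # known front-end servers
    shard_owner: Dict[str, str] = field(default_factory=dict)  # shard id -> back-end cluster

    def add_member(self, server: str) -> None:
        # Record a front-end server that has joined the fleet.
        self.members.add(server)

    def assign(self, shard_id: str, cluster: str) -> None:
        # Record which back-end cluster owns a shard.
        self.shard_owner[shard_id] = cluster

    def route(self, shard_id: str) -> Optional[str]:
        # Return the owning cluster, or None when the cache has no entry,
        # in which case the request cannot be routed.
        return self.shard_owner.get(shard_id)


shard_map = ShardMap()
shard_map.add_member("frontend-1")
shard_map.assign("shard-0001", "backend-cluster-a")

print(shard_map.route("shard-0001"))  # backend-cluster-a
print(shard_map.route("shard-9999"))  # None
```

The second lookup shows why an incomplete cache matters: a server whose shard-map is missing entries has no way to decide where a request should go.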
Although the additional capacity was the suspect, when the team began reviewing logs, they found a number of errors that appeared unrelated to the new capacity and would likely persist even if it were removed. The variety of errors slowed the diagnosis while the team began removing the new capacity. A few hours later, the team narrowed the root cause down to a couple of candidates and determined that any of the most likely sources of the problem would require a full restart of the front-end fleet.
Nearly 7 hours after the capacity was added, the team confirmed the root cause: the new capacity had caused servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. Because this limit was being exceeded, cache construction was failing to complete, and servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters. The team removed the additional capacity that triggered the event, determined that the thread count would no longer exceed the operating system limit, and proceeded with the restart.
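The arithmetic behind this kind of failure can be sketched in a few lines. The example below assumes each server needs one thread per other server in the fleet plus a fixed baseline for everything else; that model and every number in it are illustrative assumptions, not figures published by AWS.

```python
def threads_needed(fleet_size: int, baseline_threads: int) -> int:
    # Assumption: one thread per *other* server in the fleet, plus a fixed baseline.
    return baseline_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, baseline_threads: int, os_thread_limit: int) -> bool:
    # True when a server in a fleet of this size would need more threads than the OS allows.
    return threads_needed(fleet_size, baseline_threads) > os_thread_limit


def max_safe_fleet_size(baseline_threads: int, os_thread_limit: int) -> int:
    # Largest fleet size for which threads_needed() stays within the limit.
    return os_thread_limit - baseline_threads + 1


# Hypothetical values chosen only to show the threshold effect.
OS_THREAD_LIMIT = 10_000
BASELINE_THREADS = 500

for fleet_size in (9_400, 9_600):
    needed = threads_needed(fleet_size, BASELINE_THREADS)
    over = exceeds_limit(fleet_size, BASELINE_THREADS, OS_THREAD_LIMIT)
    print(f"fleet={fleet_size} threads={needed} over_limit={over}")

print(max_safe_fleet_size(BASELINE_THREADS, OS_THREAD_LIMIT))
```

Under this model a modest increase in fleet size pushes every server over the limit at roughly the same time, which fits the account of a small capacity addition affecting the whole front-end fleet.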
AWS also stated,
“Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end-users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”