Google Next: Chaos Engineering (aka Game Days)
Craig Jones from Arundo attended Game Days at Google Next.
I was lucky enough to catch the talk on Chaos Engineering by Jason Yee and Jeremy Garcia, given at Google Next 2019. A link to a video recording can be found at the bottom of the post.
There were many discussions on reliability at Google Next 2019, but the talk that really jumped out at me was the one about Chaos Engineering. Most engineers in the industry have probably heard of Netflix's infamous Chaos Monkey, but that approach isn't practical for critical services that cannot have unplanned downtime. However, Chaos Engineering can still serve as a learning and investigation tool for engineering groups of any size through the concept of “game days”.
WHAT ARE GAME DAYS?
The general gist of a game day is to plan and timebox a scoped investigation of failure modes for specific services. Three engineers participate in the game day: a software engineer (SWE) working on the targeted service, a DevOps engineer, and a junior engineer. The game day is announced to the entire engineering team well in advance to ensure that it will not cause a conflict.
The engineers spend 30 minutes planning how they want to attack the service; specific suggestions can be found in the talk. They then proceed to attack the service, record the results, and repair the service over a period of 50 minutes. Finally, the engineers should ensure the environment is repaired before spending 10 minutes summarizing the results and sending them out to all of engineering. I highly recommend that you watch the video for more detail about game days.
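To make the timeboxing concrete, here is a minimal Python sketch that turns the three phases above into a 90-minute agenda. The durations come from the description above; the phase names, the `build_agenda` helper, and the start time are my own illustrative assumptions and are not from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Phase:
    name: str
    minutes: int


# Durations follow the post (30 + 50 + 10 minutes); the names are illustrative.
GAME_DAY_PHASES = [
    Phase("plan the attack", 30),
    Phase("attack, record, and repair", 50),
    Phase("summarize and send the report", 10),
]


def build_agenda(start, phases=GAME_DAY_PHASES):
    """Return (name, start, end) tuples for back-to-back timeboxed phases."""
    agenda = []
    cursor = start
    for phase in phases:
        end = cursor + timedelta(minutes=phase.minutes)
        agenda.append((phase.name, cursor, end))
        cursor = end
    return agenda


if __name__ == "__main__":
    # Hypothetical start time, chosen only for the example.
    for name, begin, end in build_agenda(datetime(2019, 4, 10, 13, 0)):
        print(f"{begin:%H:%M}-{end:%H:%M}  {name}")
```

Publishing an agenda like this alongside the advance announcement gives the rest of engineering a clear window in which the targeted service may misbehave.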
Game days have a few major benefits beyond improving the resiliency of a service. The first is knowledge transfer. Services as complex as those at Arundo can be difficult to understand and thus difficult for DevOps to support. Having a SWE who worked on the service go through different failure modes with a DevOps engineer will give the DevOps team a deeper understanding of the service, hopefully preventing that SWE from getting a midnight call from DevOps. It will also highlight areas for improvement for the SWE to take back to the product team responsible for the service. Finally, it will provide mentorship to a junior engineer or cross-pollination with another team.
The next major benefit of game days is global awareness of a service. By highlighting how teams are breaking their services, it can help other product teams understand potential use cases of existing services. Arundo has many product teams working in parallel, and the pace of development means we are constantly working on cross-team communication.
A game day report sent to all of engineering is yet another avenue for cross-team communication. If one team is struggling to handle a heavy load, they may reach out through existing channels. But a game day report about a separate service standing up under heavy load may give that team insight into solving their own problem. Inviting a team to a lunch and learn is another great way to share this information.
Finally, game days will help the organization as a whole validate its solutions when products have a small number of major customers. At Arundo, some of our products are designed for use cases with a small number of massive entities, for example chemical plants. In these situations, the first customer we onboard will be a scalability challenge in itself. Doing game days during the development or validation phases of the software lifecycle allows us to build more robust systems for our clients. Game days are designed to break systems, which makes them a distinctly different process from regular QA, which should be left to validate that new features meet the requirements of our customers. At Arundo, we don't want to build software that is just good enough; it should be resilient, robust, and battle-hardened.
Hopefully, this has given you some insight into how small companies like Arundo can benefit from the concept of game days. I recommend you check out the video to learn more: OPS216: Chaos: Breaking Your Systems to Make the Unbreakable.