What is more important: a new feature, or non-functional requirements such as the stability and resiliency of a system? Opinions on the topic fuel emotional discussions between product owners and developers. Decisions on which stories go into the next sprint often favor new features because they pay the bills.
Whether that statement is true is beside the point. You can only spend a limited amount of time on resiliency and stability. Hardly any company can invest the time to introduce something like chaos engineering into its development workflow to deal with this issue. We need a lightweight alternative that is time-efficient but still delivers value.
When I worked as Lead System Architect at Runtastic, I introduced the “Resilience Games.” The idea was to hold a meeting with all interested members of an agile team and let them act like chaos monkeys. The participants artificially introduced failures based on a drawing on a whiteboard or an existing architecture diagram, not on the actual production system.
Here is how Resilience Games work
- You need a moderator. Software architects are great for that because they are not part of the team but know the system well.
- The moderator must prepare the “playing board.” A high-level architectural diagram that is already available is a good start. Depending on the common knowledge of the group, additional annotations (e.g., datacenter or rack) might be helpful.
- The moderator introduces the goal: “An ordered list of strategic topics that the team must address to improve the resilience and stability of the system.” Each entry on the list must state the likelihood of the event, the severity of the failure, and a time estimate for the fix.
- The moderator must insist that nobody is blamed for mistakes that team members made in the past. Look at the current state and find ways to improve it.
- The moderator must set a timebox. Two hours is a good timebox for a first session.
- To get everybody on the same page, a participant should give a general introduction to the architecture of the part of the system that the team owns. Set a timebox for the introduction.
- Let the participants decide what to “take down” first. Try to find the most likely and most severe cases first. Discuss what the issue is, what impact it has on the product, and roughly how to fix it. Set a timebox per issue.
- Take the last 10-15 minutes to order the list and estimate how much time it takes to resolve the issues.
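The resulting list can live on a whiteboard or in a spreadsheet, but a few lines of Python show one way it could be structured and ordered. This is only a sketch under assumptions that are not part of the Resilience Games themselves: the 1-to-5 scales, the `risk` score (likelihood multiplied by severity), the tie-break on the cheaper fix, and the three example scenarios are all hypothetical. Your team may well prefer to order the list by discussion instead.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    """One entry on the list produced during the meeting."""
    name: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (expected regularly)
    severity: int     # assumed scale: 1 (cosmetic) to 5 (full outage)
    fix_days: float   # rough time estimate for the fix, in person-days

    @property
    def risk(self) -> int:
        # Assumed scoring rule: likelihood multiplied by severity.
        return self.likelihood * self.severity


def prioritize(scenarios):
    """Order scenarios by risk, highest first; ties go to the cheaper fix."""
    return sorted(scenarios, key=lambda s: (-s.risk, s.fix_days))


# Made-up scenarios, purely for illustration.
scenarios = [
    FailureScenario("Primary database unavailable", likelihood=2, severity=5, fix_days=10),
    FailureScenario("Cache node restarts", likelihood=4, severity=2, fix_days=2),
    FailureScenario("Third-party login API times out", likelihood=4, severity=4, fix_days=3),
]

ranked = prioritize(scenarios)
for position, s in enumerate(ranked, start=1):
    print(f"{position}. {s.name}: risk {s.risk}, about {s.fix_days} days to fix")
```

Keeping likelihood, severity, and effort as separate fields, rather than storing only a final rank, makes it easier to explain to the PO why an item sits where it does.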
Although this meeting will become quite technical, having non-technical roles like the PO, designers, or QA in the discussion is beneficial. They will gain first-hand insights into the problems that developers see and fear. In turn, developers will explain the issues in less technical terms, making the outcome clearer to non-technical people.
The outcome that I saw in the “Resilience Games” was broader than I had expected. First, the list is of high value when the team needs to decide what to do next, based on facts. It helps to articulate the risk and importance of non-functional requirements to the PO.
The other thing that I experienced was a highly engaged meeting that aligned the whole team on implementation details and overall ideas. What surprised me most was that front-end specialists and back-end specialists started to align on the flow of features very frequently. This meeting seemed to give them a new perspective and basis for designing their implementations.
I guess that is an excellent example of the agile value: Individuals and interactions over processes and tools.
The “Resilience Games” are easy to do, fun, and don’t take much time. They provide value to the team and the product. Finding the most significant weaknesses of a system doesn’t require a full-fledged chaos engineering process. Give it a try.
If you want me to do a “Resilience Game” with your team, let me know. I’m happy to join you.