Failure is certain. Prepare for it.
Products today are often built as distributed systems: specialized services communicating with each other over the network, built on top of a cloud provider's platform.
The more components you involve, the more likely it is that the system will eventually break somewhere.
Service providers meet their SLOs and sign SLAs, but no one is perfect, and furthermore no one claims to be.
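To put a rough number on that intuition, here is a minimal sketch. It assumes a request needs every component to be up and that components fail independently, which real systems only approximate. The 99.9% figure and the function name are illustrative assumptions, not values taken from any particular provider.

```python
def compound_availability(per_component: float, components: int) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes independent failures (an idealization) and the same
    availability for each component.
    """
    return per_component ** components


MINUTES_PER_30_DAYS = 30 * 24 * 60

# 0.999 is a hypothetical per-component availability chosen for illustration.
for n in (1, 5, 20, 50):
    availability = compound_availability(0.999, n)
    downtime = (1 - availability) * MINUTES_PER_30_DAYS
    print(f"{n:>2} components: {availability:.4%} available, "
          f"~{downtime:.0f} min of expected downtime per 30 days")
```

Under these assumptions, a single 99.9% component accounts for about 43 minutes of downtime per 30 days, while a chain of 50 such components falls to roughly 95% availability, even though every one of them meets its own target.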
A drop in reliability, although expected, can have a significant impact on your product, its capabilities, and ultimately your users and the business. It may also have a detrimental effect on you and your team.
A cascading failure can spread through your system, affecting all of its components. How does one thrive in such an environment? Can you avoid failure?
Thinking about redundancy during system design helps, but it won't be enough. Testers know how much you can learn about your product by exploring it. The same principle can be applied to finding nuances in the infrastructure, platform, and network: the fundamentals that you build your distributed system on.
Working on a distributed system requires a change in mindset. Failure is not an option. Failure is the default. We don't just react; we actively look for problems.
By exploring your system and experimenting with various scenarios, you can find problems you missed, learn things about your dependencies, and understand how users perceive failure, including how a feature you thought was optional is treated as core functionality.
Chaos engineering is one of the approaches you can take to explore how failure affects your product and system, starting from its "mechanical" parts, like services, moving on to the people who maintain the product, like you and me, and finally reaching real users.
You know how to explore software. Now let's learn how to explore system failures.