Chaos Engineering in Practice

Full-Day Tutorial (6 hours)

Distributed systems are built of unreliable parts. Learn how your product, system, and team react to failure. Learn how to become antifragile.

Timetable

9:00 a.m. – 5:00 p.m. Monday 21st

Room

F-, E- & D-Rooms

Audience

Testers, Reliability Engineers, Developers, Architects

Required

Laptop.

Key Learnings

  • Learn how to model your system and spot crucial vulnerabilities.
  • Learn about vulnerabilities, failure modes, and mitigation techniques that will help you make your system more reliable.
  • Learn about service level metrics, observability, and what steady state is.
  • Learn how to run chaos experiments, induce failure in your system, and observe its impact on steady state.
  • Learn what a Chaos Bash is and which other Chaos Engineering plays you can choose from.
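The experiment loop named in the key learnings (establish steady state, induce failure, observe the impact, apply a mitigation) can be sketched as a small simulation. This is a hypothetical illustration, not material from the tutorial: the simulated service, the function names, the 99% success-rate threshold, and the retry mitigation are all assumptions chosen for the example.

```python
import random

# Hypothetical steady-state hypothesis: at least 99% of requests succeed.
STEADY_STATE_SUCCESS_RATE = 0.99


def call_service(fault_rate: float, rng: random.Random) -> bool:
    """Simulated call to a hypothetical dependency.

    The injected fault makes the call fail with probability `fault_rate`.
    """
    return rng.random() >= fault_rate


def call_with_retries(fault_rate: float, retries: int, rng: random.Random) -> bool:
    """Mitigation: retry a failed call up to `retries` extra times."""
    for _ in range(retries + 1):
        if call_service(fault_rate, rng):
            return True
    return False


def success_rate(fault_rate: float, retries: int, requests: int, rng: random.Random) -> float:
    """Service level metric: fraction of requests that eventually succeed."""
    ok = sum(call_with_retries(fault_rate, retries, rng) for _ in range(requests))
    return ok / requests


def run_experiment(fault_rate: float, retries: int = 0,
                   requests: int = 10_000, seed: int = 42) -> dict:
    """Run one chaos experiment against the simulated service."""
    rng = random.Random(seed)
    # 1. Measure steady state with no fault injected.
    baseline = success_rate(0.0, retries, requests, rng)
    # 2. Inject the fault and observe the same metric.
    under_fault = success_rate(fault_rate, retries, requests, rng)
    # 3. Check whether the steady-state hypothesis survived the fault.
    return {
        "baseline": baseline,
        "under_fault": under_fault,
        "hypothesis_holds": under_fault >= STEADY_STATE_SUCCESS_RATE,
    }


if __name__ == "__main__":
    print(run_experiment(0.05))             # no mitigation
    print(run_experiment(0.05, retries=2))  # with retries as mitigation
```

With a 5% injected fault rate, the unmitigated run lands near a 95% success rate and breaks the hypothesis, while two retries bring it back above the threshold. Note that the simulation treats each retry as an independent attempt. Real failures are often correlated, which is exactly the kind of assumption a chaos experiment against real infrastructure is meant to test.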

Failure is certain. Prepare for it.

Products today are often built as distributed systems: specialized services communicating with each other over the network, built on top of a cloud provider's platform. The more components you involve, the more likely it is that the system will eventually break somewhere. Service providers may meet their SLOs and sign SLAs, but no one is perfect, and no one claims to be. A drop in reliability, although expected, can have a significant impact on your product and its capabilities, and ultimately on your users and the business. It may also take a toll on your team and on you. A cascading failure can spread through your system and affect all of its components.

How does one thrive in such an environment? Can you avoid failure? Thinking about redundancy during system design helps, but it isn't enough. Testers know how much you can learn about a product by exploring it. The same principle applies to finding nuances in the infrastructure, platform, and network: the foundations you build your distributed system on.

Working on a distributed system requires a change in mindset. Failure is not an option. Failure is the default. Rather than only reacting, we actively look for problems. By exploring your system and experimenting with various scenarios, you can find problems you missed, learn about your dependencies, and understand how users perceive failure, including how a feature you thought was optional is treated as core functionality.

Chaos engineering is one of the approaches you can take to explore how failure affects your product and system, starting from its "mechanical" parts, such as services, moving on to the people who maintain the product, like you and me, and finally reaching real users. You know how to explore software; now let's learn how to explore system failures.