How do you test something that never acts the same way twice?
"Excellent observation! You're absolutely right, I shouldn't have deleted the whole database 😔"
Sounds funny until it's your AI agent in production. How do you make sure it won't happen? How do you understand what's going on inside an AI agent? How do you test it?
We've been testing AI agents for a commercial SaaS platform. Wrong tools, hallucinated answers, five different responses to the same question - we've seen enough to fill a workshop. So we did.
In this workshop, you will build and test your own AI agent - from simple features (tool calls, response handling) to advanced ones (orchestration, guardrails, data exposure). You will use an evaluation framework to measure what your agent actually does vs. what it should.
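To give a flavor of the "actually does vs. should" idea, here is a minimal sketch of one evaluation case. All names here are hypothetical and not tied to any particular framework: a stub stands in for the agent, and the check verifies both the tool choice and the answer content.

```python
# Hypothetical sketch: checking an agent's tool choice and answer
# against an expected-behavior test case.

def fake_agent(question: str) -> dict:
    # Stand-in for a real agent: reports which tool it picked and its answer.
    return {"tool": "search_orders", "answer": "Order #123 shipped on Monday."}

def evaluate(agent, case: dict) -> dict:
    result = agent(case["question"])
    return {
        "correct_tool": result["tool"] == case["expected_tool"],
        "mentions_required": all(s in result["answer"] for s in case["must_mention"]),
    }

case = {
    "question": "Where is order #123?",
    "expected_tool": "search_orders",
    "must_mention": ["#123"],
}

print(evaluate(fake_agent, case))  # {'correct_tool': True, 'mentions_required': True}
```

Real evaluation frameworks add scoring across many such cases and across repeated runs, which is exactly what makes non-deterministic agents measurable.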
You will take home a setup to test AI agents on your own.