When you're a QA engineer, trends hit differently. While everyone is excited about AI taking over the roadmap, your mind goes: how am I going to test this?
Testing an LLM-powered chatbot is like trying to train a parrot raised in a home library. He's read so many books and can give a million answers, but you never really know what to expect (I mean… it's just a parrot!). Then one day, you're expected to let him fly out the window and follow strangers' directions. How are you supposed to feel confident about that?
Forget whether strangers even want to talk to a parrot. Right now, your job is to make sure he gives correct answers, doesn't spill secrets, and generally behaves like a decent bird.
But how do you measure decency? How do you evaluate responses when your chatbot can give five different (and still reasonable) answers to the same question? Being technically correct isn't enough anymore. Now you have to ask (one way to automate these checks is sketched after the list):
- Is it personalized?
- Is it relevant?
- Is it aligned with our product?
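A common way to automate checks like these is LLM-as-judge: you ask a second model to grade each answer against a rubric. Here's a minimal sketch, assuming an OpenAI-style client; the rubric, model name, and `judge_response` helper are illustrations, not our actual setup:

```python
# A toy LLM-as-judge check. Everything here is a placeholder sketch:
# swap in your own client, judge model, and rubric.
import json
from openai import OpenAI  # assumption: the OpenAI Python SDK (>=1.0) is installed

client = OpenAI()

RUBRIC = """Score the ASSISTANT answer from 1 to 5 on each criterion:
- personalized: does it use what we know about this user?
- relevant: does it actually answer the question?
- on_brand: does it fit our product's tone and scope?
Reply with JSON only, e.g. {"personalized": 4, "relevant": 5, "on_brand": 3}."""

def judge_response(question: str, answer: str) -> dict:
    """Grade one chatbot answer against the rubric with a judge model."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works
        response_format={"type": "json_object"},  # ask for machine-readable scores
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nASSISTANT: {answer}"},
        ],
    )
    return json.loads(result.choices[0].message.content)

# In a test, you'd assert a minimum score instead of eyeballing transcripts:
scores = judge_response("Where is my order?", "Hi Anna! Order #123 ships tomorrow.")
assert all(score >= 3 for score in scores.values()), f"Rubric failed: {scores}"
```

The judge is itself a parrot, of course, so its grades need human spot-checks too; that's part of the automation-versus-human-review balance I'll get to.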
And then there are all the fun edge cases to worry about (a toy guardrail sketch follows this list):
- What if a user asks for 500 pig emojis in every output? Fun, sure, but how many tokens will that cost us?
- What if the chatbot starts leaking private data? We don’t want the parrot to end up in jail!
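Some of these worries are cheap to guard against automatically, before a human ever reads a transcript. Here's a minimal sketch, assuming tiktoken for token counting; the budget and regex patterns are illustrative placeholders, not what we actually shipped:

```python
# Cheap pre-flight guardrails: cap token spend and scan for leaked PII.
# The limit and patterns below are illustrative placeholders.
import re
import tiktoken  # assumption: tiktoken is installed for token counting

MAX_OUTPUT_TOKENS = 800  # 500 pig emojis would likely blow well past this
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # very rough credit-card shape

def check_output(text: str) -> list[str]:
    """Return guardrail violations found in one chatbot reply."""
    problems = []
    tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))
    if tokens > MAX_OUTPUT_TOKENS:
        problems.append(f"too expensive: {tokens} tokens")
    if EMAIL.search(text) or CARD.search(text):
        problems.append("possible PII leak")
    return problems

print(check_output("🐷" * 500))  # emoji are token-hungry, so expect a cost flag
print(check_output("Sure! The card on file is 4111 1111 1111 1111."))  # PII flag
```

Naive, yes, but checks like these catch the embarrassing failures fast and for free, leaving human reviewers for the subtle ones.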
The challenge is defining quality in your context and finding the sweet spot between automation and human review. I'll be happy to share how we approached testing, what worked (and what didn't), and what it took to train our parrot to behave in the real world 🦜