AI Agent Evaluation: What, How, When
Most teams pick the AI agent evaluation tool their framework integrates with, then a quarter later notice that one corner of the evaluation space is covered and the rest is exposed. The market has 50+
Some incidents look minor on paper. A small single-digit percentage of instances affected, in a single AZ. And yet the user-visible outcome is that more than half of the service's transactions stop wo
Your system can handle 10,000 requests per second. But can it handle going from zero to 10,000 in one second? Peak traffic forces a design choice: what do you include in your scaling scope? Compute, d
Every architecture decision is a bet. A bet that your constraints won't change, that your assumptions will hold, that the trade-off you're making today won't haunt you in two years. Since 2022, I've b