Notes on building AI systems that survive contact with real users — evaluation, guardrails, and the unglamorous parts of shipping models into production.
Most AI feature launches skip the evaluation step entirely. They demo well, ship, and quietly hallucinate at customers. The eval doesn't have to be fancy. It does have to exist.
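To make "it does have to exist" concrete, here is a minimal sketch of an eval that fits in one file. Everything in it is hypothetical: the test cases, the `run_eval` helper, and `stub_model`, which stands in for whatever model client you actually call. The check is a plain substring match, which is crude but enough to catch regressions.

```python
from typing import Callable, List, Tuple

# Hypothetical cases: (prompt, text the answer is expected to contain).
CASES: List[Tuple[str, str]] = [
    ("What is your refund window?", "30 days"),
    ("Do you ship to Canada?", "yes"),
]


def run_eval(model: Callable[[str], str],
             cases: List[Tuple[str, str]] = CASES) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    passed = 0
    for prompt, expected in cases:
        answer = model(prompt)
        # Case-insensitive substring check; swap in a stricter grader later.
        if expected.lower() in answer.lower():
            passed += 1
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (an assumption, not a real API)."""
    canned = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
    }
    return canned.get(prompt, "I'm not sure.")


if __name__ == "__main__":
    print(f"pass rate: {run_eval(stub_model):.0%}")
```

Running the same cases before every release, against whichever model is wired in, is the whole point; the grader can get smarter over time.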
The agents that work in production tend to start tiny — one task, one human in the chair next to them, a tight feedback loop. The flashy demo can come after.
Picking the right LLM is more about your evaluation pipeline than about any single model's benchmarks. The model you can swap is more valuable than the model you can't.
Off-the-shelf chatbots hallucinate when asked about your business. The fix isn't a better model — it's retrieval, the plumbing around the model.
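As a rough illustration of that plumbing, the sketch below retrieves the most relevant snippet from a small document list and puts it into the prompt. The documents, the `retrieve` and `build_prompt` helpers, and the keyword-overlap scoring are all assumptions made for illustration; a production setup would more likely use embeddings and a vector index.

```python
import re
from typing import List, Set

# Hypothetical business documents the model has never seen.
DOCS: List[str] = [
    "Refunds are accepted within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "We ship to the United States and Canada.",
]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: List[str] = DOCS, k: int = 1) -> List[str]:
    """Return the k documents sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: List[str] = DOCS) -> str:
    """Put the retrieved context ahead of the question in the prompt."""
    context = "\n".join(retrieve(question, docs))
    return (
        "Answer using only this context:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("When are support hours?"))
```

The model stays the same; what changes is that the answer to a business question is already sitting in the prompt.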
For most mid-sized businesses, 2025 isn't going to be the year of AI adoption — it's going to be the year of AI audit. The tools have already arrived. Nobody's counted them yet.