Building Evals that Work
How do you encode human judgment into a machine, and how do you know when it’s working?
Over the past few months, I’ve heard the same complaint from half a dozen AI founders across different sectors: their eval infrastructure doesn’t tell them why a customer interaction failed. The dashboards they wired up catch regressions but can’t explain them. When something breaks, the team ends up digging through raw traces by hand to reconstruct what happened.
I’ve been trying to understand why, and what the teams who’ve solved it are doing differently. Building reliable evals is a stress test for the deeper problem every AI team faces: how do you encode human judgment into a machine, and how do you know when it’s working?
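To make the gap concrete, here’s a minimal sketch of what an eval that explains itself could look like: instead of returning only a pass/fail bit, the harness asks a judge to attach a failure category and a one-sentence rationale, so a regression arrives with a hypothesis attached. Everything here (EvalResult, run_eval, the stub judge) is hypothetical scaffolding for illustration, not any particular team’s setup.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    passed: bool
    category: str   # e.g. "retrieval_miss", "tone", "hallucination"
    rationale: str  # the judge's one-sentence explanation

def run_eval(
    trace: str,
    rubric: str,
    judge: Callable[[str], EvalResult],
) -> EvalResult:
    """Score one interaction, keeping the judge's explanation
    rather than just the pass/fail bit."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Interaction trace:\n{trace}\n\n"
        "Return a verdict, a failure category, and a one-sentence rationale."
    )
    return judge(prompt)

# Usage: `judge` wraps whatever model you actually call; a stub stands in here.
def stub_judge(prompt: str) -> EvalResult:
    return EvalResult(False, "retrieval_miss", "Answer ignored the attached invoice.")

result = run_eval(
    trace="user: where is my refund? ...",
    rubric="Answers must cite the customer's account data.",
    judge=stub_judge,
)
print(result.category, "-", result.rationale)
```

The point of the sketch is the shape of the output, not the judge: once every failure carries a category and rationale, the dashboard can group failures by cause instead of just flagging that a score dropped.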
If you’ve built evals that hold up in production, I’d love to hear from you. I’ll publish a longer piece on this in the coming weeks.
Links
I like this idea of teams learning from each other by prompting in public. Will figure out how to implement it at Airtree and report back:
The future of onboarding:
Insights from China’s AI labs:
Jim Fan on the path for Physical AI; I liked the chart (pasted below) where he draws the parallel with LLM evolution:
This is how I built my Personal CRM:
The L1-L5 scale for embodied robotics (analogous to the AV scale):
Tracks
I can’t believe Spotify was already 3 years old when I joined; it had about 100 (terrible) songs when I signed up. My all-time tracks here - probably weighted towards 2016 listening:
Love Thylacine (h/t Nick Crocker, who introduced me to him)