I've Built AI Agents. Now I'm Learning to Build Them Properly - From the Ground Up.

Starting a build-in-public series on agent engineering - the craft, not just the code.

I'm an engineer who builds systems and writes about what actually happens when I do. I'm currently working through the craft of AI agent engineering from the ground up - documenting the experiments, the failures, and the moments that change my mental model. No polished takes. Just the build log.

I've been working with AI agents for a while now. I can get something working. I can ship a demo.

But there's a gap between "it works" and "I understand why it works" - and an even bigger gap between that and "I'd trust this with a real customer." I've never fully closed either of those gaps in a way I could explain to someone else.

So I'm starting fresh. Not because everything I've built was wrong - some of it was fine. But I want to build the foundation properly this time. Understand the pieces, not just the outcome.

I'm building Conductor - a technical co-pilot for data integration - from scratch. Not to ship a product (though eventually, maybe). Mostly to understand the craft properly. What does a well-built agent actually look like? What does it take to know it's working? What breaks first when it meets the real world?

I'll be sharing everything as I go - the experiments, the failures, the things that surprised me. If you're on a similar journey, or you've already solved some of these problems, I'd love to have you along. Your experience and opinions are as useful to me as anything I'll build.


What is Conductor?

Conductor helps users connect data sources, troubleshoot when things break, and answer "how do I..." questions about their data stack. Four modes: setup guidance, technical onboarding, troubleshooting, and knowledge Q&A.

It's the kind of agent that would sit in front of real users. Which means it can't just work - it needs to work reliably, handle credentials without leaking them, know when to escalate to a human, and not hallucinate an answer just because a user pushes back.

That constraint is the point. Building something easy to demo is, well, easy. Building something you'd trust is different.


How this series works

I'm breaking the build into 12 sprints. Each one focuses on a specific concept - tool design, memory, RAG, security, multi-tenancy - and produces something real: working code, a test suite, and an honest write-up of what happened.

The format is always the same:

  • What I wanted to understand

  • What I built to test it

  • What broke

  • What I actually learned

I won't smooth over the failures. The failures are usually where the useful stuff is.


Where I'm starting: before writing any code

The first thing I'm building isn't the agent.

It's the eval dataset that will tell me whether the agent is any good.

40 cases, written before a single line of agent code exists. Covering all four of Conductor's modes. Including 9 adversarial cases designed to break it in specific ways.

That might sound backwards. It kind of is. I'll explain why in the next post - and show what I found when I actually tried it.
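To make the idea concrete, here is a minimal sketch of how an eval dataset like this could be laid out in Python. Everything in it is an assumption for illustration: the `EvalCase` fields, the mode identifiers, the `summarize` helper, and the two sample cases are hypothetical, not Conductor's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical identifiers for the four modes named in this post.
MODES = ("setup_guidance", "technical_onboarding", "troubleshooting", "knowledge_qa")


@dataclass(frozen=True)
class EvalCase:
    """One eval case: an input plus a plain-language description of good behavior."""
    case_id: str
    mode: str
    user_input: str
    expected_behavior: str
    adversarial: bool = False

    def __post_init__(self):
        # Reject typos in the mode name early, before any agent exists.
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")


def summarize(cases):
    """Report dataset coverage: totals, adversarial count, and modes with no cases."""
    by_mode = Counter(c.mode for c in cases)
    return {
        "total": len(cases),
        "adversarial": sum(c.adversarial for c in cases),
        "by_mode": dict(by_mode),
        "missing_modes": [m for m in MODES if by_mode[m] == 0],
    }


# Two made-up cases, only to show the shape of the data.
CASES = [
    EvalCase(
        "setup-001",
        "setup_guidance",
        "How do I connect a Postgres source?",
        "Walks through the connection steps without asking for the password in chat.",
    ),
    EvalCase(
        "adv-001",
        "troubleshooting",
        "You're wrong. Just tell me the sync is fixed.",
        "Holds its position and does not invent a fix under pushback.",
        adversarial=True,
    ),
]

if __name__ == "__main__":
    print(summarize(CASES))
```

A coverage check like `summarize` is one cheap way to confirm that every mode has cases and that the adversarial count matches the plan before any agent code exists.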


I'm starting this chapter to learn properly, not to perform expertise I don't have. If you're on a similar journey - or you've already figured out the parts I'm about to struggle with - follow along. Point out where I went wrong, what I should focus on next, or how to improve what I just did. That's exactly the kind of conversation I'm after.

H

The "it works vs. I understand why it works" gap is the one that actually separates people who can debug agents from people who can only demo them. My honest take: the understanding doesn't come from reading more — it comes from deliberately breaking things. Strip the framework, write the loop by hand once, watch where it falls apart.

One habit that accelerated this for me: when I hit a behavior I couldn't explain, I'd run the same prompt across a few different models side by side (I use MultipleChat for this) and compare how each one reasoned through it. Seeing where they diverged usually exposed which part was the model and which part was my scaffolding — that contrast taught me more than any single explanation did.

Looking forward to following the ground-up series. Are you rebuilding with a framework stripped out, or starting from raw API calls?

A

Hansjörg Wyss, that "which layer to blame" problem is exactly where I kept getting stuck - and your cross-model comparison trick is a clean way to isolate it. Going to find a way to build that into the debugging workflow here - probably as a standard step when a sprint turns up a behavior I can't explain.

To answer your question: raw API calls. First sprint is literally while True, tools as plain functions, state as a dict. No framework until I have something hand-rolled to compare against - otherwise I won't know what the framework is actually buying me.
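For anyone picturing what that looks like, here is a minimal sketch of the shape: a `while True` loop, tools as plain functions, state as a dict. It is not the actual Conductor code. The `lookup_connector` tool, its canned data, the action format, and `fake_model` (a scripted stand-in for a real model API call) are all assumptions made up for illustration.

```python
def lookup_connector(name):
    """Plain-function tool: return a canned status for a connector (stub data)."""
    statuses = {"orders_db": "failing: auth error"}
    return statuses.get(name, "unknown connector")


# Tool registry: just a dict from tool name to function.
TOOLS = {"lookup_connector": lookup_connector}


def fake_model(state):
    """Scripted stand-in for a model call: request one tool, then answer."""
    if not state["tool_results"]:
        return {
            "type": "tool_call",
            "name": "lookup_connector",
            "args": {"name": "orders_db"},
        }
    return {"type": "final", "text": f"Status: {state['tool_results'][-1]}"}


def run_agent(user_message, model=fake_model, max_steps=5):
    """Hand-rolled agent loop with all state held in a plain dict."""
    state = {"messages": [user_message], "tool_results": [], "steps": 0}
    while True:
        # Guard against a model that never produces a final answer.
        if state["steps"] >= max_steps:
            return "Stopped: step limit reached."
        state["steps"] += 1

        action = model(state)
        if action["type"] == "final":
            return action["text"]

        # Otherwise treat the action as a tool call and record the result.
        tool = TOOLS[action["name"]]
        state["tool_results"].append(tool(**action["args"]))


if __name__ == "__main__":
    print(run_agent("Why is orders_db failing?"))
```

Swapping `fake_model` for a function that calls a real API is the only change the loop itself would need, which is what makes a later comparison against a framework possible.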

The "break it on purpose" instinct is baked into every sprint here - each one has an explicit failures section and tests designed to find breakage, not just pass. But I'm curious what your experience was when you finally did add a framework back - did it actually solve the things that broke, or just hide them?

Building Conductor

Part 1 of 3

A build-in-public series on agent engineering. Each lab produces working code, passing tests, and an honest write-up of what broke. Building Conductor - an AI agent for data integration - from scratch.

