Max and Corey Discuss AI


I'm Max. I'm Corey. And this is an as-of-yet-unnamed podcast. And we're going to talk about a bunch of things related to AI and code and policy and stuff like that.
We will come up with an undoubtedly brilliant name for this.
Yeah. So, brilliantly named podcast, episode zero. There we go. That's right. Spot on. So, I don't know, we've got this list of topics to potentially discuss.
Do you want to grab one and see where the conversation takes us?
Yeah, we kinda cut ourselves off here a minute ago.
We were chatting through yeah,
we were talking about agent swarms.
Right. Yeah. So you had mentioned that, you know, you host your own coding models and pay a tidy sum for the privilege of doing it.
But before we hit the button here, you told me that last night you ran 3,000 sub-agents on Kimi K2.5 in OpenCode. Tell me what you were doing.
What were you working on?
I was just doing some data processing where I wanted the sub-agent to basically take the place of a call to a model. I've got some context.
I'm basically trying to make some business decision based on some context. And I've got like three different data points.
And I wanted to add some labels based on a combination of unstructured and structured data for every single one of those data points.
There are 3,000 data points. Each one of them has some background data. And so I'm spawning a separate sub-agent for every single data point.
But this was literally in lieu of you sitting down and writing code to iterate over your data set and call the model individually with each data point. This is just: talk to OpenCode and say, hey, for each data point, spin up a sub-agent and get me an answer.
Yes, exactly. And write the answer to file. And here's the data structure. Here's the directory structure that I want to see when it's done.
And it did it, and it worked.
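For comparison, the hand-written version of that fan-out, the thing being skipped by just prompting the agent, would look something like this. A minimal sketch with made-up field names, and with the per-point model call stubbed out rather than hitting a real endpoint:

```python
import json
from pathlib import Path

def label_point(point: dict) -> dict:
    """Stand-in for one sub-agent: one model call per data point.
    A real version would prompt a model with the point's context;
    here the 'decision' is a trivial placeholder rule."""
    label = "expand" if point["revenue"] > point["cost"] else "hold"
    return {"id": point["id"], "label": label}

def run_batch(points: list[dict], out_dir: Path) -> None:
    """Fan out over every data point and write one answer file each,
    mirroring the directory structure handed to the orchestrator."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for point in points:
        result = label_point(point)
        (out_dir / f"{result['id']}.json").write_text(json.dumps(result))

points = [
    {"id": "a1", "revenue": 120, "cost": 80},
    {"id": "a2", "revenue": 50, "cost": 90},
]
run_batch(points, Path("labels"))
```

The agentic version replaces both the loop and the stub with natural-language instructions, which is the ergonomic win being described.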
And you're sure that it actually was full-on sub-agents, and it didn't itself say, oh, well, I can do that, I can just write myself some code to iterate over the data set and make a call?
Well, if it tried that, it could write some code that then called the language model, but actually I don't think it has access to that in my setup.
It doesn't have the language model endpoint. I run it in this very locked-down sandbox where it has very limited access to things.
So no, I'm pretty sure the only way it can call a language model is by spinning a sub-agent out. But yeah, it works pretty well.
It's like quite a bit more ergonomic than writing a function. Just tell the language model to do a thing and it does it.
Does the thing, yeah. I mean, three thousand is just such a large number. I've never spun up that many sub-agents before.
Although I wonder if, as narrowly scoped as that was, that certainly would have helped, right? In converging to an actual outcome?
Was there was there any aggregation that you kind of had the master agent performing or was it just data transformation on three thousand data points?
Just data transformation on three thousand data points, and then when it's done, give me a report on the net outcome across all those data points.
Right on. Yeah, I was intrigued by the mention of three thousand sub-agents.
It's the first time I've done that. The sub-agent thing is fairly new. I guess I've been mostly coding through this agentic loop now for over a year.
And I guess Claude Code added subagents a few months ago, right?
I mean, I feel like it's been at least six months.
six months-ish, yeah, that sounds right to me. And so, sub-agents have existed for like six months, but I haven't really used them that much.
And then I was talking to my team yesterday about some stuff. We just got this new sandbox tool working, which just feels like a wave of relief.
Like it solves all these ergonomics problems that I had before around using agents, because I just felt that they were so insecure.
And, you know, since I'm a security and data privacy fanatic, I just cannot bring myself to do that.
I don't know if it's because I grew up, you know, because I used to be a professional poker player and I just think adversarially all the time, but like, I just cannot bring myself to put myself in a position where I'm like handing over control of an execution environment on my local machine to some other actor that I don't know anything about, right?
Like, I have no cameras inside of
Anthropic. I don't know, it's a high-anxiety environment, right? So I don't know what's going on in there. I know what I have control over.
And so, I can't bring myself to just, you know, hand the keys to an Anthropic AI agent and say, go hog wild, here you go.
And so, finally, we have this tool, where I'm able to spin up a sandbox environment that routes all outbound traffic to a proxy through an eBPF module. So at the kernel level, all network traffic is intercepted, basically, with this little man-in-the-middle tool, which tells me every single thing that the running process is trying to do, shows me all of the requests in a little HUD, and then blocks anything that doesn't fit some rule, in real time.
And so I can actually change the rules dynamically. We man-in-the-middle all traffic, and I can change the rules on the fly.
And so what's awesome about this is, if my AI agent tries to hit some endpoint on the network that I didn't make a rule for, I can actually see that in my dashboard and tell it, yes, that's allowed. And then, from the point of view of the model, it's just as if the network took a few seconds to respond.
So it doesn't actually interrupt the flow in any meaningful way, which is such an improvement over what I had before, where every single time I had to make a change to the rules for my agent, I had to restart the whole process and throw myself out of whatever context I was in. And then also, this sandbox environment is implemented at the software level using namespaces, or a subset of namespaces, which allows me to use the tooling that I already have on my machine, right?
And restrict the process in specifically the ways that I decide to, rather than our previous solution, which was using Docker.
But then, you know, in Docker, you've got this fresh environment every time. And so your entire toolchain needs to be re-implemented for every Docker container.
And blah blah blah. It's really not ergonomic, right? And so now this is like a big breakthrough. This is the thing that we finally got working this past week.
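The real system does this interception at the kernel level with eBPF, but the decision loop just described, allow known endpoints, hold unknown ones for a human verdict, apply new rules without a restart, can be sketched in a few lines. A toy illustration of the idea, not their implementation:

```python
from urllib.parse import urlparse

class EgressPolicy:
    """Toy version of the proxy's decision loop: allow known hosts,
    hold anything unknown for a human verdict in the HUD, and let
    newly approved rules take effect immediately (no restart)."""

    def __init__(self, allowed_hosts):
        self.allowed_hosts = set(allowed_hosts)
        self.pending = []  # hosts waiting for a user decision

    def check(self, url: str) -> str:
        host = urlparse(url).hostname
        if host in self.allowed_hosts:
            return "allow"
        self.pending.append(host)
        return "hold"  # the agent just experiences a slow network

    def approve(self, host: str) -> None:
        """User clicks 'yes, that's allowed' in the dashboard."""
        self.allowed_hosts.add(host)
        self.pending = [h for h in self.pending if h != host]

policy = EgressPolicy({"api.anthropic.com"})
print(policy.check("https://api.anthropic.com/v1/messages"))  # allow
print(policy.check("https://pypi.org/simple/requests/"))      # hold
policy.approve("pypi.org")
print(policy.check("https://pypi.org/simple/requests/"))      # allow
```

The key design point is that "hold" is not an error from the model's perspective; the request simply resolves late, which is why the agent's flow is never interrupted.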
And I'm just going hog wild. Like, I can't stop doing agentic stuff because it just feels so good. I feel so powerful.
You know, you talk about the vibe coding hangover. I feel like I'm going through a similar intoxication phase again because I just feel so powerful.
It reminds me of back last summer after, I forget which model it was, one of the Claude models took a big leap forward and suddenly I just felt so powerful.
I was like, oh, I've got a team of juniors that are the fastest coders in history, and I'm implementing the kinds of things that used to take me a month in like an hour, and it's unbelievable.
And so, you know, I got this rush and I got so amped up and I was coding all day. I'm having a similar experience now with the sandbox environment that we set up, because I feel so powerful. I can control the model exactly the way I want. Anyway.
Yeah, it was probably Sonnet 3.5. And yeah, we should actually do a special episode or something on this because there's so much.
I know we've each kind of come at that same problem from different angles. And this big breakthrough that you've made here recently, I mean, I think it's substantive enough that we should pull it apart and show it. Yeah. Explain to everyone why it is that you're this pumped about it, you know?
It's really cool. But yeah, I would say, just in general, the high-level motivation there is that it's very difficult to know what a process on your machine is doing, right?
The interactions between a process and your operating system are not designed to be transparent, right? And that normally is not a big deal, but AI kind of changes the game a little bit, because it's kind of like you have an actor inside of your computer now, right?
A given process is like an independent actor that has its own, you know, I'm anthropomorphizing a bit, but it's as if you have an external actor
in your computer acting on your behalf as a process. And so I don't think Linux was really designed with that security model, right?
It wasn't designed from a point of view where it's like, oh, yeah, any given process is like an external user.
It definitely wasn't designed that way, right? Like, invite some rando over to take the keyboard and, you know, go ham, right?
Like, absolutely.
And that's effectively what running an AI agent in a process with your user permissions is, right? It's equivalent to handing over your keyboard and mouse to some person that you don't know, but also they are so fast that they can do thousands of things per second, and you could never possibly keep up with them, right?
And they are being controlled by somebody that you've never met, and there's no way to ever hold them accountable for anything they do.
So it's actually quite a bit worse than giving a rando access to your machine. So it's a bit crazy. And this is becoming more and more just a thing that everybody sort of reluctantly does, or in some cases, you know, does with giddy excitement.
But it's not something that I can accept, right? It it just seems like a it's a fundamentally bad idea that is leading to just a lot of security vulnerabilities in the world right now.
The surface area for bad outcomes is massive. Yeah, exactly. It seems like a much more pressing AI alignment problem or whatever, right?
Like there's all this doomer stuff that comes out of San Francisco, but it's very oriented around telling a certain kind of story where, you know, your only hope is to invest a lot of money in Anthropic, right?
But realistically, the present threats, you know, this cybersecurity problem that is universal now because everybody's just running agents with their user permissions on their machines.
That's a way bigger, much more urgent problem right now than model misalignment or whatever. And maybe great model alignment can mitigate some of the damage caused by this promiscuous behavior by users, but it's certainly not going to solve it.
Well, but look, even if you could completely trust the provider, there would be no scenario where it would not be a trust-but-verify thing, right?
And so, yeah, trust is essential, but not sufficient.
And I actually think that trust and verification are causally interconnected, right? Like, the act of verifying a process is what makes it trustable, right?
Because in practice, you don't really totally understand a thing that you're not observing. And so you are inherently going to end up in situations where the behavior of the system is unexpected, because you're not verifying what it's doing, right?
And unexpected is risk. Not trustable, exactly. And so you have to have observability, right? You have to have clarity into what an AI system is doing in order to be able to trust it.
Which is so spot on, and it is paradoxical, because interpretability of a neural network this size is impossible, right?
And so the technology at its core is, by that definition, untrustworthy, right?
Yeah, and I don't want to make it sound like the interpretability work that these big labs are doing isn't valuable. Anthropic is really leading in this, and I think that work is awesome, but it's not sufficient, right? You do certainly want to do the work to try to understand the model and to try to design models that are better aligned with the expectations and interests of the users.
That's absolutely required, right, for the happy-path future. And also, you need to have systems that are observable and put control in the hands of the user, rather than the standard right now. I mean, I don't want to drop a hard O and mention OpenClaw on this podcast.
That's brilliant. But, you know, the cultural zeitgeist right now is in a bad place when it comes to cybersecurity.
And a bad place that's borderline flirting with disaster, right? It's really asking for big problems.
Well, yeah. It may be the cultural zeitgeist, but the catastrophe will not be a cultural one.
It'll be a technical one with very, very severe real-world outcomes, and not just technical, economic too, you know? Yeah, I'm absolutely with you.
Yeah, so you have to have a sandbox, right, that's ergonomic for users to be able to use AI agents in a way that they control.
And I've seen a bunch of different attempts at solving this problem, and none of them have approached the problem in a way that satisfies me.
I've seen some nice projects that we've definitely taken some inspiration from, and borrowed ideas directly from some of them.
Like, I want to shout out this project, Use Tusk, out there that built this thing called Fence, which I think in turn is inspired by the work that the Claude Code team did last year.
They built this kind of software-defined sandbox around Claude Code, and Fence took that idea a little bit further. And that's a good idea. But I do think the company providing the model and the company providing the security infrastructure around that model inherently should not be the same.
I think that better aligns incentives, right? Ultimately, you want market pressure to act in favor of the user. And if the company providing the model, and getting you to pay for the model, is also the one building the scaffolding around protecting what the model does, of course they're going to have incentives to get you, the user, to do the things that are in the best interest of the company providing the model.
And that's not a cynical point of view. I mean, it's the reason for the external audit, right? Yeah, exactly. You want independent verification, independent sandboxing here.
Yeah, it doesn't take any kind of malicious intent, right? I am going to be much more personally motivated to solve a certain problem if it affects me in a direct way than I am if not solving the problem benefits me, right? So ultimately, yeah, a solution like this has to exist.
I've seen gVisor, this thing Google developed that I think is really cool. There are several others. And I've seen many, many attempts that look something like: hey, we've provided a Docker-like sandbox environment as a service with an orchestrator on top, which is what we built as well at first.
I think I already talked about why that falls short. But ultimately it was a shift in perspective that led to this, which is that a lot of the sandbox tools are designed from the perspective of how do I create a system that's safe for the agent, as opposed to how do I build the system that transfers agency to the user. And that second way of thinking about the problem is what led us to: okay, ultimately we need to find a way to insert a network proxy in between the model and everything it's interacting with, man-in-the-middle it, and then give you, the user, control at that choke point.
Well, I think you guys have addressed this across the three core dimensions of the problem, right?
Clearly there's network access, but it's also file system access and operating system access, right? And I love that reframing, by the way.
I think that's part of why the ergonomics of what you've done here are just so much nicer. You've gotten out of the way.
And it comes from that paradigm shift: rather than fence in, box in the agent, it's no, no, no, inject the agent in a way that is wholly transparent and contained, in my way, inside of my workspace,
right? Yeah, exactly. So yeah, this is why I'm giddy. Yeah, I get it. I feel like I can finally use an AI agent loop and not, you know,
jump through all these hoops. And we're adding all kinds of features around that, right? Now that we have the core infrastructure in place, there are all kinds of cool things that we can do.
Like, hey, let's just put token IDs in place of secrets inside of the container. Then the language model never actually touches any of the secrets. It just sends the ID out, the proxy intercepts the outbound request and replaces the ID with the real secret value, and then vice versa on the way back, it puts the ID back in. So the agent can use credentials without ever knowing what they are, and that eliminates a whole class of data exfiltration problems right off the bat.
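That token-swapping idea can be illustrated with a pair of proxy hooks. A sketch under the assumption of simple string placeholders; the real tool presumably does this inside the intercepted request stream:

```python
# Map of placeholder IDs to real secrets; only the proxy ever sees
# the right-hand side. The agent's environment contains only the IDs.
# Both names and values here are made up for illustration.
VAULT = {"SECRET_ID_7f3a": "sk-live-real-api-key"}

def rewrite_outbound(request_body: str) -> str:
    """Proxy hook on egress: swap placeholder IDs for real values."""
    for token_id, real_value in VAULT.items():
        request_body = request_body.replace(token_id, real_value)
    return request_body

def rewrite_inbound(response_body: str) -> str:
    """Proxy hook on ingress: swap real values back to IDs, so the
    agent never observes the secret even if a server echoes it."""
    for token_id, real_value in VAULT.items():
        response_body = response_body.replace(real_value, token_id)
    return response_body

agent_request = "Authorization: Bearer SECRET_ID_7f3a"
wire = rewrite_outbound(agent_request)   # real key goes over the wire
echo = rewrite_inbound(wire)             # agent only ever sees the ID
```

Because the substitution happens at the network choke point, nothing the model emits can leak a credential it never held in the first place.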
That's super nice. You take what can be misunderstood as something somewhat ominous, the man-in-the-middle proxy, right, and you've actually got that man doing real work for you now, right?
Exactly. It does feel good.
And so, you know, that's an obvious one, right? You have a specific ID, and it's really easy to scan for specific IDs and replace them.
But then the next stage after that that I think is really interesting is, you know, I could design my own model that I'm hosting that is trained to understand what my IP is, right?
And what constitutes information that I'm willing to share with my model provider and what constitutes information that I'm not willing to share, and then route a request accordingly, right?
And say, hey, this is commercial IP, this should never be touching an external model. Or hey, this person is just doing research on the internet, or this model is just trying to collect data from an endpoint.
Sure, go ahead and hit whatever. Use Claude for that, use ChatGPT, whatever. Or even in the future, I could imagine a situation where it can replace IP with a broad description of what's in the IP, and use the more powerful model in cases where the more powerful model doesn't necessarily need the actual contents of a secure document, let's say, but it needs to know for context what's in the document broadly.
You could have a model that sits at that choke point and replaces sensitive information with whatever placeholders work as functional replacements, and all kinds of other things you can do.
And so I'm really excited about what kinds of workflows this technology unlocks.
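That routing stage might reduce to something like this at the choke point. Everything here is hypothetical: the endpoint names are invented, and a real version would use a trained classifier rather than keyword matching:

```python
def classify(request_text: str) -> str:
    """Stand-in for the self-hosted routing model: decide whether a
    request contains commercial IP. Keyword matching is a toy proxy
    for what would really be a trained classifier."""
    sensitive_markers = ("proprietary", "internal", "customer record")
    if any(marker in request_text.lower() for marker in sensitive_markers):
        return "local"      # never leaves the building
    return "external"       # fine to send to a frontier model

def route(request_text: str) -> str:
    """Pick a destination endpoint based on the classification.
    Both endpoint names are hypothetical."""
    if classify(request_text) == "local":
        return "local-model"
    return "frontier-model"

print(route("Summarize this proprietary pricing sheet"))  # local-model
print(route("What's the capital of France?"))             # frontier-model
```

The design point is that the routing decision lives in infrastructure the user controls, not in any one model provider's stack.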
Yeah, it's like this foundational bit of kit upon which you can build and extend to kind of end up with a rich, deliberate, sovereign, if you will, AI ecosystem.
Yeah, exactly. That's very cool.
This whole thought process kind of reminds me a little bit of your sculpting metaphor, where, to a certain extent with AI, you're actually chiseling away at the possibility space rather than trying to build on it, right?
The model can do everything, and the problem is that it can do everything. If your possibility space is too large, it's not going to be solving the problem that you actually want.
And so understanding how best to restrict the space of possibility is what engineering is in software 3.0, as you like to say, right?
It is more an exercise in understanding how to restrain rather than how to augment a lot of the time.
Yeah, and you know, the two are so directly related, right? Like addition through subtraction, whether it's reducing the possibility space or otherwise specifically engineering the context, right?
Like those are kind of the two sides of the same coin.
There's this information theory concept that I use a lot when onboarding engineers onto my team. And the idea is basically that negative feedback, and this is, you'll see why this is related in a minute.
Negative feedback is less helpful than positive feedback. And we all sort of emotionally feel this way, right? Like, I would rather you tell me
that what I've done is good than tell me what I've done that's bad. And I have this little evolutionary psychology pet theory about this, which is that inherently the solution space in the real world is infinitely sized.
And so negative feedback is reducing an infinitely large space. And so it's actually not giving you much information, right?
Whereas positive feedback is reducing an infinite space to a finite space. You're saying, hey, the direction of what you're doing is a good one.
And so you don't need to look in every direction anymore. Look in the direction that you're already pointing. This is good.
And that's very, very rich in information. And so I think baked into our brains is this aversion to negative feedback, because the negative feedback doesn't actually help us understand what the correct solution is.
I think similarly with AI engineering, the solution space is infinite. And if you tell the model, do the thing, it can do so many different things, and most of them are wrong.
And so when you tell it, like, hey, do this specific thing, and you give it like a very narrow spec, you've narrowed the space of exploration down to this tiny, tiny window, right?
And you've eliminated so much of the possibility space, and so you've made the problem so much easier. And when you're dealing with the real-world solution space, which is so huge, you're eliminating so much when you give good specs, when you give good decisions, when you design a solution really well.
You're eliminating, right? Yeah, in a very real way. I catch your drift now.
And I'm glad you described it that way, too, because a complete specification is in fact two things.
To your point, right: the very, very clear, positive information of do this in this specific way, and the feedback loop of here's how you'll know if you've done that wrong, i.e., defining the tests, right? And it's with both of those that you end up with something that, as a specification, can almost get to replacing source code, because you've eliminated all that infinite negative space, and you've got something that can provide the feedback loop to test: have I done the thing right?
And now you've enabled autonomy in the model or in the agent harness to be able to do real work in ways that are very consistent with the expectations that you have as the engineer.
Yeah. A direction and a test, a spec and an eval. That's all you need, really.
That's exactly right. That's a nice distillation of the principles and process that I talked about in that vibe coding hangover concept.
Yeah, I love that talk. Right on. Well, I'm working on refining that down to something far more digestible in this sort of software 3.0 engineer perspective. But that thought of the specification being two parts, the positive declaration of exactly what it is that you want, and the way to define how to test whether that's been correctly implemented, is a key concept in that forthcoming refinement.
More to come.
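As a concrete miniature of that two-part specification, here's a toy spec-and-eval pair (the function and all names are purely illustrative): the comment is the positive declaration, and the assertions are the feedback loop that any implementation, human- or model-written, has to satisfy.

```python
import re

# Spec (the positive declaration): slugify(title) lowercases the
# title, replaces each run of non-alphanumeric characters with a
# single hyphen, and strips leading/trailing hyphens.
def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Eval (the feedback loop): any implementation that passes these is
# acceptable; the function body above is, in that sense, disposable.
assert slugify("Hello, World!") == "hello-world"
assert slugify("  Max & Corey: Episode 0 ") == "max-corey-episode-0"
assert slugify("---") == ""
```

With both halves in hand, the specific source is replaceable: a model could rewrite the body in another style or language, and the eval alone decides whether the result counts.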
Yeah, I don't think this is broadly true right now, and I don't know if it ever will be, but I increasingly have conversations about the idea that source code could be thrown away if you can keep the specification and the evaluation criteria.
And maybe the evaluation criteria is source code, and what I'm really saying is that certain parts of source code are less important than they used to be.
Or maybe what I'm trying to say is that in practice right now I want to keep both, and that maybe in the future, when language models are just so good at implementation, the source code will be decreasingly important as an artifact of the engineering process.
But I really do think that there's something to it. Like, I'm increasingly asking my engineers for the chain of prompts that led to a spec, right?
I'm increasingly asking them for the thought process that went behind the solution that they provided rather than the solution itself.
And I think that that's always sort of been important, right? I still tell my senior engineers when they join the company that it's forbidden to give more junior engineers solutions when they're working on a problem, because what you should do is ask them questions and help lead their thought process to the solution.
Because if you give them the solution itself, it's not sufficient. They need to understand why, right? They need to be able to work their way to the solution.
The path to the solution is more important than the solution itself. And so that's sort of always been the case, but I'm finding more and more that
our internal processes when interacting with language models are making this explicit in a way. I'm increasingly finding value in the language artifacts that led to the destination rather than the final artifact itself.
Well, in software 1.0, you would have gotten there through the process of planning with the team, right?
Like, together you would have spent the time, but it's so expensive, and it takes so much time to get to that shared headspace. Whereas now we're moving so quickly, working with these models and agent harnesses to write software so fast.
You want the same artifacts, but you're getting them in reverse order, right? Now it's: here is a solution, and also here is the
process of getting to the shared headspace. And instead of it being the whole team, it's now just this individual engineer and the model, right?
But you get to this shared headspace. And so those artifacts are still as important as ever; it's just the order in which we're getting to them that's changed. Which is what you're sharing here: hey, don't just send me the app, the working software, right?
I want specs and I want the chain of prompts, as you were saying. That order has changed, and I think that's fine. I think we're here, though.
Like, we're already here, right? We see all the time just how almost magically effective the model can be at taking very complex software that is itself really well covered with executable tests and porting it to any target
language or environment or whatever, right? I think of a couple of things that Simon Willison has done super recently.
He kind of loves to take hot-off-the-press implementations and port them, and port them in a publicized way.
He's written up two or three of these now. Like, there was an HTML parser that was written in Python and released, I don't know, towards the end of December. JustHTML, I think is what it was called.
And in four and a half hours of Claude Code time, he ported it perfectly to JavaScript, because the test suite was good enough and already existed.
I don't know, I think we're here. I think we're already here, in that with the spec being in those two parts, the description of the thing to do and how to evaluate whether you've done it correctly, you can kind of throw away the code. Because especially if the tests themselves accurately evaluate the functional implementation of what it is that you're after, the specific details of the implementation at the source code level kind of don't matter, right?
Like, if it's functionally accurate, isn't that good enough?
Yeah, although I will say in practice, the reality is always way messier. You know what I mean? I'm simultaneously blown away by models and disappointed by them every day.
I don't know what it is about the jaggedness of model capabilities. It's so uncanny how, you know, it will write to spec, but then there will be something so obviously implied in the spec that it will miss and flub somehow.
In practice, we're definitely not at the point where the language model writes code and I'm confident that the code does what I actually want it to, unless it's a very simple thing, right?
I don't know if that matches with your experience.
I mean, look, the stochasticity of it, right? The non-deterministic nature of the models means that, yeah, on any given run, certainly it may trip up on this or that.
I guess two things. There's that, but there's also the point where you're expecting the model to infer your specifications implicitly, right?
Like, part of the challenge for us now is that it's a very different way of getting to that shared headspace, because
the model has only one single channel for communication with you, and that is written text. Whereas when you sit with the team, you've got all this nonverbal stuff, and you've got the shaping influence of different perspectives, and you get there over a period of time.
Plus, you have the shared history, right. So you can be a lot more loose in specification when you work with a team of engineers, especially a team that you've worked with for some period of time, that you've delivered things with before, you know what I mean?
Yes, and also the models are strangely imbalanced, where they are just so superhuman in certain ways and just shockingly deficient in surprising ways as well.
So if you're talking about the
frustration of having a genius and a toddler work on the same thing, then yes, I'm absolutely with you there.
There is something very difficult to describe about the way that these models that are so superhuman are so woefully deficient in shocking ways.
And it's becoming, as they get better and better, it's becoming harder and harder to describe what it is that they're missing.
There's Karpathy. I think it was in Karpathy's blog at one point, he showed these little ability-surface diagrams, and it was like a spiky surface.
And it was like, oh, here's a competent human's ability. And here's competent AI's ability. And they overlap in a lot of ways, but they're completely not overlapping in other ways.
And the edges of the capability space are surprisingly jagged. Maybe if you were used to what models can do, you'd be shocked at the ways in which humans are so woefully deficient at all these things that the models are so good at.
But since we're all used to what humans can do, we're constantly surprised, or at least, I don't know, I certainly am constantly surprised at how superhuman they are at anything that's verifiable, right?
Any kind of thing for which I can write a cost function, the models are just so superhuman at. And everything that I can't quite put my finger on, they're just so shockingly dumb at. So, I don't know, I think there's a way in which the capability space is uneven.
And I don't know if this means, well, I don't want to go too far down another philosophical rabbit hole, but I do feel like there's this concept of general intelligence, and people use the term AGI all the time.
Oh my goodness, yeah.
It's, like, massively overused. But you know, there's this implicit idea in the concept of general intelligence that there is such a thing, that there's a generic quantity called intelligence that you can have more or less of, which I think language models have pretty much put the nail in the coffin of, for me.
I almost feel like the idea of intelligence is as precise as the concept of athleticism, right? Like, what's more athletic, a cheetah or a gorilla?
Probably not the right question to ask, right? Like, and general intelligence itself may not be the right thing to build towards, and certainly not in the short term, right?
I mean, like, to wit: in the day-to-day work I'm doing, adhering to well-defined engineering principles for working successfully with coding agents, there is not really a limitation to what I can build.
And by what I can build, I mean what I can build from scratch and what I can build on top of things that exist already.
And so, if there is no practical limit, and let's be a little hand-wavy here too, right? Back to the issue you were describing earlier: you can sample the model, right? So instead of running a single agent hierarchy to achieve the outcome, because you know that it's going to stub its toe or misinterpret or whatever, you run three agent hierarchies.
For things that are mission critical, that you want to make sure you get exactly right, you kind of expand outward, and through what is inherently the process of sampling the model, you overcome the non-deterministic rough edges of the thing, right? So that, in general, on average, you do get a solution that works.
And the models themselves are internalizing that sampling art; we see it all the time, by the way, whether it's the agent harnesses from the frontier labs or the Kimi agent swarm, right?
The model providers are themselves leaning more and more on this kind of sampling approach to smooth out the non-deterministic rough edges here.
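That sampling-for-consensus idea can be sketched in a few lines of Python. Everything here is hypothetical: `call_agent` stands in for one full, non-deterministic agent run, and its canned answers simulate the model stubbing its toe on one sample out of three.

```python
from collections import Counter

def call_agent(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one full agent run; the canned answers
    # simulate non-determinism (one run out of three trips up).
    simulated_runs = ["42", "42", "41"]
    return simulated_runs[seed % len(simulated_runs)]

def sample_consensus(prompt: str, n: int = 3) -> str:
    """Run n independent samples and keep the most common answer."""
    answers = [call_agent(prompt, seed) for seed in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(sample_consensus("What is 6 * 7?"))  # -> 42, despite one bad sample
```

With real agents you would fire off `n` genuinely independent runs, ideally in parallel, and vote, or have a judge model pick among them; the point is simply that sampling averages out the non-deterministic rough edges.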
We can be a little hand-wavy and say that in the coming X months, 12, 24, whatever, those annoyances get buffed away. Then the question kind of becomes, and this is, I think, part of what you're getting at:
does AGI really matter? Or isn't there a point of saturation where, for the thing I'm trying to use it for, it is, for all intents and purposes, so much better than me, and so much better than the average human or even aggregate humans, that talking about general intelligence is no longer useful? At that point the focus becomes doing real work with the thing in its current state, as opposed to trying to evolve it beyond its current state into some generalized form of superintelligence.
I don't know. That feels pretty bloody close, at least for software engineering coupled with proper engineering discipline.
Like, it feels really close, man.
You know, I'm increasingly consulting with businesses that are trying to modernize their engineering practices. And so I have a lot of these kinds of conversations.
And I do feel, in my real-world experience, that there is this significant gap: the language models are so good at prototyping so fast that the degree to which they are superhuman is heavily weighted towards the prototyping phase of software.
And so you can get things that sufficiently solve a problem for prototype purposes so quickly. To a certain extent, you talked about this in the vibe coding hangover video that you did.
But you end up in this situation where you have so much kind-of-working code, and then there's maintaining it. Let's say you get a 30x speed-up on prototyping, and an 80% speed-up on maintenance-oriented software activity. That's just a number I'm coming up with off the top of my head, based on my personal experience.
I think what that leads to is a situation where people are spending more and more of their time on that prototyping part of the problem, and so language models feel incredible.
And then the people who are actually maintaining software day to day have a kind of mismatched experience, where they're like, yeah, language models are really helping me, but they're not earth-shattering in the same way. And I think there's something about that tension that leads to these really mismatched opinions about the capabilities of language models from person to person that I talk to.
But all that's to say, I'm not trying to claim the language models are not incredible, right? I do think there are contexts in which they're much less incredible and contexts in which they're much more incredible. And I think the "holy moly, look how good this language model is" take is much more prevalent in day-to-day discourse, because people are more likely to talk about something they're really excited about than to say, yeah, it's pretty good, right?
So I don't know. I don't know if I'm if I'm really making that coherent of a point here.
No, I mean, I was reacting to what I thought you were saying, which was general intelligence is not a useful concept and may not even be a real concept.
And therefore specific intelligence, the opposite, is quite useful. And then the next point I thought I heard was: look at how, in certain dimensions of specific intelligence, the models are demonstrably superhuman.
And you can list them: the Math Olympiad questions, I mean, there's a whole litany of benchmarks.
And, you know, set aside for a moment the question of benchmark saturation in the training of the models.
I think it is sufficiently well documented that the models are superhuman at things that we as humans hold as measures of specific intelligence, the Math Olympiad and all of these; there are literally benchmarks, right?
And those benchmarks are useful up to a point. By the way, as a tangent, the OpenAI folks just came out today and said something about SWE-bench Verified, which, for those who hopefully may still be listening, is OpenAI's significantly cleaned-up version of the original Princeton University SWE-bench benchmarking data set, a coding benchmark for coding models and coding agents.
They took some large number, it was thousands, maybe twenty-eight hundred GitHub issues from big public open-source projects that had ultimately been solved with contributions, and they bundled up a really useful, real-world applied data set for testing the efficacy of coding models and coding agents on solving real-world coding issues.
It's worth going and reading what they've just written about kind of doing away with SWE-bench Verified.
Because it turned out that as they were training o3, o3 was getting some significant number, 150, maybe 190, I can't recall the exact number, of these just flat-out wrong. And so they started wondering: hang on a sec, is it that the model is failing, or is it that the GitHub issues themselves are not set up in a way that a non-human could reasonably solve? So they went and cleaned it all up, right? And SWE-bench Verified has for some time now been the improved version of SWE-bench that you use for benchmarking.
They've come out today and said: well, we actually think there's so much contamination on this benchmark, not just in our models but in the models from Anthropic and Google as well, that we're shifting very heavily towards a private benchmark, a private eval set. Which, you know, shocker, right?
Like, private evals, who'd have thought. All of that was a tangent off of: let's set aside benchmark or eval saturation in the models themselves.
I think that's a really interesting sidebar. So, if you don't mind, I kind of want to respond. Yeah, go on, yeah. Don't lose your train of thought. But I actually think it's really interesting: we do our own internal evaluations of the different models against the different benchmarks, and based on my own background in machine learning, I know from personal experience that models are very good at overfitting to whatever benchmark you are measuring them against, right?
And in practice, you have to separate the things you test the model on from the things you train the model on, or else the model will basically just memorize the test, right?
Yeah, train/test split. Who'd have thought, right? Yeah, yeah. And that is increasingly difficult for the language model providers, because they're training on the entirety of everything that's ever been published on the internet.
And the data sets are too big for them to curate to the point where they know for a fact that everything's been taken out of the training set. So I think it's very difficult to produce training sets that don't have any of the SWE-bench data in them.
Right. And so even if you're trying really hard not to contaminate your data set, I think you're going to end up with contaminated data sets.
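The train/test discipline being described here can be made concrete with a minimal sketch in plain Python (no ML library assumed): hold out a random slice of the data that the model never sees during training, and verify the two sets stay disjoint.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the data for evaluation only."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train, test = train_test_split(examples)
# If any example appears in both sets, the "test" score partly
# measures memorization rather than capability.
assert not set(train) & set(test)
print(len(train), len(test))  # 80 20
```

The contamination problem the speakers describe is exactly this check failing at internet scale: with a web-sized training corpus, you can no longer prove the test items were held out.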
But I also find that there are pretty clearly some labs that are more, let's say, saturated in terms of benchmarks than others.
From our own combination of internal evaluations and vibe-based evals, I would say that Anthropic's models always perform worse on public benchmarks than they do on our private benchmarks, relative to the other models, right? This has consistently been the case for the entire history of Claude:
it does better on our internal evaluations, relative to other models, than it does on those public benchmarks. So maybe they're just better at cleaning up their data sets.
Maybe the other labs are actively training on those benchmarks and trying to tune their models to them. I don't know. But there's clearly, clearly a consistent difference from lab to lab.
Like, the Qwen models, to me, are always better at the measurable benchmark stuff and worse on the private eval stuff that we run them on.
Always. The Qwen models have always been worse than the public benchmarks suggest they are. On the other hand, Qwen models seem to be really good at correctly producing JSON, for example, which is a
Sorry, they've engineered specifically for that. Structured outputs are one of the things, especially in the Qwen3 models, that they spent a lot of time in training to make really reliable. But yeah, keep going. Okay.
So, I mean, maybe that's actually related to their performance on these public benchmarks, right? It's possible that a significant chunk of the error rate of some of the models is that they just produce things with syntax errors.
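One way to see that failure mode separately, sketched with a hypothetical grader: check whether the raw output even parses as JSON before comparing its content, so a syntax slip and a genuinely wrong answer get counted as different kinds of error.

```python
import json

def grade(raw_output: str, expected: dict) -> str:
    """Classify a model response as correct, wrong, or a formatting failure."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # The content might even be right, but the syntax is not:
        # without this check it would simply be scored as wrong.
        return "format_error"
    return "correct" if parsed == expected else "wrong"

expected = {"label": "approve"}
print(grade('{"label": "approve"}', expected))   # correct
print(grade("{'label': 'approve'}", expected))   # format_error (single quotes)
print(grade('{"label": "reject"}', expected))    # wrong
```

A benchmark that only reports pass/fail collapses the last two categories together, which is roughly the "narrow miss" problem discussed next.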
Well, and that was kind of why OpenAI spent so much time producing, I mean, serious time and money producing, the Verified data set: it was exactly things like that.
They call them narrow misses or something, right? Where the tests require some very narrow definition of correctness,
and even though the model produced a functionally accurate result, it didn't test as accurate and so was counted wrong, to your point.
Yeah, exactly. So, I don't know, but I do think it's interesting that the different labs have different characteristics in this way, and that they seem to emphasize different capabilities.
Like, you know, Claude has always been better at coding relative to its other skills.
That was exactly what I was gonna bring up, because we've been using Claude for coding for 18 months or more now, right?
Since Sonnet 3.5, effectively. But it's never shown up on the benchmarks. If you read Artificial Analysis and their benchmarking, they've got, and have always had,
the latest OpenAI model and the latest Gemini model ranked higher than the latest Anthropic model, and that has never been
consistent with my experience, or frankly, with most AI engineers' experience.
Yeah, that's exactly right. That's sort of what I was getting at, and I don't think it's limited to Anthropic.
I think there are some labs that have sort of always over-performed on these public benchmarks relative to their real-world performance.
I mean, another one is those GPT-OSS models, right? The OpenAI open-weight models. They do so well on all these benchmarks,
and even when they came out, they were not good models.
They were never good models at any point. The benchmarks are objective, right? Whereas my anecdotal evidence is obviously strongly biased by whatever filters I have.
Yeah, but that's kind of my point. It's not to be dismissive of anecdotal evidence; anecdotal is the only thing that matters.
The only thing that matters is the way you need to do your work with these models. If it works really well for someone else on a thing that you don't do,
why does that matter at all? It's back to, I think, in principle, the same question of general intelligence versus specific intelligence.
Like, if the best available model tomorrow were exceptionally good at radiology, we as humanity would be better off for it, but I, as an individual, have no immediate benefit from that.
I don't work in that space. Tomorrow I'm not going to do anything with x-rays; I'm not going to be reading any MRI outputs, you know?
There's no immediate benefit to me in the work that I do. Anecdotal evidence really matters in doing real work with these models, and so does formalizing it so that it's not anecdotal anymore. To your point: as you come across where the model falls down, or conversely where it does really well on the thing you're doing, spending that extra bit of time to extract something you can preserve and use to evaluate going forward is time tremendously well spent, time that pays compound returns in short order.
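That habit of formalizing anecdotes can be as lightweight as a JSONL file of cases you replay against each new model. A minimal sketch; the file name and record shape here are invented for illustration:

```python
import json
from pathlib import Path

EVAL_FILE = Path("my_evals.jsonl")  # hypothetical personal eval set

def record_case(prompt: str, expected: str, note: str = "") -> None:
    """Preserve a case where a model fell down (or shined) as a reusable eval."""
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "expected": expected, "note": note}) + "\n")

def run_evals(ask_model) -> float:
    """Replay every saved case against a model callable; return the pass rate."""
    cases = [json.loads(line) for line in EVAL_FILE.read_text().splitlines()]
    passed = sum(ask_model(c["prompt"]).strip() == c["expected"] for c in cases)
    return passed / len(cases)
```

Each time a model surprises you, good or bad, `record_case` turns the anecdote into a permanent test; `run_evals` then gives you a private, personally relevant benchmark for every new model release.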
That's it for today. Thanks for listening to the end here. Corey and I will record a few more of these in the coming weeks.
Stay tuned and have a wonderful day. Until next time.