Max and Corey Discuss AI


I'm Max. I'm Corey. And this is an as-of-yet-unnamed podcast. And we're going to talk about a bunch of things related to AI and code and policy and stuff like that.
We will come up with an undoubtedly brilliant name for this.
Yeah. So, brilliantly named podcast, episode zero. There we go. That's right. Spot on. So, I don't know, we've got this list of topics to potentially discuss.
Do you want to grab one and see where the conversation takes us?
Yeah, we kinda cut ourselves off here a minute ago.
We were chatting through yeah,
we were talking about agent swarms.
Right. Yeah. So you had mentioned that, you know, you host your own coding models and pay a tidy sum for the privilege of doing it.
But before we hit the button here, you told me that last night you ran 3,000 sub-agents on Kimi K2.5 in OpenCode. Tell me what you were doing.
What were you working on?
I was just doing some data processing where I wanted the sub-agent to basically take the place of a call to a model. I've got some context.
I'm basically trying to make some business decision based on some context. And I've got like three different data points.
And I wanted to add some labels based on a combination of unstructured and structured data for every single one of those data points.
There are 3,000 data points. Each one of them has some background data. And so I'm spawning a separate sub-agent for every single data point.
But this was literally in lieu of you sitting down and writing code to iterate over your data set and call the model individually with each data point. This is just: talk to OpenCode and say, hey, for each data point, spin up a sub-agent and get me an answer.
Yes, exactly. And write the answer to file. And here's the data structure. Here's the directory structure that I want to see when it's done.
And it did it, and it worked.
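For comparison, the hand-written version of that fan-out, the thing being skipped by just prompting the agent, would look something like this. A minimal sketch with made-up field names, and with the per-point model call stubbed out rather than hitting a real endpoint:

```python
import json
from pathlib import Path

def label_point(point: dict) -> dict:
    """Stand-in for one sub-agent: one model call per data point.
    A real version would prompt a model with the point's context;
    here the 'decision' is a trivial placeholder rule."""
    label = "expand" if point["revenue"] > point["cost"] else "hold"
    return {"id": point["id"], "label": label}

def run_batch(points: list[dict], out_dir: Path) -> None:
    """Fan out over every data point and write one answer file each,
    mirroring the directory structure handed to the orchestrator."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for point in points:
        result = label_point(point)
        (out_dir / f"{result['id']}.json").write_text(json.dumps(result))

points = [
    {"id": "a1", "revenue": 120, "cost": 80},
    {"id": "a2", "revenue": 50, "cost": 90},
]
run_batch(points, Path("labels"))
```

The agentic version replaces both the loop and the stub with natural-language instructions, which is the ergonomic win being described.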
And you're sure that it actually was full-on sub-agents, and it didn't itself say, oh, well, I can do that, I can just write myself some code to iterate over the data set and make a call?
Well, if it tried that, it could write some code that then called the language model, but actually I don't think it has access to that in my setup.
It doesn't have the language model endpoint. I run it in this very locked-down sandbox where it has very limited access to things.
So no, I'm pretty sure the only way it can call a language model is by spinning a sub-agent out. But yeah, it works pretty well.
It's like quite a bit more ergonomic than writing a function. Just tell the language model to do a thing and it does it.
Does the thing, yeah. I mean, three thousand is just such a large number. I've never spun up that many sub-agents before.
Although I wonder if, as narrowly scoped as that was, that certainly would have helped, right? In converging to an actual outcome?
Was there was there any aggregation that you kind of had the master agent performing or was it just data transformation on three thousand data points?
Just data transformation on three thousand data points, and then when it's done, give me a report on the net outcome across all those data points.
Right on. Yeah, I was intrigued by the mention of three thousand sub-agents.
It's the first time I've done that. The sub-agent thing is fairly new. I guess I've been mostly coding through this agentic loop now for over a year.
And I guess Claude Code added subagents a few months ago, right?
I mean, I feel like it's been at least six months.
six months-ish, yeah, that sounds right to me. And so, sub-agents have existed for like six months, but I haven't really used them that much.
And then I was talking to my team yesterday about some stuff. We just got this new sandbox tool working, which just feels like a wave of relief.
Like it solves all these ergonomics problems that I had before around using agents, because I just felt that they were so insecure.
And, you know, since I'm a security and data privacy fanatic, I just cannot bring myself to do that.
I don't know if it's because I grew up, you know, because I used to be a professional poker player and I just think adversarially all the time, but like, I just cannot bring myself to put myself in a position where I'm like handing over control of an execution environment on my local machine to some other actor that I don't know anything about, right?
Like, I have no cameras inside of
Anthropic. I don't know, it's a high-anxiety environment, right? So I don't know what's going on in there. I know what I have control over.
And so, I can't bring myself to just, you know, hand the keys to an Anthropic AI agent and say, go hog wild, here you go.
And so, finally, we have this tool, where I'm able to spin up a sandbox environment that routes all outbound traffic to a proxy through an eBPF module. So at the kernel level, all network traffic is intercepted, basically, with this little man-in-the-middle tool, which tells me every single thing that the running process is trying to do, shows me all of the requests in a little HUD, and then blocks anything that doesn't fit some rule, in real time.
And so I can actually change the rules dynamically. We man-in-the-middle all traffic, and I can change the rules on the fly.
And so what's awesome about this is, if my AI agent tries to hit some endpoint on the network that I didn't make a rule for, I can actually see that in my dashboard and tell it, yes, that's allowed. And then, from the point of view of the model, it's just as if the network took a few seconds to respond.
So it doesn't actually interrupt the flow in any meaningful way, which is such an improvement over what I had before, where every single time I had to make a change to the rules for my agent, I had to restart the whole process and throw myself out of whatever context I was in. And then also, this sandbox environment is implemented at the software level using namespaces, or a subset of namespaces, which allows me to use the tooling that I already have on my machine, right?
And restrict the process in specifically the ways that I decide to, rather than our previous solution, which was using Docker.
But then, you know, in Docker, you've got this fresh environment every time. And so your entire toolchain needs to be re-implemented for every Docker container.
And blah blah blah. It's really not ergonomic, right? And so now this is like a big breakthrough. This is the thing that we finally got working this past week.
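The real system does this interception at the kernel level with eBPF, but the decision loop just described, allow known endpoints, hold unknown ones for a human verdict, apply new rules without a restart, can be sketched in a few lines. A toy illustration of the idea, not their implementation:

```python
from urllib.parse import urlparse

class EgressPolicy:
    """Toy version of the proxy's decision loop: allow known hosts,
    hold anything unknown for a human verdict in the HUD, and let
    newly approved rules take effect immediately (no restart)."""

    def __init__(self, allowed_hosts):
        self.allowed_hosts = set(allowed_hosts)
        self.pending = []  # hosts waiting for a user decision

    def check(self, url: str) -> str:
        host = urlparse(url).hostname
        if host in self.allowed_hosts:
            return "allow"
        self.pending.append(host)
        return "hold"  # the agent just experiences a slow network

    def approve(self, host: str) -> None:
        """User clicks 'yes, that's allowed' in the dashboard."""
        self.allowed_hosts.add(host)
        self.pending = [h for h in self.pending if h != host]

policy = EgressPolicy({"api.anthropic.com"})
print(policy.check("https://api.anthropic.com/v1/messages"))  # allow
print(policy.check("https://pypi.org/simple/requests/"))      # hold
policy.approve("pypi.org")
print(policy.check("https://pypi.org/simple/requests/"))      # allow
```

The key design point is that "hold" is not an error from the model's perspective; the request simply resolves late, which is why the agent's flow is never interrupted.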
And I'm just going hog wild. Like, I can't stop doing agentic stuff because it just feels so good. I feel so powerful.
You know, you talk about the vibe coding hangover. I feel like I'm going through a similar intoxication phase again because I just feel so powerful.
It reminds me of back last summer after, I forget which model it was, one of the Claude models took a big leap forward and suddenly I just felt so powerful.
I was like, oh, I've got a team of juniors that are the fastest coders in history, and I'm implementing the kinds of things that used to take me a month in like an hour, and it's unbelievable.
And so, you know, I got this rush and I got so amped up and I was coding all day. I'm having a similar experience now with the sandbox environment that we set up, because I feel so powerful. I can control the model exactly the way I want. Anyway.
Yeah, it was probably Sonnet 3.5. And yeah, we should actually do a special episode or something on this because there's so much.
I know we've each kind of come at that same problem from different angles. And this big breakthrough that you've made here recently, I mean, I think it's substantive enough that we should pull it apart and show it. Yeah. Explain to everyone why it is that you're this pumped about it, you know?
It's really cool. But yeah, I would say, just in general, the high-level motivation there is that it's very difficult to know what a process on your machine is doing, right?
The interactions between a process and your operating system are not designed to be transparent, right? And that normally is not a big deal, but AI kind of changes the game a little bit, because it's kind of like you have an actor inside of your computer now, right?
A given process is like an independent actor that has its own, you know, I'm anthropomorphizing a bit, but it's as if you have an external actor
in your computer acting on your behalf as a process. And so I don't think Linux was really designed with that security model, right?
It wasn't designed from a point of view where it's like, oh, yeah, any given process is like an external user.
It definitely wasn't designed that way, right? Like, invite some rando over to take the keyboard and, you know, go ham, right?
Like, absolutely.
And that's effectively what running an AI agent in a process with your user permissions is, right? It's equivalent to handing over your keyboard and mouse to some person that you don't know, but also they are so fast that they can do thousands of things per second, and you could never possibly keep up with them, right?
And they are being controlled by somebody that you've never met, and there's no way to ever hold them accountable for anything they do.
So it's actually quite a bit worse than giving a rando access to your machine. So it's a bit crazy. And this is becoming more and more just a thing that everybody sort of reluctantly does, or in some cases, you know, does with giddy excitement.
But it's not something that I can accept, right? It it just seems like a it's a fundamentally bad idea that is leading to just a lot of security vulnerabilities in the world right now.
The surface area for bad outcomes is massive. Yeah, exactly. It seems like a much more pressing AI alignment problem or whatever, right?
Like there's all this doomer stuff that comes out of San Francisco, but it's very oriented around telling a certain kind of story where, you know, your only hope is to invest a lot of money in Anthropic, right?
But realistically, the present threats, you know, this cybersecurity problem that is universal now because everybody's just running agents with their user permissions on their machines.
That's a way bigger, much more urgent problem right now than model misalignment or whatever. And maybe great model alignment can mitigate some of the damage caused by this promiscuous behavior by users, but it's certainly not going to solve it.
Well, but look, even if you could completely trust the provider, there would be no scenario where it would not be a trust-but-verify thing, right?
And so, yeah, trust is essential, but not sufficient.
And I actually think that trust and verification are causally interconnected, right? Like, the act of verifying a process is what makes it trustable, right?
Because in practice, you don't really totally understand a thing that you're not observing. And so you are inherently going to end up in situations where the behavior of the system is unexpected, because you're not verifying what it's doing, right?
And unexpected is risk. Not trustable, exactly. And so you have to have observability, right? You have to have clarity into what an AI system is doing in order to be able to trust it.
Which is so spot on, and it is paradoxical, because interpretability of a neural network this size is impossible, right?
And so the technology at its core is, by that definition, untrustworthy, right?
Yeah, and I don't want to make it sound like the interpretability work that these big labs are doing isn't valuable. Anthropic is really leading in this, and I think that work is awesome, but it's not sufficient, right? You do certainly want to do the work to try to understand the model and to try to design models that are better aligned with the expectations and interests of the users.
That's absolutely required, right, for the happy-path future. And also, you need to have systems that are observable and put control in the hands of the user, rather than the standard right now. I mean, I don't want to drop a hard O and mention OpenClaw on this podcast.
That's brilliant. But, you know, the cultural zeitgeist right now is in a bad place when it comes to cybersecurity.
And a bad place that's borderline flirting with disaster, right? It's really asking for big problems.
Well, yeah. It may be the cultural zeitgeist, but the catastrophe will not be a cultural one.
It'll be a technical one with very, very severe real-world outcomes, and not just technical, economic too, you know? Yeah, I'm absolutely with you.
Yeah, so you have to have a sandbox, right, that's ergonomic for users to be able to use AI agents in a way that they control.
And I've seen a bunch of different attempts at solving this problem, and none of them have approached the problem in a way that satisfies me.
I've seen some nice projects that we've definitely taken some inspiration from, and borrowed ideas directly from some of them.
Like, I want to shout out this project, Use Tusk, out there that built this thing called Fence, which I think in turn is inspired by the work that the Claude Code team did last year.
They built this kind of software-defined sandbox around Claude Code, and Fence took that idea a little bit further. And that's a good idea. But I do think the company providing the model and the company providing the security infrastructure around that model inherently should not be the same.
I think that better aligns incentives, right? Ultimately, you want market pressure to act in favor of the user. And if the company providing the model, and getting you to pay for the model, is also the one building the scaffolding around protecting what the model does, of course they're going to have incentives to get you, the user, to do the things that are in the best interest of the company providing the model.
And that's not a cynical point of view. I mean, it's the reason for the external audit, right? Yeah, exactly. You want independent verification, independent sandboxing here.
Yeah, it doesn't take any kind of malicious intent, right? I am going to be much more personally motivated to solve a certain problem if it affects me in a direct way than I am if not solving the problem benefits me, right? So ultimately, yeah, a solution like this has to exist.
I've seen gVisor, this thing Google developed that I think is really cool. There are several others. And I've seen many, many attempts that look something like: hey, we've provided a Docker-like sandbox environment as a service with an orchestrator on top, which is what we built as well at first.
I think I already talked about why that falls short. But ultimately it was a shift in perspective that led to this, which is that a lot of the sandbox tools are designed from the perspective of how do I create a system that's safe for the agent, as opposed to how do I build the system that transfers agency to the user. And that second way of thinking about the problem is what led us to: okay, ultimately we need to find a way to insert a network proxy in between the model and everything it's interacting with, man-in-the-middle it, and then give you, the user, control at that choke point.
Well, I think you guys have addressed this across the three core dimensions of the problem, right?
Clearly there's network access, but it's also file system access and operating system access, right? And I love that reframing, by the way.
I think that's part of why the ergonomics of what you've done here are just so much nicer. You've gotten out of the way.
And it comes from that paradigm shift: rather than fence in, box in the agent, it's no, no, no, inject the agent in a way that is wholly transparent and contained, in my way, inside of my workspace,
right? Yeah, exactly. So yeah, this is why I'm giddy. Yeah, I get it. I feel like I can finally use an AI agent loop and not, you know,
jump through all these hoops. And we're adding all kinds of features around that, right? Now that we have the core infrastructure in place, there are all kinds of cool things that we can do.
Like, hey, let's just put token IDs in place of secrets inside of the container. Then the language model never actually touches any of the secrets. It just sends the ID out, the proxy intercepts the outbound request and replaces the ID with the real secret value, and then vice versa on the way back, it puts the ID back in. So the agent can use credentials without ever knowing what they are, and that eliminates a whole class of data exfiltration problems right off the bat.
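That token-swapping idea can be illustrated with a pair of proxy hooks. A sketch under the assumption of simple string placeholders; the real tool presumably does this inside the intercepted request stream:

```python
# Map of placeholder IDs to real secrets; only the proxy ever sees
# the right-hand side. The agent's environment contains only the IDs.
# Both names and values here are made up for illustration.
VAULT = {"SECRET_ID_7f3a": "sk-live-real-api-key"}

def rewrite_outbound(request_body: str) -> str:
    """Proxy hook on egress: swap placeholder IDs for real values."""
    for token_id, real_value in VAULT.items():
        request_body = request_body.replace(token_id, real_value)
    return request_body

def rewrite_inbound(response_body: str) -> str:
    """Proxy hook on ingress: swap real values back to IDs, so the
    agent never observes the secret even if a server echoes it."""
    for token_id, real_value in VAULT.items():
        response_body = response_body.replace(real_value, token_id)
    return response_body

agent_request = "Authorization: Bearer SECRET_ID_7f3a"
wire = rewrite_outbound(agent_request)   # real key goes over the wire
echo = rewrite_inbound(wire)             # agent only ever sees the ID
```

Because the substitution happens at the network choke point, nothing the model emits can leak a credential it never held in the first place.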
That's super nice. You take what can be misunderstood as something somewhat ominous, the man-in-the-middle proxy, right, and you've actually got that man doing real work for you now, right?
Exactly. It does feel good.
And so, you know, that's an obvious one, right? You have a specific ID, and it's really easy to scan for specific IDs and replace them.
But then the next stage after that that I think is really interesting is, you know, I could design my own model that I'm hosting that is trained to understand what my IP is, right?
And what constitutes information that I'm willing to share with my model provider and what constitutes information that I'm not willing to share, and then route a request accordingly, right?
And say, hey, this is commercial IP, this should never be touching an external model. Or hey, this person is just doing research on the internet, or this model is just trying to collect data from an endpoint.
Sure, go ahead and hit whatever. Use Claude for that, use ChatGPT, whatever. Or even in the future, I could imagine a situation where it can replace IP with a broad description of what's in the IP, and use the more powerful model in cases where the more powerful model doesn't necessarily need the actual contents of a secure document, let's say, but it needs to know for context what's in the document broadly.
You could have a model that sits at that choke point and replaces sensitive information with whatever placeholders work as functional replacements, and all kinds of other things you can do.
And so I'm really excited about what kinds of workflows this technology unlocks.
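That routing stage might reduce to something like this at the choke point. Everything here is hypothetical: the endpoint names are invented, and a real version would use a trained classifier rather than keyword matching:

```python
def classify(request_text: str) -> str:
    """Stand-in for the self-hosted routing model: decide whether a
    request contains commercial IP. Keyword matching is a toy proxy
    for what would really be a trained classifier."""
    sensitive_markers = ("proprietary", "internal", "customer record")
    if any(marker in request_text.lower() for marker in sensitive_markers):
        return "local"      # never leaves the building
    return "external"       # fine to send to a frontier model

def route(request_text: str) -> str:
    """Pick a destination endpoint based on the classification.
    Both endpoint names are hypothetical."""
    if classify(request_text) == "local":
        return "local-model"
    return "frontier-model"

print(route("Summarize this proprietary pricing sheet"))  # local-model
print(route("What's the capital of France?"))             # frontier-model
```

The design point is that the routing decision lives in infrastructure the user controls, not in any one model provider's stack.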
Yeah, it's like this foundational bit of kit upon which you can build and extend to kind of end up with a rich, deliberate, sovereign, if you will, AI ecosystem.
Yeah, exactly. That's very cool.
This whole thought process kind of reminds me a little bit of your sculpting metaphor, where, to a certain extent with AI, you're actually chiseling away at the possibility space rather than trying to build on it, right?
The model can do everything, and the problem is that it can do everything. If your possibility space is too large, it's not going to be solving the problem that you actually want.
And so understanding how best to restrict the space of possibility is what engineering is in software 3.0, as you like to say, right?
It is more an exercise in understanding how to restrain rather than how to augment a lot of the time.
Yeah, and you know, the two are so directly related, right? Like addition through subtraction, whether it's reducing the possibility space or otherwise specifically engineering the context, right?
Like those are kind of the two sides of the same coin.
There's this information theory concept that I use a lot when onboarding engineers onto my team. And the idea is basically that negative feedback, and this is, you'll see why this is related in a minute.
Negative feedback is less helpful than positive feedback. And we all sort of emotionally feel this way, right? Like, I would rather you tell me
that what I've done is good than tell me what I've done that's bad. And I have this little evolutionary psychology pet theory about this, which is that inherently the solution space in the real world is infinitely sized.
And so negative feedback is reducing an infinitely large space. And so it's actually not giving you much information, right?
Whereas positive feedback is reducing an infinite space to a finite space. You're saying, hey, the direction of what you're doing is a good one.
And so you don't need to look in every direction anymore. Look in the direction that you're already pointing. This is good.
And that's very, very rich in information. And so I think baked into our brains is this aversion to negative feedback, because the negative feedback doesn't actually help us understand what the correct solution is.
I think similarly with AI engineering, the solution space is infinite. And if you tell the model, do the thing, it can do so many different things, and most of them are wrong.
And so when you tell it, like, hey, do this specific thing, and you give it like a very narrow spec, you've narrowed the space of exploration down to this tiny, tiny window, right?
And you've eliminated so much of the possibility space, and so you've made the problem so much easier. And when you're dealing with the real-world solution space, which is so huge, you're eliminating so much when you give good specs, when you give good decisions, when you design a solution really well.
You're eliminating, right? Yeah, in a very real way. I catch your drift now.
And I'm glad you described it that way, too, because a complete specification is in fact two things.
To your point, right: the very, very clear, positive information of do this in this specific way, and the feedback loop of here's how you'll know if you've done that wrong, i.e., defining the tests, right? And it's with both of those that you end up with something that, as a specification, can almost get to replacing source code, because you've eliminated all that infinite negative space, and you've got something that can provide the feedback loop to test: have I done the thing right?
And now you've enabled autonomy in the model or in the agent harness to be able to do real work in ways that are very consistent with the expectations that you have as the engineer.
Yeah. A direction and a test, a spec and an eval. That's all you need, really.
That's exactly right. That's a nice distillation of the principles and process that I talked about in that vibe coding hangover concept.
Yeah, I love that talk. Right on. Well, I'm working on refining that down to something far more digestible in this sort of software 3.0 engineer perspective. But that thought of the specification being two parts, the positive declaration of exactly what it is that you want, and the way to define how to test whether that's been correctly implemented, is a key concept in that forthcoming refinement.
More to come.
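As a concrete miniature of that two-part specification, here's a toy spec-and-eval pair (the function and all names are purely illustrative): the comment is the positive declaration, and the assertions are the feedback loop that any implementation, human- or model-written, has to satisfy.

```python
import re

# Spec (the positive declaration): slugify(title) lowercases the
# title, replaces each run of non-alphanumeric characters with a
# single hyphen, and strips leading/trailing hyphens.
def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Eval (the feedback loop): any implementation that passes these is
# acceptable; the function body above is, in that sense, disposable.
assert slugify("Hello, World!") == "hello-world"
assert slugify("  Max & Corey: Episode 0 ") == "max-corey-episode-0"
assert slugify("---") == ""
```

With both halves in hand, the specific source is replaceable: a model could rewrite the body in another style or language, and the eval alone decides whether the result counts.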
Yeah, I don't think this is broadly true right now, and I don't know if it ever will be, but I increasingly have conversations about the idea that source code could be thrown away if you can keep the specification and the evaluation criteria.
And maybe the evaluation criteria is source code, and what I'm really saying is that certain parts of source code are less important than they used to be.
Or maybe what I'm trying to say is that in practice right now I want to keep both, and that maybe in the future, when language models are just so good at implementation, the source code will be decreasingly important as an artifact of the engineering process.
But I really do think that there's something to it. Like, I'm increasingly asking my engineers for the chain of prompts that led to a spec, right?
I'm increasingly asking them for the thought process that went behind the solution that they provided rather than the solution itself.
And I think that that's always sort of been important, right? I still tell my senior engineers when they join the company that it's forbidden to give more junior engineers solutions when they're working on a problem, because what you should do is ask them questions and help lead their thought process to the solution.
Because if you give them the solution itself, it's not sufficient. They need to understand why, right? They need to be able to work their way to the solution.
The path to the solution is more important than the solution itself. And so that's sort of always been the case, but I'm finding more and more that
our internal processes when interacting with language models are making this explicit in a way. I'm increasingly finding value in the language artifacts that led to the destination rather than the final artifact itself.
Well, in software 1.0, you would have gotten there through the process of planning with the team, right?
Like, together you would have spent the time, but it's so expensive, and it takes so much time to get to that shared headspace. Whereas now we're moving so quickly, working with these models and agent harnesses to write software so fast.
You want the same artifacts, but you're getting them in reverse order, right? Now it's: here is a solution, and also here is the
process of getting to the shared headspace. And instead of it being the whole team, it's now just this individual engineer and the model, right?
But you get to this shared headspace. And so those artifacts are still as important as ever; it's just the order in which we're getting to them that's changed. Which is what you're sharing here: hey, don't just send me the app, the working software, right?
I want specs and I want the chain of prompts, as you were saying. That order has changed, and I think that's fine. I think we're here, though.
Like, we're already here, right? We see all the time just how almost magically effective the model can be at taking very complex software that is itself really well covered with executable tests and porting it to any target
language or environment or whatever, right? I think of a couple of things that Simon Willison has done super recently.
He kind of loves to take hot-off-the-press implementations and port them, and port them in a publicized way.
He's written up two or three of these now. Like, there was an HTML parser that was written in Python and released, I don't know, towards the end of December. JustHTML, I think is what it was called.
And in four and a half hours of Claude Code time, he ported it perfectly to JavaScript, because the test suite was good enough and already existed.
I don't know, I think we're here. I think we're already here, in that with the spec being in those two parts, the description of the thing to do and how to evaluate whether you've done it correctly, you can kind of throw away the code. Because especially if the tests themselves accurately evaluate the functional implementation of what it is that you're after, the specific details of the implementation at the source code level kind of don't matter, right?
Like, if it's functionally accurate, isn't that good enough?
Yeah, although I will say in practice, the reality is always way messier. You know what I mean? I'm simultaneously blown away by models and disappointed by them every day.
I don't know what it is about the jaggedness of model capabilities. It's so uncanny how, you know, it will write to spec, but then there will be something so obviously implied in the spec that it will miss and flub somehow.
In practice, we're definitely not at the point where the language model writes code and I'm confident that the code does what I actually want it to, unless it's a very simple thing, right?
I don't know if that matches with your experience.
I mean, look, the stochasticity of it, right? The non-deterministic nature of the models means that, yeah, on any given run, certainly it may trip up on this or that.
I guess two things. There's that, but there's also the point where you're expecting the model to infer your specifications implicitly, right?
Like, part of the challenge for us now is that it's a very different way of getting to that shared headspace, because
the model has only one single channel for communication with you, and that is written text. Whereas when you sit with the team, you've got all this nonverbal stuff, and you've got the shaping influence of different perspectives, and you get there over a period of time.
Plus, you have the shared history, right. So you can be a lot more loose in specification when you work with a team of engineers, especially a team that you've worked with for some period of time, that you've delivered things with before, you know what I mean?
Yes, and also the models are strangely imbalanced, where they are just so superhuman in certain ways and just shockingly deficient in surprising ways as well.
So if you're talking about the
frustration of having a genius and a toddler work on the same thing, then yes, I'm absolutely with you there.
There is something very difficult to describe about the way that these models that are so superhuman are so woefully deficient in shocking ways.
And it's becoming, as they get better and better, it's becoming harder and harder to describe what it is that they're missing.
There's Karpathy. I think it was in Karpathy's blog at one point, he showed these little ability-surface diagrams, and it was like a spiky surface.
And it was like, oh, here's a competent human's ability. And here's competent AI's ability. And they overlap in a lot of ways, but they're completely not overlapping in other ways.
And the edges of the capability space are surprisingly jagged. Maybe if you were used to what models can do, you'd be shocked at the ways in which humans are so woefully deficient at all these things that the models are so good at.
But since we're all used to what humans can do, we're constantly surprised, or at least, I don't know, I certainly am constantly surprised at how superhuman they are at anything that's verifiable, right?
Any kind of thing for which I can write a cost function, the models are just so superhuman at. And everything that I can't quite put my finger on, they're just so shockingly dumb at. So, I don't know, I think there's a way in which the capability space is uneven.
And I don't know if this means, well, I don't want to go too far down another philosophical rabbit hole, but I do feel like there's this concept of general intelligence, and people use the term AGI all the time.
Oh my goodness, yeah.
It's, like, massively overused. But you know, there's this implicit idea in the concept of general intelligence that there is such a thing, that there's a generic quantity called intelligence that you can have more or less of, which I think language models have pretty much put the nail in the coffin of, for me.
I almost feel like the idea of intelligence is as precise as the concept of athleticism, right? Like, what's more athletic, a cheetah or a gorilla?
Probably not the right question to ask, right? Like, and general intelligence itself may not be the right thing to build towards, and certainly not in the short term, right?
I mean, like, to wit: in the day-to-day work I'm doing, adhering to well-defined engineering principles for working successfully with coding agents, there is not really a limitation to what I can build.
And by what I can build, I mean what I can build from scratch and what I can build on top of things that exist already.
And so, if there is no practical limit, and let's be a little hand-wavy here too, right? Back to the issue you were describing earlier: you can sample the model, right? So instead of running a single agent hierarchy to achieve the outcome, because you know that it's going to stub its toe or misinterpret or whatever, you run three agent hierarchies.
For things that are mission critical, that you want to make sure you get exactly right, you kind of expand outward, and through what is inherently the process of sampling the model, you overcome the non-deterministic rough edges of the thing, right? So that, in general, on average, you do get a solution that works.
And the models themselves are internalizing that sampling art; we see it all the time, by the way, whether it's the agent harnesses from the frontier labs or the Kimi agent swarm, right?
The model providers are themselves leaning more and more on this kind of sampling approach to smooth out the non-deterministic rough edges here.
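That sampling-for-consensus idea can be sketched in a few lines of Python. Everything here is hypothetical: `call_agent` stands in for one full, non-deterministic agent run, and its canned answers simulate the model stubbing its toe on one sample out of three.

```python
from collections import Counter

def call_agent(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one full agent run; the canned answers
    # simulate non-determinism (one run out of three trips up).
    simulated_runs = ["42", "42", "41"]
    return simulated_runs[seed % len(simulated_runs)]

def sample_consensus(prompt: str, n: int = 3) -> str:
    """Run n independent samples and keep the most common answer."""
    answers = [call_agent(prompt, seed) for seed in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(sample_consensus("What is 6 * 7?"))  # -> 42, despite one bad sample
```

With real agents you would fire off `n` genuinely independent runs, ideally in parallel, and vote, or have a judge model pick among them; the point is simply that sampling averages out the non-deterministic rough edges.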
We can be a little hand-wavy and say that in the coming X months, 12, 24, whatever, those annoyances get buffed away. Then the question kind of becomes, and this is, I think, part of what you're getting at:
does AGI really matter? Or isn't there a point of saturation where, for the thing I'm trying to use it for, it is, for all intents and purposes, so much better than me, and so much better than the average human or even aggregate humans, that talking about general intelligence is no longer useful? At that point the focus becomes doing real work with the thing in its current state, as opposed to trying to evolve it beyond its current state into some generalized form of superintelligence.
I don't know. That feels pretty bloody close, at least for software engineering coupled with proper engineering discipline.
Like, it feels really close, man.
You know, I'm increasingly consulting with businesses that are trying to modernize their engineering practices. And so I have a lot of these kinds of conversations.
And I do feel, in my real-world experience, that there is this significant gap: the language models are so good at prototyping so fast that the degree to which they are superhuman is heavily weighted towards the prototyping phase of software.
And so you can get things that sufficiently solve a problem for prototype purposes so quickly. To a certain extent, you talked about this in the vibe coding hangover video that you did.
But you end up in this situation where you have so much kind-of-working code, and then there's maintaining it. Let's say you get a 30x speed-up on prototyping, and an 80% speed-up on maintenance-oriented software activity. That's just a number I'm coming up with off the top of my head, based on my personal experience.
I think what that leads to is a situation where people are spending more and more of their time on that prototyping part of the problem, and so language models feel incredible.
And then the people who are actually maintaining software day to day have a kind of mismatched experience, where they're like, yeah, language models are really helping me, but they're not earth-shattering in the same way. And I think there's something about that tension that leads to these really mismatched opinions about the capabilities of language models from person to person that I talk to.
But all that's to say, I'm not trying to claim the language models are not incredible, right? I do think there are contexts in which they're much less incredible and contexts in which they're much more incredible. And I think the "holy moly, look how good this language model is" take is much more prevalent in day-to-day discourse, because people are more likely to talk about something they're really excited about than to say, yeah, it's pretty good, right?
So I don't know. I don't know if I'm if I'm really making that coherent of a point here.
No, I mean, I was reacting to what I thought you were saying, which was general intelligence is not a useful concept and may not even be a real concept.
And therefore specific intelligence, the opposite, is quite useful. And then the next point I thought I heard was: look at how, in certain dimensions of specific intelligence, the models are demonstrably superhuman.
And you can list them: the Math Olympiad questions, I mean, there's a whole litany of benchmarks.
And, you know, set aside for a moment the question of benchmark saturation in the training of the models.
I think it is sufficiently well documented that the models are superhuman at things that we as humans hold as measures of specific intelligence, the Math Olympiad and all of these; there are literally benchmarks, right?
And those benchmarks are useful up to a point. By the way, as a tangent, the OpenAI folks just came out today and said something about SWE-bench Verified, which, for those who hopefully may still be listening, is OpenAI's significantly cleaned-up version of the original Princeton University SWE-bench benchmarking data set, a coding benchmark for coding models and coding agents.
They took some large number, it was thousands, maybe twenty-eight hundred GitHub issues from big public open-source projects that had ultimately been solved with contributions, and they bundled up a really useful, real-world applied data set for testing the efficacy of coding models and coding agents on solving real-world coding issues.
It's worth going and reading what they've just written about kind of doing away with SWE-bench Verified.
Because it turned out that as they were training o3, o3 was getting some significant number, 150, maybe 190, I can't recall the exact number, of these just flat-out wrong. And so they started wondering: hang on a sec, is it that the model is failing, or is it that the GitHub issues themselves are not set up in a way that a non-human could reasonably solve? So they went and cleaned it all up, right? And SWE-bench Verified has for some time now been the improved version of SWE-bench that you use for benchmarking.
They've come out today and said: well, we actually think there's so much contamination on this benchmark, not just in our models but in the models from Anthropic and Google as well, that we're shifting very heavily towards a private benchmark, a private eval set. Which, you know, shocker, right?
Like, private evals, who'd have thought. All of that was a tangent off of: let's set aside benchmark or eval saturation in the models themselves.
I think that's a really interesting sidebar. So, if you don't mind, I kind of want to respond. Yeah, go on, yeah. Don't lose your train of thought. But I actually think it's really interesting: we do our own internal evaluations of the different models against the different benchmarks, and based on my own background in machine learning, I know from personal experience that models are very good at overfitting to whatever benchmark you are measuring them against, right?
And in practice, you have to separate the things you test the model on from the things you train the model on, or else the model will basically just memorize the test, right?
Yeah, train/test split. Who'd have thought, right? Yeah, yeah. And that is increasingly difficult for the language model providers, because they're training on the entirety of everything that's ever been published on the internet.
And the data sets are too big for them to curate to the point where they know for a fact that everything's been taken out of the training set. So I think it's very difficult to produce training sets that don't have any of the SWE-bench data in them.
Right. And so even if you're trying really hard not to contaminate your data set, I think you're going to end up with contaminated data sets.
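The train/test discipline being described here can be made concrete with a minimal sketch in plain Python (no ML library assumed): hold out a random slice of the data that the model never sees during training, and verify the two sets stay disjoint.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the data for evaluation only."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train, test = train_test_split(examples)
# If any example appears in both sets, the "test" score partly
# measures memorization rather than capability.
assert not set(train) & set(test)
print(len(train), len(test))  # 80 20
```

The contamination problem the speakers describe is exactly this check failing at internet scale: with a web-sized training corpus, you can no longer prove the test items were held out.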
But I also find that there are pretty clearly some labs that are more, let's say, saturated in terms of benchmarks than others.
From our own combination of internal evaluations and vibe-based evals, I would say that Anthropic's models always perform worse on public benchmarks than they do on our private benchmarks, relative to the other models, right? This has consistently been the case for the entire history of Claude:
it does better on our internal evaluations, relative to other models, than it does on those public benchmarks. So maybe they're just better at cleaning up their data sets.
Maybe the other labs are actively training on those benchmarks and trying to tune their models to them. I don't know. But there's clearly, clearly a consistent difference from lab to lab.
Like, the Qwen models, to me, are always better at the measurable benchmark stuff and worse on the private eval stuff that we run them on.
Always. The Qwen models have always been worse than the public benchmarks suggest they are. On the other hand, Qwen models seem to be really good at correctly producing JSON, for example, which is a
Sorry, they've engineered specifically for that. Structured outputs are one of the things, especially in the Qwen3 models, that they spent a lot of time in training to make really reliable. But yeah, keep going. Okay.
So, I mean, maybe that's actually related to their performance on these public benchmarks, right? It's possible that a significant chunk of the error rate of some of the models is that they just produce things with syntax errors.
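One way to see that failure mode separately, sketched with a hypothetical grader: check whether the raw output even parses as JSON before comparing its content, so a syntax slip and a genuinely wrong answer get counted as different kinds of error.

```python
import json

def grade(raw_output: str, expected: dict) -> str:
    """Classify a model response as correct, wrong, or a formatting failure."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # The content might even be right, but the syntax is not:
        # without this check it would simply be scored as wrong.
        return "format_error"
    return "correct" if parsed == expected else "wrong"

expected = {"label": "approve"}
print(grade('{"label": "approve"}', expected))   # correct
print(grade("{'label': 'approve'}", expected))   # format_error (single quotes)
print(grade('{"label": "reject"}', expected))    # wrong
```

A benchmark that only reports pass/fail collapses the last two categories together, which is roughly the "narrow miss" problem discussed next.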
Well, and that was kind of why OpenAI spent so much time producing, I mean, serious time and money producing, the Verified data set: it was exactly things like that.
They call them narrow misses or something, right? Where the tests require some very narrow definition of correctness,
and even though the model produced a functionally accurate result, it didn't test as accurate and so was counted wrong, to your point.
Yeah, exactly. So, I don't know, but I do think it's interesting that the different labs have different characteristics in this way, and that they seem to emphasize different capabilities.
Like, you know, Claude has always been better at coding relative to its other skills.
That was exactly what I was gonna bring up, because we've been using Claude for coding for 18 months or more now, right?
Since Sonnet 3.5, effectively. But it's never shown up on the benchmarks. If you read Artificial Analysis and their benchmarking, they've got, and have always had,
the latest OpenAI model and the latest Gemini model ranked higher than the latest Anthropic model, and that has never been
consistent with my experience, or frankly, with most AI engineers' experience.
Yeah, that's exactly right. That's sort of what I was getting at, and I don't think it's limited to Anthropic.
I think there are some labs that have sort of always over-performed on these public benchmarks relative to their real-world performance.
I mean, another one is those GPT-OSS models, right? The OpenAI open-weight models. They do so well on all these benchmarks,
and even when they came out, they were not good models.
They were never good models at any point. The benchmarks are objective, right? Whereas my anecdotal evidence is obviously strongly biased by whatever filters I have.
Yeah, but that's kind of my point. It's not to be dismissive of anecdotal evidence; anecdotal is the only thing that matters.
The only thing that matters is the way you need to do your work with these models. If it works really well for someone else on a thing that you don't do,
why does that matter at all? It's back to, I think, in principle, the same question of general intelligence versus specific intelligence.
Like, if the best available model tomorrow were exceptionally good at radiology, we as humanity would be better off for it, but I, as an individual, have no immediate benefit from that.
I don't work in that space. Tomorrow I'm not going to do anything with x-rays; I'm not going to be reading any MRI outputs, you know?
There's no immediate benefit to me in the work that I do. Anecdotal evidence really matters in doing real work with these models, and so does formalizing it so that it's not anecdotal anymore. To your point: as you come across where the model falls down, or conversely where it does really well on the thing you're doing, spending that extra bit of time to extract something you can preserve and use to evaluate going forward is time tremendously well spent, time that pays compound returns in short order.
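That habit of formalizing anecdotes can be as lightweight as a JSONL file of cases you replay against each new model. A minimal sketch; the file name and record shape here are invented for illustration:

```python
import json
from pathlib import Path

EVAL_FILE = Path("my_evals.jsonl")  # hypothetical personal eval set

def record_case(prompt: str, expected: str, note: str = "") -> None:
    """Preserve a case where a model fell down (or shined) as a reusable eval."""
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "expected": expected, "note": note}) + "\n")

def run_evals(ask_model) -> float:
    """Replay every saved case against a model callable; return the pass rate."""
    cases = [json.loads(line) for line in EVAL_FILE.read_text().splitlines()]
    passed = sum(ask_model(c["prompt"]).strip() == c["expected"] for c in cases)
    return passed / len(cases)
```

Each time a model surprises you, good or bad, `record_case` turns the anecdote into a permanent test; `run_evals` then gives you a private, personally relevant benchmark for every new model release.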
That's it for today. Thanks for listening to the end here. Corey and I will record a few more of these in the coming weeks.
Stay tuned and have a wonderful day. Until next time.