I thought this article was going to be a bunch of security theater nonsense - maybe because of the relatively bland title - but after reading it I found it incredibly insightful, particularly this:
> MCP discards this lesson, opting for schemaless JSON with optional, non-enforced hints. Type validation happens at runtime, if at all. When an AI tool expects an ISO-8601 timestamp but receives a Unix epoch, the model might hallucinate dates rather than failing cleanly. In financial services, this means a trading AI could misinterpret numerical types and execute trades with the wrong decimal precision. In healthcare, patient data types get coerced incorrectly, potentially leading to wrong medication dosing recommendations. Manufacturing systems lose sensor reading precision during JSON serialization, leading to quality control failures.
Having worked with LLMs every day for the past few years, it is easy to see every single one of these things happening.
I can practically see it playing out now: there is some huge incident of some kind, in some system or service with an MCP component somewhere, with some elaborate post-mortem revealing that some MCP server somewhere screwed up and output something invalid, the LLM took that output and hallucinated god knows what, its subsequent actions threw things off downstream, etc.
It would essentially be a new class of software bug caused by integration with LLMs, and it is almost sure to happen when you combine it with other sources of bugs: human error, the total lack of error checking or exception handling that LLMs are prone to (they just hallucinate), a bunch of gung-ho startups "vibe coding" new services on top of the above, etc.
I foresee this being followed by a slew of Twitter folks going on endlessly about AGI hacking the nuclear launch codes, which will probably be equally entertaining.
Before 2023 I always thought that all the bugs and glitches of technology in Star Trek were totally made up and would never happen this way.
Post-LLM I am absolutely certain that they will happen exactly that way.
I am not sure what LLM integrations have to do with engineering anymore, or why it makes sense to essentially put all your company's infrastructure into external control. And that is not even scratching the surface with the lack of reproducibility at every single step of the way.
When I look at how AI and ML systems, software, and platforms have been built over the last decade (and are still being built), I can't help but think that the only thing that ever really mattered was the response the system produced.
Never mind the quality or if it's even going to work in production.
And maybe that's all that's needed, I don't really know.
I'm sure that's just me being the old curmudgeon of a software engineer I am, wishing people thought about more than one user using a system and 2 engineers supporting it.
Consider this - everything will "somehow work" if the system has been there for generations and is complex enough that no single human can keep everything about it in their head at any given time.
It is easy to keep a system high quality, well maintained, and well understood for a year with a small team, but imagine doing that for 100+ years with a system constantly evolving in complexity, with generations of maintainers and people rotating in and out.
The computer, at least aboard the enterprise, is kind of portrayed as a singular monolithic AI that can access the majority of the ship's subsystems (different networks, other computer/control units, etc) and functions. It can control nearly every aspect of the ship while talking with its human crew / commanding officers.
So very much like an LLM accessing multiple pieces of functionality across different tools and API endpoints (if you want to imagine it that way).
While it is seemingly very knowledgeable, it is rather stupid. It gets duped by nefarious actors, or has elementary classes of bugs that put the crew in awkward positions.
Most professional software engineers might previously have looked at these scenarios as implausible, given that the "failure model" of current software is quite blunt, and especially given how far into the future the series took place.
Now we see that computational tasks are becoming less predictable and less straightforward, with cascading failures instead of blunt, direct failures. Interacting with an LLM when it starts to hallucinate might be compared to talking with a person in psychosis.
Excellent comment, couldn't have described it better.
I wanted to add that in Star Trek they always talk in technobabble, things like "Computer, create a matrix from a historic person who was knowledgeable in a specialized surgery field", and then the holodeck creates that avatar's approximation, with the programming and simulated/hallucinated expertise.
The holodeck is a special kind of weird, because so many accidents happen due to sloppy coding: the ship's AI creates flawed programs that later hurt crew members when safety protocols fail or get ignored/bypassed - which we now see mirrored in the rising field of red-team prompt engineering.
Additionally, in Star Trek instead of coding on tablets, they usually just show analytics data or debug views of what the ship's computer created. The crew never actually code on a computer, and if they do they primarily just "vibe code" it by saying absurd things like "Computer, analyze the enemy ship's frequency and create a phasing shield emitter to block their phasers" (or something like that) and the computer generates those programs on the fly.
The cool part that I liked the most is when Voyager's bio-neural gel packs (think of them as AI-to-system control adapters) actually got sick with a biological virus, because they were essentially made out of brain matter.
These are such great points. I'm truly mind-boggled at how they got those ideas so right when people previously wouldn't have believed this direction to be correct at all. People tend to think that if we reach AGI there's nothing to worry about, because the AI will be able to handle it; but whether we reach AGI or pass through whatever steps lie in between, for a period of time the behaviour displayed in Star Trek will be very plausible.

Asking AI to create elaborate debug views is something I definitely spend a lot of time doing when vibe coding - as is trying to orchestrate seemingly ridiculous scenarios to either keep the AI on track, or brainstorm future directions, etc. AI generates close to 100% of my code, but I also have to ask it to create guardrails for itself: special linting rules that I would never use myself, so it avoids the common errors it makes. It can generate the code, but out of the box it won't keep itself on track, which leads to very interesting scenarios and potentially absurd stories.
> The cool part that I liked the most is when Voyager's bio-neural gel packs (think of them as AI-to-system control adapters) actually got sick with a biological virus, because they were essentially made out of brain matter.
I liked that part too. I hadn't paid much attention before, but it was a fun revelation that the computer is run by a bunch of brain-tissue pouches. An LLM's "guts", so to speak, are pretty much a collection of brain-tissue clumps semantically - weights and connections, as opposed to some database of logical assertions like the expert systems people envisioned in the 1980s.
I don't understand this criticism by the author. MCP supports JSON Schema, and server responses must conform to the schema. If the schema requires an ISO-8601 timestamp (e.g., by specifying a "date-time" format in the schema) but the server sends a Unix epoch timestamp, then it is violating the protocol.
The author even later acknowledges that MCP supports JSON Schema, but also claims "you can't generate type-safe clients", which is plainly untrue: there are plenty of JSON Schema code generators.
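For illustration, here's a hedged sketch of the kind of output schema an MCP server can declare (the field names are invented). One honest caveat: JSON Schema's "format" keyword is only enforced when the validator runs with format assertions enabled, which is exactly the sort of detail a client has to get right.

  const tradeResultSchema = {
    type: "object",
    properties: {
      // "date-time" means RFC 3339 / ISO-8601; a bare Unix epoch number fails here
      executedAt: { type: "string", format: "date-time" },
      quantity: { type: "integer" },
      // decimal-as-string avoids binary floating-point precision loss
      price: { type: "string", pattern: "^[0-9]+\\.[0-9]{2}$" },
    },
    required: ["executedAt", "quantity", "price"],
  } as const;

A client that checks responses against this rejects { executedAt: 1690891200 } deterministically, before the model ever sees it.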
> Cast an integer into the type of a 2023 Toyota Yaris using Javascript
(GPT-4o mini)
> To cast an integer into the type of a 2023 Toyota Yaris in JavaScript, you would typically create a class or a constructor function that represents the Toyota Yaris. Then, you can create an instance of that class using the integer value. Here's an example of how you might do this:
// Define a class for the 2023 Toyota Yaris
class ToyotaYaris {
  constructor(modelYear, value) {
    this.modelYear = modelYear;
    this.value = value;
  }

  displayInfo() {
    console.log(`Model Year: ${this.modelYear}, Value: ${this.value}`);
  }
}

// Function to cast an integer into a Toyota Yaris object
function castToYaris(integerValue) {
  const modelYear = 2023; // Set the model year
  return new ToyotaYaris(modelYear, integerValue);
}

// Example usage
const integerValue = 20000; // Example integer value
const yaris = castToYaris(integerValue);
yaris.displayInfo(); // Output: Model Year: 2023, Value: 20000
It works in this instance. On this run. It is not guaranteed to work next time. There is an error percentage here that makes it _INEVITABLE_ that eventually, with enough executions, the validation will pass when it should fail.
It will choose not to pass this to the validator, at some point in the future. It will create its own validator, at some point in the future. It will simply pretend like it did any of the above, at some point in the future.
This might be fine for your B2B use case. It is not fine for underlying infrastructure for a financial firm or communications.
Every time the LLM uses this tool, the response schema is validated, deterministically. The LLM will never see a non-integer value as output from the tool.
I write these as part of my job, I know how they work. I'm not going to spend more time explaining to you (and demonstrating!) what is in the spec. Read the spec and let the authors know that they don't understand what they wrote. I've run out of energy in this conversation.
llm tool call -> mcp client validates the schema -> mcp client calls the tool -> mcp server validates the schema -> mcp server responds with the result -> mcp client passes the tool result into llm
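A minimal sketch of that loop in TypeScript - not any real client's source, just the shape of it; `validate` stands in for a JSON Schema validator such as Ajv, and the type shapes are assumptions:

  type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

  interface Tool {
    inputSchema: Json;                 // advertised by the MCP server
    outputSchema?: Json;
    invoke(args: Json): Promise<Json>; // the JSON-RPC round trip to the server
  }

  declare function validate(schema: Json, value: Json): boolean; // e.g. Ajv

  async function handleToolCall(tools: Map<string, Tool>, name: string, args: Json): Promise<string> {
    const tool = tools.get(name);
    if (!tool) return `error: unknown tool ${name}`;
    if (!validate(tool.inputSchema, args)) return "error: arguments do not match schema";
    const result = await tool.invoke(args);
    if (tool.outputSchema && !validate(tool.outputSchema, result))
      return "error: tool output failed schema validation"; // the model sees only this
    return JSON.stringify(result); // only validated data ever reaches the model
  }

The point being: this function runs in ordinary deterministic code. The model's output can influence `name` and `args`, but not whether the `validate` calls happen.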
Can you guarantee it will validate it every time? Can you guarantee that the way MCPs/tool calling are implemented (which is already an incredible joke that only Python-brained developers would inflict upon the world) will always go through the validation layer? Are you even sure which part of Claude handles this validation? Sure, it didn't cast an int into a Toyota Yaris. Will it cast "70Y074" into one? Maybe a 2022 one. What if there are parsing rules embedded in a string - will it respect them every time? What if you use it outside of Claude Code, and just ask nicely through the API - can you guarantee this validation still works? Or that they won't break it next week?
The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
> Can you guarantee it will validate it every time?
Yes, to the extent that you can guarantee the behavior of any third-party software (which you can't really guarantee no matter what spec the software supposedly implements, so the gaps aren't an MCP issue). "The app enforces schema compliance before handing the results to the LLM" is deterministic behavior in the traditional app that provides the toolchain - the interface between the tools (and the user) and the LLM - not non-deterministic behavior driven by the LLM. Hence, "before handing the results to the LLM".
> The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
The toolchain is parsing, validating, and mapping the data into the format preferred by the chosen model's prompt template. The LLM has nothing to do with that, because by definition it has to happen before the LLM can see the data.
>The toolchain is parsing, validating, and mapping the data into the format preferred by the chosen model's prompt template. The LLM has nothing to do with that
The LLM has everything to do with that. The LLM is literally choosing to do that. I don't know why this point keeps getting missed or side-stepped.
It WILL, at some point in the future and given enough executions, as a matter of statistical certainty, simply not do that above, or pretend to do the above, or do something totally different at some point in the future.
> The LLM has everything to do with that. The LLM is literally choosing to do that.
No, the LLM doesn't control on a case-by-case basis what the toolchain does between the LLM putting a tool call request in an output message and the toolchain calling the LLM afterwards.
If the toolchain is programmed to always validate tool responses against the JSON schema provided by MCP server before mapping into the LLM prompt template and calling the LLM again to handle the response, that is going to happen 100% of the time. The LLM doesn't choose it. It CAN'T because the only way it even knows that the data has come back from the tool call is that the toolchain has already done whatever it is programmed to do, ending with mapping the response into a prompt and calling the LLM again.
Even before MCP, or even models specifically trained for tool calling with vendor-provided templates (but after the ReAct architecture was described), it was like a weekend project to implement a basic framework supporting tool calling around a local or remote LLM. I don't think you need to do that to understand how silly the claim is that the LLM controls what the toolchain does with each response and might make it skip validation, but doing it will certainly give you a visceral understanding of how silly it is.
I think you are, for whatever reason, missing a fact of causality here and I'm not sure I can fix that over text. I mean that in the most respectful way possible.
Are you two talking at cross-purposes because you don't have a shared understanding of control and data flow?
The pieces here are:
* Claude Code, a Node (Javascript) application that talks to MCP server(s) and the Claude API
* The MCP server, which exposes some tools through stdin or HTTP
* The Claude API, which is more structured than "text in, text out".
* The Claude LLM behind the API, which generates a response to a given prompt
Claude Code is a Node application. CC is configured in JSON with a list of MCP servers. When CC starts up, CC's JavaScript initialises each server, and as part of that gets a list of callable functions.
When CC calls the LLM API with a user's request, it's not just "here are the user's words, do it". There are multiple slots in the request object, one of which is a "tools" block: a list of the tools that can be called. Inside the API, I imagine this is packaged into a prefix context string like "you have access to the following tools: tool(args) ...". The LLM API probably has a bunch of prompts it runs through (figure out what type of request the user has made, maybe using different prompts to make different types of plan, etc.), and somewhere along the way the LLM might respond with a request to call a tool.
The LLM API call then returns the tool call request to CC, in a structured "tool_use" block separate from the freetext "hey good news, you asked a question and got this response". The structured block means "the LLM wants to call this tool."
CC's JS then calls the server with the tool request and gets the response. It validates the response (e.g., JSON schemas) and then calls the LLM API again bundling up the success/failure of the tool call into a structured "tool_result" block. If it validated and was successful, the LLM gets to see the MCP server's response. If it failed to validate, the LLM gets to see that it failed and what the error message was (so the LLM can try again in a different way).
The idea is that if a tool call is supposed to return a CarMakeModel string ("Toyota Tercel") and instead returns an int (42), JSON Schemas can catch this. The client validates the server's response against the schema, and calls the LLM API with either the validated result or a structured validation error.
So the LLM isn't choosing to call the validator, it's the deterministic Javascript that is Claude Code that chooses to call the validator.
There are plenty of ways for this to go wrong: the client (Claude Code) has to validate; int vs string isn't the same as "is a valid timestamp/CarMakeModel/etc"; if you helpfully put the thing that failed into the error message ("Expect string, got integer (42)") then the LLM gets 42 and might choose to interpret that as a CarMakeModel if it's having a particularly bad day; the LLM might say "well, that didn't work, but let's assume the answer was Toyota Tercel, a common car make and model", ... We're reaching here, yet these are possible.
But the basic flow has validation done in deterministic code and hiding the MCP server's invalid responses from the LLM. The LLM can't choose not to validate. You seemed to be saying that the LLM could choose not to validate, and your interlocutor was saying that was not the case.
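To make the structured blocks concrete, here is a hedged sketch loosely following the shape of Anthropic's Messages API (field details simplified; the tool name and values are invented):

  // 1. The model's reply carries a structured tool request, not free text:
  const modelReply = {
    content: [
      { type: "tool_use", id: "call_1", name: "get_car_make_model", input: { vin: "ABC123" } },
    ],
  };

  // 2. Claude Code calls the MCP server, validates the result, and only then
  //    calls the LLM API again with a structured tool_result block:
  const followUp = {
    role: "user",
    content: [
      { type: "tool_result", tool_use_id: "call_1", content: "Toyota Tercel" },
      // ...or, on validation failure, an error block the model can react to:
      // { type: "tool_result", tool_use_id: "call_1", is_error: true,
      //   content: "Expected string, got integer (42)" },
    ],
  };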
>Are you two talking at cross-purposes because you don't have a shared understanding of control and data flow?
No, they're literally just skipping an entire step in how LLMs actually "use" MCP.
MCP is just a standard, largely for humans. LLMs do not give a singular fuck about it. Some might be fine-tuned for it to decrease erroneous output, but at the end of the day it's just system prompts.
And respectfully, your example misunderstands what is going on:
>* The Claude API, which is more structured than "text in, text out".
>* The Claude LLM behind the API, which generates a response to a given prompt
No. That's not what "this" is. LLMs use MCP to discover tools they can call, aka function/tool calling. MCP is just an agreed-upon format; it doesn't do anything magical. It's just a way of aligning the structure across companies, teams, and people.
There is not an "LLM behind the API". While a specific tool might implement its overall feature set using LLMs, that's totally irrelevant to what's being discussed and to the principal point of contention.
Which is this: an LLM interacting with other tools via MCP still needs system prompts or fine tuning to do so. Both of those things are not predictable or deterministic. They will fail at some point in the future. That is indisputable. It is a matter of statistical certainty.
It's not up for debate. And an agreed upon standard between humans that ultimately just acts as convention is not going to change that.
It is GRAVELY concerning that so many people are trying to use technical jargon they are clearly ill-equipped to use. The magic rules all.
> No, they're literally just skipping an entire step in how LLMs actually "use" MCP.
No, you are literally misunderstanding the entire control flow of how an LLM toolchain uses both the model and any external tools (whether specified via MCP or not, but the focus of the conversation is MCP).
> MCP is just a standard, largely for humans.
The standard is for humans implementing both tools and the toolchains that call them.
> LLMs do not give a singular fuck about it.
Correct. LLM toolchains (which, if they can connect to tools via MCP, are also MCP clients) care about it. LLMs don't care about it, because the toolchain is the thing that actually calls both the LLM and the tools. And that's true whether the toolchain is a desktop frontend with a local, in-process llama.cpp backend for running the LLM, or the Claude Desktop app with a remote connection to the Anthropic API for calling the LLM, or whatever.
> Some might be fine-tuned for it to decrease erroneous output,
No, they aren't. Most models used to call tools now are specially trained for tool calling, with a well-defined format for requesting tool calls from the toolchain and receiving results back from it. (This isn't necessary for tool calling to work; people were using the ReAct pattern in toolchains to do it with regular chat models, without any training or prespecified prompt/response format for tool calls, just by having the toolchain inject tool-related instructions into the prompt and read LLM responses to see if the model was asking for tool calls.) But none of the models that exist now are fine-tuned for MCP, nor do they need to be, because they literally never see it. The toolchain reads LLM responses, identifies tool call requests, takes any that map to tools defined via MCP and routes them down the channel (HTTP or subprocess stdio) specified by the MCP server, and does the reverse with responses from the MCP server, validating responses and then mapping them into a prompt template that specifies where tool responses go and how they are formatted. It does the same thing (minus the MCP parts) for tools that aren't specified by MCP (frontends might have their own built-in tools, or other mechanisms for custom tools that predate MCP support). The LLM doesn't see any difference between MCP tools, other tools, or a human reading the message with the tool request and manually crafting a response that goes directly back.
> LLM's use MCP to discover tools they can call,
No, they don't. LLM frontends, which are traditional deterministic programs, use MCP to do that, and to find schemas for what should be sent to and expected from the tools. LLMs don’t see the MCP specs, and get information from the toolchain in prompts in formats that are model-specific and unrelated to MCP that tell them what tools they can request calls be made to and what they can expect back.
> an LLM interacting with other tools via MCP still needs system prompts or fine tuning to do so. Both of those things are not predictable or deterministic. They will fail at some point in the future. That is indisputable.
That's not, contrary to your description, a point of contention.
The point of contention is that the validation of data returned by an MCP server against the schema provided by the server is not predictable or deterministic. Confusing these two issues can only happen if you think the model does something with each response that controls whether or not the toolchain validates it, which is impossible, because the toolchain does whatever validation it is programmed to do before the model sees the data. The model has no way to know there is a response until that happens.
Now, can the model make requests that don't fit the toolchain's expectations, due to unpredictable model behavior? Sure. Can the model do dumb things with the post-validation response data, after the toolchain has validated it, mapped it into the model's prompt template, and called the model with that prompt, for the same reason? Abso-fucking-lutely.
Can the model do anything to tell the toolchain not to validate response data for a tool call that the toolchain decided to make on behalf of the model, if the toolchain is programmed to validate the response data against the schema provided by the tool server? No, it can't. It can't even know that the tool was provided by an MCP server and that that might be an issue, nor can it know that the toolchain made the request, nor can it know that the toolchain received a response, until the toolchain has done what it is programmed to do with the response, through the point of populating the prompt template and calling the model with the resulting prompt - by which point any validation it was programmed to do has been done and is an immutable part of history.
>No, they don't. LLM frontends, which are traditional deterministic programs, use MCP to do that, and to find schemas for what should be sent to and expected from the tools.
You are REALLY, REALLY misunderstanding how this works. Like severely.
You think MCP is being used for some other purpose despite the one it was explicitly designed for... which is just weird and silly.
>Confusing these two issues can only happen if you think the model does something with each response that controls whether or not the toolchain validates it
No, you're still just arguing against something no one is arguing for the sake of pretending like MCP is doing something it literally cannot do or fundamentally fix about how LLM's operate.
I promise you if you read this a month from now with a fresh pair of eyes you will see your mistake.
What do you think the `tools/call` MCP flow is between the LLM and an MCP server? For example, if I had the GitHub MCP server configured on Claude Code and prompted "Show me the most recent pull requests on the torvalds/linux repository".
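For concreteness, a hedged sketch of that wire exchange: the JSON-RPC method names ("tools/list", "tools/call") are from the MCP spec, but the tool name and payloads are invented for the GitHub example.

  // On startup, the client asks the server what tools it offers:
  const listRequest = { jsonrpc: "2.0", id: 1, method: "tools/list" };

  // The server's reply advertises each tool together with its JSON Schema:
  const listResponse = {
    jsonrpc: "2.0", id: 1,
    result: {
      tools: [{
        name: "list_pull_requests", // hypothetical GitHub MCP tool name
        inputSchema: {
          type: "object",
          properties: { owner: { type: "string" }, repo: { type: "string" } },
          required: ["owner", "repo"],
        },
      }],
    },
  };

  // When the model asks for the tool, the client (not the model) sends:
  const callRequest = {
    jsonrpc: "2.0", id: 2, method: "tools/call",
    params: { name: "list_pull_requests", arguments: { owner: "torvalds", repo: "linux" } },
  };

The model only ever sees prompt text derived from these messages; the JSON-RPC itself happens between the client and the server.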
Hmm, I'm not sure if everyone (including me) is simply unable to understand what you are saying, but if the MCP client validates the MCP server response against the schema before passing the response to the LLM, then the model doesn't even matter; your MCP client could choose to report an error and interrupt the agentic flow.
That will depend on what MCP client you are using and how they've handled it.
How does the AI bypass the MCP layer to make the request? The assumption is (as I understand it) the AI says “I want to make MCP request XYZ with data ABC” and it sends that off to the MCP interface which does the heavy lifting.
If the MCP interface is doing the schema checks, and tossing errors as appropriate, how is the AI routing around this interface to bypass the schema enforcement?
>How does the AI bypass the MCP layer to make the request
It doesn't. I don't know why the other commenters are pretending this step does not happen.
There is a prompt that basically tells the LLM to use the generated manifest/configuration files. The LLM still has to not hallucinate in order to properly call the tools over JSON-RPC and properly follow the MCP protocol. It then also has to make sense of the structured prompts that define the tools in the MCP manifest/configuration file.
Why this fact is seemingly being lost in this thread, I have no idea, but I don't have anything nice to say about it so I won't :). Other than we're all clearly quite screwed, of course.
MCP is there to make things standard for humans, with expected formats. The LLMs really couldn't give a shit and don't do anything super special in how they interact with MCP configuration files or the protocol (other than some additional fine-tuning, again, to make it less likely to produce the wrong output).
> There is a prompt that basically tells the LLM to use the generated manifest/configuration files.
No, there isn't. The model doesn't see any difference between MCP-supplied tools, tools built in to the toolchain, and tools supplied by any other method. The prompt simply provides tool names, arguments, and response types to the model. The toolchain - a conventional, deterministic program - reads the model's response, finds things that match the model's defined format for tool calls, parses out the call names and arguments, looks up the names in its own internal list of tools to see whether they are internal, MCP-supplied, or other tools, routes the calls appropriately, gathers responses, does any validation it is designed to do, then maps the validated results into where the model's prompt template specifies tool results should go, and calls the model again with a new message appended to the previous conversation context containing the tool results.
What you described is essentially how it works. The LLM has no control over how the inputs & outputs are validated, nor in how the result is fed back into it.
The MCP interface (Claude Code in this case) is doing the schema checks. Claude Code will refuse to provide the result to the LLM if it does not pass the schema check, and the LLM has no control over that.
> > The LLM has no control over how the inputs & outputs are validated
> Which is completely fucking irrelevant to what everyone else is discussing.
Not sure what you think is going on, but that is literally the question this subthread is debating, starting with an exchange in which the salient claims were:
> This is deterministic, it is validating the response using a JSON Schema validator and refusing to pass it to an LLM inference.
> I can't guarantee that behavior will remain the same more than any other software. But all this happens before the LLM is even involved.
> The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
You are describing why MCP supports JSON Schema. It requires parsing & validating the input using deterministic software, not LLMs.
>This is deterministic, it is validating the response using a JSON Schema validator and refusing to pass it to an LLM inference.
No. It is not. You are still misunderstanding how this works. It is "choosing" to pass this to a validator or some other tool, _for now_. As a matter of pure statistics, it will simply not do this at some point in the future on some run.
You are quite wrong. The LLM "chooses" to use a tool, but the input (provided by the LLM) is validated with JSON Schema by the server, and the output is validated by the client (Claude Code). The output is not provided back to the LLM if it does not comply with the JSON Schema, instead an error is surfaced.
I think the others are trying to point out that, statistically speaking, in at least one run the LLM might do something other than choose to use the correct tool; i.e., 1 out of (say) 1 million runs it might do something else.
No, the discussion is about whether validation is certain to happen when the LLM produces something the frontend recognizes as a tool request and the frontend calls a tool on behalf of the LLM, not whether the LLM can choose not to make a tool call at all.
The question is whether, having observed Claude Code validating a tool response before handing the response back to the LLM, you can count on that validation on future calls - not whether you can count on the LLM calling a tool in a similar situation.
The LLM chooses to call a tool, it doesn't choose how the frontend handles anything about that call between the LLM making a tool request and the frontend, after having done its processing of the response (including any validation), mapping the result into a new prompt and calling the LLM with it.
MCP requires that servers providing tools must deterministically validate tool inputs and outputs against the schema.
LLMs cannot decide to skip this validation. They can only decide not to call the tool.
So is your criticism that MCP doesn't specify if and when tools are called? If so then you are essentially asking for a massive expansion of MCP's scope to turn it into an orchestration or workflow platform.
> It is "choosing" to pass this to a validator or some other tool, _for now_.
No, it's not. The validation happens at the frontend before the LLM sees the response. There is no way for the LLM to choose anything about what happens.
The cool thing about having coded a basic ReAct pattern implementation (before MCP, or even models trained on a specific prompt format for tool calls, was a thing; none of that impacts the basic pattern) is that it gives a pretty visceral understanding of what is going on here. All that's changed since is per-model standardization of prompt and response patterns on the frontend<->LLM side and, with MCP, of the protocol for interacting on the frontend<->tool side.
Claude Code isn't a pure LLM, it's a regular software program that calls out to an LLM with an API. The LLM is not making any decisions about validation.
imho it's a fantasy to expect type safe protocols except in the case that both client and server are written in the same (type safe) language. Actually even that doesn't work. What language actually allows a type definition for "ISO-8601 timestamp" that's complete? Everything ends up being some construction of strings and numbers, and it's often not possible to completely describe the set of valid values except by run-time checking, certainly beyond trivial cases like "integer between 0 and 10".
> What language actually allows a type definition for "ISO-8601 timestamp" that's complete?
It is absolutely possible to do this, and to generate client code which complies with ISO-8601 in JS/TS. Large amounts of financial services would not work if this was not the case.
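One hedged way to get there in TypeScript (names invented): a branded type whose only constructor is a runtime check, so an arbitrary string can't silently become a timestamp.

  // A string that has provably survived ISO-8601 validation.
  type Iso8601 = string & { readonly __brand: "Iso8601" };

  function toIso8601(s: string): Iso8601 {
    const d = new Date(s);
    // Round-trip check: the string must parse and reproduce itself in canonical UTC form.
    if (Number.isNaN(d.getTime()) || d.toISOString() !== s) {
      throw new Error(`not a canonical ISO-8601 UTC timestamp: ${s}`);
    }
    return s as Iso8601;
  }

  const ok = toIso8601("2023-08-01T12:00:00.000Z"); // passes
  // toIso8601("1690891200") throws instead of silently becoming a date

Which, to be fair, half concedes the parent's point: the static guarantee is ultimately backed by a runtime check at the boundary.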
You've misunderstood his statement and proven his point.
`DateTime` is not an ISO-8601 type. It can _parse_ an ISO-8601 formatted string.
And even past that, there are Windows-specific idiosyncrasies with how the `DateTime` class implements the parsing of these strings and how it stores the resulting value.
> `DateTime` is not an ISO-8601 type. It can _parse_ an ISO-8601 formatted string.
This is exactly the point: a string is just a data interchange format in the context of a DateTime, and C# provides (as far as I can tell) a complete way of accessing the ISO-8601 specification on the language object. It also supports type-safe generation of clients and client object (or struct) generation from the ISO-8601 string format.
> And even past that, there are Windows-specific idiosyncrasies with how the `DateTime` class implements the parsing of these strings and how it stores the resulting value.
Not really. The Windows statements in the article (and I use this on Linux for financial-services software) relate to the automatic setting of preferences for generated strings. All of these may be set within the code itself.
Generally you'd use a time library to model ISO-8601 dates in a typesafe way. Some fancier languages might have syntactic support for it, but they ultimately serve the same purpose.
At its core, the article was just ramblings from someone being upset that LLMs didn't make things more complicated so that they could charge more billable hours to solve invented corporate problems... Which some people built their career on.
The merchants of complexity are disappointed. It turns out that even machines don't care for 'machine-readable' formats; even the machines prefer human-readable formats.
The only entities on this planet who appreciate so-called 'machine-readability' are bureaucrats; and they like it for the same reason that they like enterprise acronyms... Literally the opposite of readability.
> I can practically see it playing out now: there is some huge incident of some kind, in some system or service with an MCP component somewhere, with some elaborate post-mortem revealing that some MCP server somewhere screwed up
MCP focuses on transport and managing context and doesn't absolve the user for sensibly implementing the interface (i.e. defining a schema and doing schema validation)
this is like saying "HTTP doesn't do json validation", which, well, yeah.
> In healthcare, patient data types get coerced incorrectly, potentially leading to wrong medication dosing recommendations.
May have changed, but unlikely. I worked with medical telemetry as a young man and it was impressed upon me thoroughly how important parsing timestamps correctly was. I have a faint memory, possibly false, of this being the first time I wrote unit tests (and without the benefit of a test framework).
We even accounted for lack of NTP by recalculating times off of the timestamps in their message headers.
And the reasons I was given were incident review as well as malpractice cases. A drug administered three seconds before a heart attack starts is a very different situation than one administered eight seconds after the patient crashed. We saw recently with the British postal service how lives can be ruined by bad data, and in medical data a minute is a world of difference.
> May have changed, but unlikely. I worked with medical telemetry as a young man and it was impressed upon me thoroughly how important parsing timestamps correctly was.
I also work in healthcare, and we've seen HL7v2 messages with impossible timestamps. (E.g., in the spring-forward gap.)
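To make the spring-forward gap concrete, here's a hedged sketch (function name invented; built-in Intl APIs only) that flags wall-clock times no real instant ever maps to:

  // Returns true if some real instant displays as this wall-clock time in the zone.
  function wallClockExists(iso: string, timeZone: string): boolean {
    const target = iso.replace(/[-:T]/g, "").slice(0, 12); // YYYYMMDDHHMM
    const base = Date.parse(iso + "Z"); // pretend the input is UTC, then scan offsets
    for (let offsetH = -14; offsetH <= 14; offsetH += 0.5) {
      const candidate = new Date(base - offsetH * 3_600_000);
      const shown = new Intl.DateTimeFormat("sv-SE", {
        timeZone, year: "numeric", month: "2-digit", day: "2-digit",
        hour: "2-digit", minute: "2-digit", hour12: false,
      }).format(candidate).replace(/\D/g, "").slice(0, 12);
      if (shown === target) return true;
    }
    return false; // no instant maps to it: the time fell in a DST gap
  }

  // 02:30 on 2021-03-14 never happened in New York (clocks jumped 02:00 -> 03:00):
  console.log(wallClockExists("2021-03-14T02:30", "America/New_York")); // false
  console.log(wallClockExists("2021-03-14T03:30", "America/New_York")); // true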
Since we were getting low latency data inside HTTP responses we could work off of the response header clock skew to narrow origination time down to around one second, and that’s almost as good as NTP can manage anyway.
As RPC mechanisms go, HTTP is notable for how few of the classic blunders they made in 1.0 of the spec. Clock skew correction is just my favorite. Technically it exists for cache directives, but it’s invaluable for coordination across machines. There are reasons HTTP 2.0 waited decades to happen. It just mostly worked.
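That trick is simple enough to sketch (hedged; the midpoint assumption and the function name are mine). The Date header has one-second resolution, which matches the "around one second" figure above:

  // Estimate how far a server's clock is from ours using the HTTP Date header.
  async function estimateSkewMs(url: string): Promise<number> {
    const t0 = Date.now();
    const res = await fetch(url, { method: "HEAD" });
    const t1 = Date.now();
    const serverDate = res.headers.get("date");
    if (!serverDate) throw new Error("no Date header");
    // Assume the server stamped Date roughly halfway through the round trip.
    const localMidpoint = t0 + (t1 - t0) / 2;
    return Date.parse(serverDate) - localMidpoint; // positive: server clock ahead
  }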
I can offer a hacking/penetration-testing perspective on this as a security researcher at a security consulting firm: this type of hallucination and trust is one of the biggest things we exploit in our new LLM testing service. Overly agentic systems (one of the OWASP top 10 LLM vulnerabilities) are the most profound and commonly exploited issue we've been able to leverage.
If we can get an internal, sensitive-data-handling agent to ingest a crafted prompt, either via direct prompt injection against a more abstract “parent” agent, or by tainting an input file/URL it’s told to process, we can plant what I have internally coined an “unfolding injection.”
The injection works like a parasitic goal, it doesn’t just trick one agent, it rewrites the downstream intent. As the orchestrator routes tasks to other agents, each one treats the tainted instructions as legitimate and works toward fulfilling them.
Because many orchestrations re-summarize, re-plan, or synthesize goals between steps, the malicious instructions can actually gain fidelity as they propagate. By the time they reach a sensitive action (exfiltration, privilege escalation, external calls), there’s no trace of the original “weird” wording, just a confidently stated, fully-integrated sub-goal.
It’s essentially a supply-chain attack on the orchestration layer: you compromise one node in the agent network, and the rest “help” you without realizing it. Without explicit provenance tracking and policy enforcement between agents, this kind of unfolding injection is almost trivial to pull off. We've been able to compromise entire environments based on the information the agentic system provided us, or simply obtained a bind or reverse shell in cases where the system had CLI access and the ability to figure out its own network constraints.
SSRF has been making a HUGE return in agentic systems, and I'm sad DEF CON and Black Hat didn't really have many talks on this subject this year, because it is an actively evolving security domain and an entirely new method of exploitation. The entire point of agentic systems is non-determinism, but that also makes them a security nightmare. As a researcher, though, this is basically a gold mine of all sorts of new vulnerabilities we'll be seeing.

If you work as a bug bounty hunter and see a new listing for an AI company, I can almost assuredly say you can get a pretty massive payout just by exploiting the innate trust between agents and the internal tools they leverage. Even if you don't have the architecture docs of the agentic system, you can likely prompt-inject the initial task enough to taint the downstream agents: have them list out the orchestration flow by creatively adjusting your prompt for different types of orchestration, for how the company might be doing prompt engineering on each agent's persona and task, and for the limited input validation between them, and then submit a report on it to the parent agent.
When desktop OSes came out, hardware resources were scarce, so all the desktop OSes (DOS, Windows, MacOS) forgot the lessons from Unix: multi-user support, preemptive multitasking, etc. 10 years later PC hardware was faster than workstations from the 90s, yet we're still stuck with OSes riddled with limitations that stopped making sense in the 80s.
When smartphones came out there was this gold rush and hardware resources were scarce so OSes (iOS, Android) again forgot all the lessons. 10 years later mobile hardware was faster than desktop hardware from the 00s. We're still stuck with mistakes from the 00s.
AI is basically doing the same thing. It's all led by very bright 20- and 30-year-olds who weren't even born when Windows was first released.
It's all gold rushes, and nobody does Dutch urban-infrastructure design over decades. Which makes sense, as this is all driven by the US, where long-term planning is anathema.
I’ve been successfully using AI for several months now, and there’s still no way I’d trust it to execute trades, or set the dose on an XRay machine. But startups gonna start. Let them.
I mean, isn't all this stuff up to the MCP author to return a reasonable error to the agent and ask it to repeat the call with amendments to the JSON?
Yes. And this is where culture comes in. The cultures of discipline of the C++ and JavaScript communities are at opposite ends of the spectrum. The concern here is that the culture of interfacing with AI tools, such as MCP, is far closer to the discipline of the JavaScript community than to that of the C++ community.
The fundamental difference is that the JS community believes in finding the happy path that results in something they can sell before they have filled in all those annoying problem areas around it.
If an LLM can be shown to be useful 80% of the time to the JS mindset this is fine, and the remaining 20% can be resolved once we're being paid for the rest, Pareto principle be damned.
Mostly, no. Whether it's the client sending (statically) bad data or the server returning (statically) bad data, schema validation on the other end (assuming it is somehow allowed through by the toolchain on the sending end) should reject it before it gets to the custom code of the MCP server or MCP client.
For arguments that are the right type but wrong because of the state of the universe, yes, the server receiving it should send a useful error message back to the client. But that's a different issue.
This is no different than the argument that C is totally great as long as you just don’t make mistakes with pointers or memory management or indexing arrays.
At some point we have to decide as a community of engineers that we have to stop building tools that are little more than loaded shotguns pointed at our own feet.
No, it's not, because the nature of LLMs means that even if you fully validate your communications with the LLM, statistically anything can happen - so any usage/threat model must already take nasal demons into account.
To me, the article was just rambling about all sorts of made up issues which only exist in the minds of people who never spent any time outside of corporate environments... A lot of 'preventative' ideas which make sense in some contexts but are mis-applied in different contexts.
The stuff about type validation is incorrect. You don't need client-side validation. You shouldn't be using APIs you don't trust as tools, and you can always add instructions about the output format to the LLM to convert between formats.
MCP is not the issue. The issue is that people are using the wrong tools or their prompts are bad.
If you don't like the format of an MCP tool and don't want to give formatting instructions the LLMs, you can always create your own MCP service which outputs data in the correct format. You don't need the coercion to happen on the client side.
> MCP promises to standardize AI-tool interactions as the “USB-C for AI.”
Ironically, it's achieved this - but that's an indictment of USB-C, not an accomplishment of MCP. Just like USB-C, MCP is a nigh-universal connector with very poorly enforced standards for what actually goes across it. MCP's inconsistent JSON parsing and lack of protocol standardization is closely analogous to USB-C's proliferation of cable types (https://en.wikipedia.org/wiki/USB-C#Cable_types); the superficial interoperability is a very leaky abstraction over a much more complicated reality, which IMO is worse than just having explicitly different APIs/protocols.
I'd like to add that the culmination of USB-C failure was Apple's removal of USB-A ports from the latest M4 Mac mini, where visually identical ports on the exact same device now have vastly different capabilities, opaque to the final user of the system months past the initial hype of the release date.
Previously, you could reasonably expect a USB-C on a desktop/laptop of an Apple Silicon device, to be USB4 40Gbps Thunderbolt, capable of anything and everything you may want to use it for.
Now, some of them are USB3 10Gbps. Which ones? Gotta look at the specs or tiny icons, I guess?
Apple could have chosen self-documenting USB-A ports to signify the 10Gbps limitation of some of these ports (conveniently, USB-A tops out at exactly 10Gbps, making it perfect for the use case of having a few extra "low-speed" ports at very little manufacturing cost), but instead they've decided to further dilute the USB-C brand. Pure innovation!
With the end user likely still having to use USB-C to USB-A adapters anyway, because the majority of thumb drives, keyboards, and mice still require a USB-A port - even the USB-C ones that use USB-C on the keyboard/mouse itself. (But, of course, that's all irrelevant, because you can always spend 2x+ as much for a USB-C version of any of these devices; the fact that the USB-C variants are less common or inferior to USB-A is of course irrelevant when hype and fanaticism are more important than utility and usability.)
Agreed. In practice, SOAP was a train wreck. It's amazing how overly complicated they managed to make concepts that should've been simple, all the way down to just XML somehow being radically more complex than it looks to the wacky world of ill-defined standards for things like WSDLs and weird usage of multi-part HTTP and, to top it all off, it was all for nothing, because you couldn't guarantee that a SOAP server written in one language would be interoperable with clients in other languages. (I don't remember exactly what went wrong, but I hit issues trying to use a SOAP API powered by .NET from a Java client. I feel like that should be a pretty good case!)
It doesn't take very long for people to start romanticizing things as soon as they're not in vogue. Even when the painfulness is still fresh in memory, people lament over how stupid new stuff is. Well I'm not a fan of schemaless JSON APIs (I'm one of those weird people that likes protobufs and capnp much more) but I will take 50 years of schemaless JSON API work over a month of dealing with SOAP again.
It’s been a while, but isn’t SOAP just XML over HTTP POST? Seems like all the SOAP stuff I’ve done is just posting lots of XML and getting lots of XML back.
/“xml is like violence, if it’s not working just use more!”
> It’s been a while, but isn’t SOAP just XML over HTTP POST?
No.
SOAP uses that, but SOAP involves a whole lot of spec about how you do that, and that's even before (as the article seems to) treat SOAP as meaning SOAP + the set of WS-* standards built around it.
If it was some vaguely sensibly defined XML, it wouldn't be quite as bad. But it's a ludicrously over-complicated mapping between the service definition and the underlying XML, often auto-generated by a bunch of not very well designed nor compatible tooling.
I have plenty of good things to say, especially since REST (really JSON-RPC in practice) and GraphQL always seem to be catching up to features the SOAP and SOA ecosystems already had.
Unfortunately as usual when a new technology cycle comes, everything gets thrown away, including the good parts.
I dunno, SMTP wasn't bad the last time I had to play with it. In actual use it wasn't entirely trivial, but most of that happened at layers that weren't really the mail transfer protocol's fault (SPF et al.). Although I'm extremely open to that being one exception in a flood of cases where you are absolutely correct :)
I recall two SOAP-based services refusing to talk to each other because one nicely formatted the XML payload and the other didn't like that one bit. There is a lot we lost when we went to json but no, I don't look back at that stuff with any fondness.
And I actually like XML-based technologies. XML Schema is still unparalleled in its ability to compose and verify the format of multiple document types. But man, SOAP was such a beast for no real reason.
Instead of a simple spec for remote calls, it turned into a spec that described everything and nothing at the same time. SOAP supported all kinds of transport protocols (SOAP over email? Sure!), RPC with remote handles (like CORBA), regular RPC, self-describing RPC (UDDI!), etc. And nothing worked out of the box, because the nitty-gritty details of authentication, caching, HTTP response code interoperability and other "boring" stuff were just left as an exercise to the reader.
I'll give a different viewpoint: I hate everything about XML. In fact, one of the primary issues with SOAP was the XML. It never worked well across SOAP libraries. E.g., the .NET and Java SOAP libraries have huge threads on Stack Overflow asking "why is this incompatible", and a whole lot of needing to very tightly specify the schema. It got to the point where it was a flaw; it might sound reasonable to tightly specify something, but there were no reasonable common defaults - hence our complaints about SOAP verbosity and the work needed to make it function.
Part of this is the nature of XML. There's a million ways to do things. Should some data be parsed as an attribute of the tag, or should it be another tag? Perhaps the data should be in the body between the tags? HTML, XML's SGML-derived sibling, has this problem; e.g., you can seriously specify <font face="Arial">text</font> rather than have the font as a property of the wrapping tag. There's a million ways to specify everything and anything, and that's what makes it a terrible data-interchange format. The reader and writer must have the exact same schema in mind, and there's no way to have a default when there's simply no canonical correct way to do things in XML. So everything had to be very precisely specified, to the point where it added huge amounts of work that a non-XML format with decent defaults would not have.
This became a huge problem for SOAP, and it's why I hate it. Every implementation had different default ways of handling even the simplest data structures passed between them, and they were never compatible unless you took weeks to specify the schema down to a fine-grained level.
In general XML is problematic due to the lack of clear canonical ways of doing pretty much anything. You might say "but i can specify it with a schema" and to that i say "My problem with XML is that you need a schema for even the simplest use case in the first place".
Yes, XML has way too much flexibility, with some very dark corners like custom entities, DTDs, and BOMs (byte order marks). It's clearly a child of the 90s, conceived before UTF-8 and before the corrosive world of modern networks.
But parts of XML infrastructure were awesome. I could define a schema for the data types, and have my IDE auto-complete and validate the XML documents as I typed them. I could also validate the input/output data and provide meaningful errors.
And yeah, I also worked with XML and got burned many times by small incompatibilities that always happen due to its inherent complexity. If XML were just a _bit_ simpler, it could have worked so much better.
Granted, your SOAP library probably did the wrong thing there, but you could do surprisingly low-memory XML parsing with a SAX event-based parser. I remember taking the runtime of full DOM parsers down from hours to minutes by rewriting them as SAX parsers.
Ironically what put me entirely off SOAP was a tech presentation on SOAP.
Generally it worked very well when both ends were written in the same programming language and was horseshit if they weren’t. No wonder Microsoft liked SOAP so much.
And that begs the question: why have a spec at all if it is not easily interoperable? If the specification is impossible to implement and understand, just make it language-specific and call it a reference implementation. You can reinvent the wheel and it will be round.
You're missing the most significant lesson of all, which MCP got right: all of those featureful things are way too overcomplicated for most places, so people gravitate to the simple thing. It's why JSON-over-HTTP blobs are king today.
I've been on the other side of high-feature serialization protocols, and even at large tech companies, something like migrating to gRPC is a multi-year slog that can even fail a couple of times because it asks so much of you.
MCP, at its core, is a standardization of a JSON API contract, so you don't have to do as much post-training to generate various tool calling style tokens for your LLM.
So, I just looked it up, thinking I might have overlooked something but, at least according to wikipedia, REST does not prescribe the format of the data transferred. So, I don't understand why you are comparing REST to xml, yaml, json or whatever.
Now, YAML has quite a few shortcomings compared to JSON (if you don't believe me, look at its handling of the string no, discussed on HN), so, at least to me, it's obvious why JSON won.
SOAP, don't get me started on that, it's worth less than XML, protobuf is more efficient but less portable, etc.
XML has been supported in JavaScript for essentially just as long as JSON, arguably longer. Heck, the first in-practice, standardized HTTP request APIs for JavaScript were "XMLHttpRequest" (and similar "XMLHTTP" names). And XHTML is a thing, and it predates both JSON and AJAX.
CORBA emerged in 1991 with another crucial insight: in heterogeneous environments, you can’t just “implement the protocol” in each language and hope for the best. The OMG IDL generated consistent bindings across C++, Java, Python, and more, ensuring that a C++ exception thrown by a server was properly caught and handled by a Java client. The generated bindings guaranteed that all languages saw identical interfaces, preventing subtle serialization differences.
Yeah, the modern JSON-centered API landscape came about as a response to the failures of CORBA and SOAP. It didn’t forget the lessons of CORBA, it rejected them.
I've worked somewhere where CORBA was used very heavily and to great effect - though I suspect the reason for our successful usage was that one of the senior software engineers worked on CORBA directly.
I applied for a job at AT&T using CORBA around 1998 and I think that’s the last time I encountered it other than making JDK downloads slower.
Didn’t get that job; one of the interviewers asked me to write concurrent code and didn’t like my answer, but his had a race condition in it, and I was unsuccessful in convincing him he was wrong. He was relying on preemption not occurring on a certain instruction (or multiprocessing not happening). During my tenure at the job I did take, the real flaws in the Java Memory Model came out, and his answer became very wrong, and mine only slightly.
CORBA got a lot of things right. But it was unfortunately a child of late-80s telecom networks mixed with OOP hype.
So it baked in core assumptions that the network is transparent, reliable, and symmetric. So you could create an object on one machine, pass a reference to it to another machine, and everything is supposed to just work.
Which is not what happens in the real world, with timeouts, retries, congested networks, and crashing computers.
Oh, and the CORBA C++ bindings had been designed before the STL was standardized. So they are a crawling horror; other languages fared better.
Just an interesting bit of trivia, the Large Hadron Collider uses/used (don't know if it still does) CORBA in its distributed control system. (On the control system I worked on we use Sun RPC, which was fine as things go but doesn't have the language support that CORBA has. We used a separate SOAP interface to the system to allow for languages such as Python. Today I'd use gRPC, or the BEAM.)
On a more general note, I see in many critical comments here what I perceive to be a category error. Using JSON to pass data between web client and server, even in more complex web apps, is not the same thing as supporting two-way communications between autonomous software entities that are tasked to do something, perhaps something critical. There could be millions of these exchanges in some arbitrarily short time period, thus any possibility of error is multiplied accordingly, and the effect of any error could cascade if it does not fail early. I really don't believe this is a case where "worse is better." To use an analogy, yes, everyday English is a versatile language that works great for most use cases; but when you really need to nail things down, with no tolerance for ambiguity, you get legalese or some other jargon. Or CORBA, or gRPC, etc.
> but when you really need to nail things down, with no tolerance for ambiguity, you get legalese
If only that were true. Litigation happens every single day over the meanings of contracts and laws that were drafted by well-trained and experienced attorneys.
The law is a much more ambitious attempt at formalization than any programming language, hence the more dramatic failures.
Comparatively, programming languages are very constrained. The environments in which they are interpreted and executed are far better understood than any human courtroom.
Your point is an interesting one but it’s painting with too broad a brush.
MCP is flawed, but it learned one thing correctly from years of RPC: complexity is the biggest time sink and holds back adoption relative to simpler competing standards (cf. XML vs JSON).
- SOAP - interop requires supporting document-style or RPC-style bindings between systems, or a combination; XML and its schemas are also horribly verbose.
- CORBA - the libraries and frameworks were complex; modern languages at the time avoided them in favor of simpler alternatives (e.g. Java's Jini).
- gRPC - designed for speed, not readability; requires mappings.
It's telling that these days REST and JSON (via req/resp, webhooks, or even streaming) are the modern backbone of RPC. The above standards are either shoved aside or, in gRPC's case, used only where extreme throughput is needed.
Since REST and JSON are the plat du jour, MCP probably aligns with that design paradigm rather than the dated legacy protocols.
How is this browser-specific, or where is it stated to be browser-only?
The technologies can be purely "enterprise integration" of backend services.
When was Swagger (OpenAPI), for example, ever forbidden from being used for RPC?
E.g. an endpoint that doesn't just support a CRUD op but takes an event with an operation to execute?
I read this thrice: ...When OpenAI bills $50,000 for last month’s API usage, can you tell which department’s MCP tools drove that cost? Which specific tool calls? Which individual users or use cases?...
It seems to be a game of catch-up for most things AI. That said, my school of thought is that certain technologies are just too big to be figured out early on - web frameworks, blockchain, ...
- the gap starts to shrink eventually. With AI, we'll just have to keep sharing ideas and caution like you have here.
Such very interesting times we live in.
The author seems to fundamentally misunderstand how MCPs are going to be used and deployed.
This is really obvious when they talk about tracing and monitoring, which seem to be the main points of criticism anyway.
They bemoan that they can't trace across MCP calls, assuming somehow there would be a person administering all the MCPs.
Of course each system has tracing in whatever fashion fits its system.
They are just not the same system, nor owned by the same people let alone companies.
Same with monitoring cost. Oh, you can't know who racked up the LLM costs? Well of course you can; these systems are already in place and there are a million ways to do this. It has nothing to do with MCP.
Reading this, I think it's rather a blessing to start fresh, without the learnings of 40 years of failed protocols or whatever.
Am I the only person who thinks it's a bad idea for a hallucinating AI to have anything to do with financial services, healthcare, or manufacturing? Doesn't really matter what the RPC protocol looks like if the meat and potatoes of the tech will confidently make shit up. You can't hand-wave this away by saying "oh it's a reasoning AI". Albert Einstein may be real smart, but it's still a bad idea to take financial advice from him when he drops acid.
OpenAPI (or its Swagger predecessor) or Proto (I assume by this you mean protobuf?) don't cover what MCP does. It could have layered over them instead of using JSON-RPC, but I don't see any strong reason why they would be better than JSON-RPC as the basis (Swagger has communication assumptions that don't work well with MCP's local use case; protobuf doesn't cover communication at all and would require additional consideration in the protocol layered over it.)
You'd still need basically the entire existing MCP spec to cover the use cases if it replaced JSON-RPC with Swagger or protobuf, plus additional material to cover the gaps and complications that that switch would involve.
Not much, really. A lot of LLMs need system prompts or fine-tuning to reliably use MCP (though to be clear you don't NEED to do either of those things; it just dramatically increases the reliability of the LLM).
It's amusing to watch people refer to MCP as a set of tools, or a framework, or an SDK you can invoke, or something or other across a wide range of forums. It's just a standard. A convention. Calling it a protocol is a stretch as well. But there's no meat to it, really.
If you just used Rest API's, you'd need to create little "tools" (say, another executable) locally that the LLM can invoke that can call those API's. MCP standardizes what those tools should act like and their overall lifecycle model.
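To make that concrete, here's roughly the shape a server's advertised tool list takes on the wire - a sketch with a hypothetical tool, not any real server's output:

```typescript
// Roughly the shape of an MCP tools/list result. The server advertises
// tools plus JSON Schemas for their arguments; the client (not the LLM)
// reads this and decides how to surface the tools to the model.
const toolsListResult = {
  tools: [
    {
      name: "get_weather",                       // hypothetical tool
      description: "Current weather for a city",
      inputSchema: {                             // JSON Schema for the arguments
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  ],
};
```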
The references to it being like USB are also quite frankly absurd and delusional.
But that's the caliber of developer we're dealing with today.
Proto has a full associated spec (gRPC) covering communication protocols and structured definitions for them. MCP could easily have built upon these and gotten a lot "for free". Generally gRPC is better than JSON-RPC (see below).
I agree that swagger leaves a lot unplanned. I disagree about the local use case because (1) we could just run local HTTP servers easily and (2) I frankly assume the future of MCP is mostly remote.
Returning back to JSON-RPC, it's a poorly executed RPC protocol. There is an excellent HackerNews thread on it, but the TL;DR is that parsing JSON is expensive and complex; we have tons of tools (e.g. load balancers) that make up modern services, and making those tools parse JSON is very expensive. Many people in that thread mention alternative ways to implement JSON-RPC, but those depend on new clients.
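For anyone who hasn't looked at the wire format, here's a minimal JSON-RPC 2.0 exchange (method and result shapes follow MCP's tools/call convention as I understand it; the tool name is illustrative):

```typescript
// JSON-RPC puts the operation inside the body, so a load balancer that wants
// to route or rate-limit by operation must parse the JSON; with plain HTTP
// it could act on the method and path alone.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "get_weather", arguments: { city: "Lisbon" } },
};

const response = {
  jsonrpc: "2.0",
  id: 1,
  result: { content: [{ type: "text", text: "18°C and clear" }] },
};
```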
Eh... No, it does not support streaming responses.
I know this because I wish it did. You can approximate streaming responses by using progress notifications. If you want something like the LLM partial response streaming, you'll have to extend MCP with custom capabilities flags. It's totally possible to extend it in this way, but then it's non standard.
Perhaps you are alluding to the fact that it's a bidirectional protocol (by spec, at least).
That's transport and message passing. The response isn't streamed; it's delivered as a single message when the task is complete. Don't be confused by the word "Streamable". That's just there because it's using SSE to stream a series of JSON-RPC messages from the Server to the Client. But the Response to any specific Request is a single monolithic message. In this space, an LLM that supports streaming sends the response to a request as partials as they are generated. This allows you to present results faster and give a lower perceived latency. MCP *does not* support this by the current specs. As I said, you can extend MCP and provide these partials in ProgressNotification messages. Then you are using a non-standard spec extension.
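For the curious, the workaround looks roughly like this - field names per my reading of the current spec, so treat it as illustrative:

```typescript
// The server can emit these while working, but the Response to the Request
// is still a single monolithic message at the end. This is progress
// signaling, not partial-content streaming.
const progressNotification = {
  jsonrpc: "2.0",
  method: "notifications/progress",
  params: { progressToken: "req-42", progress: 512, total: 2048 },
};
```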
Author disregards why none of these technologies are relevant in the modern web.
Sure, they might still find themselves in highly regulated industries where risk avoidance trumps innovation everyday, all day.
MCP is for _the web_; it started with stdio only because Anthropic was learning lessons from building Claude Code.
Author also seems to expect that the result from MCP tool usage will feed directly into an LLM. This is preposterous and a recipe for disaster. Obviously you'd validate structured responses against a schema, check for harmful content, etc. etc.
I am not sure what you mean. Stateless RPCs, cache controls, client-side typing, tracing/observability, and bidirectional streaming are all things that strike me as very relevant to the modern web for all but the smallest of toy projects, let alone projects in serious engineering organizations.
> Author also seems to expect that the result from MCP tool usage will feed directly to an LLM
Isn't this exactly what MCP is for? Most tools I've come across are to feed context from other sources directly to the LLM. I believe this is the most common use-case for the protocol.
So far every single use of mcp's I've seen in the wild is that the response is sent straight to the LLM without doing any validation. Seems reasonable for the Author to expect that when it's exactly what is happening.
Many great points. I think we are thinking about MCP the wrong way.
The greater problem is industry misunderstanding and misalignment with what agents are and where they are headed.
Web platforms of the world believe agents will be embedded in networked distributed infrastructure. So we should ship an MCP platform in our service mesh for all of the agents running in containers to connect to.
I think this is wrong, and continues to be butchered as the web pushes a hard narrative that we need to enable web-native agents & their sdks/frameworks that deploy agents as conventional server applications. These are not agents nor the early evolutionary form of them.
Frontier labs will be the only providers of the actual agentic harnesses. And we are rapidly moving to computer-use agents - MCP servers were intended to serve as single-instance deployments for single harnesses, i.e. a single MCP server on my desktop for my Claude Desktop.
So I'm in the "MCP is probably not a great idea" camp but I couldn't say "this is how it SHOULD be done", and the author makes great criticisms but falls short of actual suggestions. I'm assuming the author is not seriously recommending we go back to SOAP and I've never heard of CORBA. I've heard of gRPC but I can't tell if the author is saying it is good or bad.
Also Erlang uses RPCs for pretty much all "synchronous" interactions but it's pretty minimal in terms of ceremony. Seems pretty reliable.
So this is a serious question because hand rolling "40 years" of best practices seems hard, what should we be using for RPC?
To answer your serious question: gRPC is actually not a bad choice if you are deciding at the beginning of your project. Migrating over to it is going to be a challenge if you were using something else, because it's pretty opinionated, but if you have a clean sheet that's what I would use. Cap'n Proto or Thrift are also probably good choices. These are all solid RPC frameworks that give you everything you need out of the box, at the expense of more complicated builds.
I am torn. I see this argument and intellectually agree with it (that interfaces need to be more explicit). However it seems that every time there is a choice between “better” design and “good enough”, the “good enough” wins handily.
Multics vs Unix, xml based soap vs json based rest apis, xhtml’s failure, javascript itself, … I could keep going on.
So I’ve resigned myself to admitting that we are doomed to reimplement the “good enough” every time, and continue to apply bandaid after bandaid to gradually fix problems after we rediscover them, slowly.
It depends on system boundaries. If an actor doesn't face the consequences of a decision, but another does, that's an externality. When externalities are present, it is often rational (narrowly speaking) for an actor to accept designs that look awful from a broader perspective.
In other words, many technical problems flow rather predictably from decision-making boundaries that don't internalize the externalities.
Ever heard someone say "if you care about X, run for office"? The same applies to technology. If one cares about good designs, one must promote organizational and societal structures that actually have a fighting chance at bringing those about.
The days of nerds and hackers not caring about broader dynamics and structures are long gone. Sitting back and letting the business folks have control is fine if you want them to optimize for the existing incentives. But if you want to change the rules of the game, you gotta jump in at the deep end.
- a standardized way in which the costs associated with an MCP tool call can be communicated to the MCP Client and reported to central tracking - nothing here I see, but it's a really good idea!
- serialization issues e.g. "the server might report a date in a format unexpected by the client" - this isn't wrong, but since the consumer of most tool responses is itself an LLM, there's a fair amount of mitigation here. And in theory an MCP Client can use an LLM to detect under-specified/ambiguous tool specifications, and could surface these issues to the integrator.
Now, I can't speak to the speed at which Maintainers and Core Maintainers are keeping up with the community's momentum - but I think it's meaningful that the community has momentum for evolving the specification!
I see this post in a highly positive light: MCP shows promise because you can iterate on these kinds of structured annotations, in the context of a community that is actively developing their MCP servers. Legacy protocols aren't engaging with these problems in the same way.
One of the MCP Core Maintainers here. I want to emphasize that "If you see something, say something" very much works with the MCP community - we've recently standardized on the Spec Enhancement Proposal (SEP) process, and are also actively (and regularly) reviewing the community proposals with other Core Maintainers and Maintainers. If there is a gap - open an issue or join the MCP Contributor Discord server (open for aspiring and established contributors, by the way), where a lot of contributors hang out and discuss on-deck items.
These aren’t third-party libraries, these are RFP processes for adding to the official protocol, and standardizing the semantics of new fields and data types. A world of difference IMO.
MCP is not a protocol. It doesn't protocolize anything of use. It's just "here are some symbols, do with them whatever you want", leaving it at that but then advertising that as a feature, as universality. It provides about as much of a protocol as TCP does, but rebuilt on top of 5 OSI layers, again.
It's not a security issue, it's an ontological issue.
That being said, MCP as a protocol has a fairly simple niche: provide context that can be fed to a model to perform some task. MCP covers the discovery process around presenting those tools and resources to an Agent in a standardized manner. And it includes several other aspects that are useful in this niche, things like "sampling" and "elicitations". Is it perfect? Not at all. But it's a step in the right direction.
The crowd saying "just point it at an OpenAPI service" does not seem to fully understand the current problem space. Can many LLMs extract meaning from un-curated API response messages? Sure. But they are also burning up context holding junk that isn't needed. Part of MCP is the acknowledgement that general API responses aren't the right way to feed the model the context it needs. MCP is supposed to be taking a concrete task, performing all the activities needed to gather the info or effect the change, then generating clean context meant for the LLM. If you design an OpenAPI service around those same goals, then it could easily be added to an Agent. You'd still need to figure out all the other aspects, but you'd be close. But at that point you aren't pointing an Agent at a random API, you're pointing it at a purpose-made API. And then you have to wonder: why not something like MCP that's designed for that purpose from the start?
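A toy illustration of that curation difference, with made-up data:

```typescript
// What a general API hands back: deep, verbose, mostly irrelevant to the task.
const rawApiResponse = {
  id: 9182,
  self: "https://api.example.com/tickets/9182",
  fields: {
    status: { id: 3, name: "Open", statusCategory: { key: "new" } },
    assignee: { accountId: "ab12", displayName: "Kim" },
    // ...dozens more fields the model will never need
  },
};

// What a purpose-built MCP tool might return instead: clean context only.
const curatedToolResult = "Ticket 9182 is Open and assigned to Kim.";
```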
I'll close by saying there are an enormous number of MCP Servers out there that are poorly written, thin wrappers on general APIs, or have some other bad aspects. I attribute a lot of this to the rise of AI Coding Agents, which let people with poor comprehension of the space crank out this... noise.
There are also great examples of MCP Servers to be found. They are the ones that have thoughtful designs, leverage the spec fully, and provide nice clean context for the Agent to feed to the LLM.
I can envision a future where we can simply point an agent at a series of OpenAPI services and the agent uses its models to self-assemble what we consider the MCP server today. Basically, it would curate access to the APIs into a set of focused tools and the code needed to generate the final context. That's not quite where we are today. It's likely not far off though.
Otherwise the larger picture is that MCP is a land grab for building an eco-system around integrations to get access to data. Your LLM agent is not valuable if it can't access things for you... and from a market perspective enterprise pays a lot for this stuff already, and yes MCP is not thought out at all for Enterprise really... At least thankfully they added stateless connections to the spec...
IMO worrying about type-safety in the protocol when any string field in the reply can prompt-inject the calling LLM feels like putting a band-aid on a decapitation, but YMMV
OS X is easier to use than Mac... whatever they named their old versions.
It goes on and on. I can have 50 browser tabs open at the same time, each one hosting a highly complicated app, ranging from media playback to chat rooms to custom statistical calculators. I don't need to install anything for any of these apps, I just type in a short string in my url bar. And they all just work, at the same time.
I have to keep buying a new computer every few years because software keeps getting slower. That machine still costs a thousand dollars, and my "low-end" internet connection now costs $100. My hypertext document viewer requires at least 8GB of RAM for normal use, with eight CPU cores. No new network protocol can exist without being tunneled over the hypertext document viewer's stateless application-layer network protocol. All of this so that I can click on a screen to read some text.
The supercomputers in our pockets (that used to be telephones, but don't work well for that anymore) will let us run the programs that one of two companies allow us to run, which will run most apps... as long as the hardware is as recent as our laptops/desktops.
Yes, we're very advanced. In the past 20 years, we have achieved... the same thing we had 20 years ago... only with more hardware requirements, programming languages, and frameworks. Today you can do anything... as long as it's on a web page, on recent hardware (and God help you if you haven't updated your software in the past month)
Has there ever been a time, going back to the 70s with original PCs, where new software didn't necessitate a new computer?
Things are also getting better now that Intel is dying. I mean, the new Apple silicon chips are astoundingly fast and energy efficient, an M1 from 5 years ago is still going strong and probably won't truly need replacing for another 2. Similar for Ryzen chips from 5 years ago!
Things have changed a lot in 20 years. In 2005 we didn't consume all of our video / audio media online. We didn't have social media, just blogs and RSS readers. YouTube had just been released. TikTok, Facebook and Twitter didn't exist. Hypermedia today is very rich and necessitates a lot of resources. But at the same time, most work the past 10 years has been on native apps (on mobile particularly but also PCs), not web sites. Most people don't use the web browser as much.
> MCP discards this lesson, opting for schemaless JSON with optional, non-enforced hints.
Actually, MCP uses a normative TypeScript schema (and, from that, an autogenerated JSON Schema) for the protocol itself, and the individual tool calls also are specified with JSON Schema.
> Type validation happens at runtime, if at all.
That's not a consequence of MCP "opting for schemaless JSON" (which it factually does not); it is, for tool calls, a consequence of MCP being a discovery protocol where the tools, and thus the applicable schemas, are discovered at runtime.
If you are using MCP as a way to wire up highly-static components, you can do discovery against the servers once they are wired up, statically build the clients around the defined types, and build your toolchain to raise errors if the discovery responses change in the future. But that's not really the world MCP is built for. Yes, that means that the toolchain needs, if it is concerned about schema enforcement, use and apply the relevant schemas at runtime. So, um, do that?
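"Do that" is a few lines of client code. A minimal sketch using Ajv, a common JSON Schema validator (the helper and the inlined schema here are mine, purely illustrative):

```typescript
import Ajv from "ajv";

const ajv = new Ajv();

// A schema as it might be discovered from a server at tools/list time
// (inlined here for brevity; in real use you'd compile what discovery returns).
const discoveredSchema = {
  type: "object",
  properties: { temperature_c: { type: "number" } },
  required: ["temperature_c"],
};

const validate = ajv.compile(discoveredSchema);

// Fail cleanly instead of letting a model improvise around bad data.
function checkToolResult(result: unknown): void {
  if (!validate(result)) {
    throw new Error(`Tool result failed validation: ${ajv.errorsText(validate.errors)}`);
  }
}

checkToolResult({ temperature_c: 18 });     // passes
checkToolResult({ temperature_c: "warm" }); // throws
```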
I think this article is missing the point that MCP is simply using the mainstream building blocks that have already regressed from what we had previously, namely JSON in place of proper RPC.
The ISO8601 v Unix epoch example seems very weak to me. I'd certainly expect any model to be capable of distinguishing between these things, so, it doesn't seem like a big deal that either one would be allowed in a JSON.
Honestly, my view that nothing of value ever gets published on medium, is strongly reinforced here.
The fact that the model can recognize a Unix timestamp when it sees one doesn't really help you if it then tries to work around the API mismatch by helpfully converting the timestamp into a hallucinated ISO date.
Why would a hype protocol use outdated concepts instead of the hype JSON?
Wouldn't medical records actually be better in JSON, because the field could expressly have a "kg" or "lb" suffix within the value of the field itself, or even in the name of the field, like "weight-in-kg" or "weight-in-lb"? This is actually the beauty of JSON compared to other formats, where these things may end up being just a unitless integer.
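As data, the difference looks like this (illustrative values only):

```typescript
const ambiguous = { weight: 154 };                           // kg? lb? Who knows.
const selfDescribing = { "weight-in-kg": 69.9 };             // unit in the field name
const structured = { weight: { value: 69.9, unit: "kg" } };  // unit in the value
```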
The biggest problem with medical data would probably remain the human factor, where regardless of the format used by the machines and by MCP, the underlying data may already be incorrect or not coded properly, so, if anything, AI would likely have a better chance of interpreting the data correctly than the API provider blindly mislabelling unitless data.
On that note; some of these “best practices” arguably haven’t worked out. “Be conservative with what you send, liberal with what you receive” has turned even decent protocols into a dumpster fire, so why keep the charade going?
TCP is basically the only example of that principle that works and it only works because the protocol is low level and constrained. Almost all the implementations of that principle from close to the app layer are abominations we're barely keeping running.
The stuff about the utility of machine-readable Web Service Description Language got me rolling my eyes.
WSDL is just pure nonsense. The idea that software would need to decide which API endpoints it needs on its own, is just profoundly misguided... Literally nobody and nothing ever reads the WSDL definitions; it's just poor man's documentation, at best.
LLMs only reinforce the idea that WSDL is a dumb idea because it turns out that even the machines don't care for your 'machine-friendly' format and actually prefer human-friendly formats.
Once you have an MCP tool working with a specific JSON API, it will keep working unless the server makes breaking changes to the API while in production, which is terrible practice. But anyway, if you use a server, it means you trust the server. Client-side validation is dumb, like people who need to put tape over their mouths because they don't trust themselves to follow through on their diet plans.
WSDLs are routinely used to generate the language bindings for the SOAP actions. WSDL being language-agnostic ensures that bindings in different languages, and/on the client vs the server side, are consistent with each other.
WSDLs being available from the servers allows (a) clients to validate the requests they make before sending them to the server, and (b) developers (or in principle even AI) with access to the server to create a client without needing further out-of-band specifications.
> WSDL being language-agnostic ensures that bindings in different languages, and/on the client vs the server side, are consistent with each other.
In theory. In reality, Java could talk to Java, M$ stuff could talk to other M$ stuff, and pretty much everyone else was left out in the cold. Consistent cross-language interop never actually happened, despite the claims that it would.
I think this is unwise. There are a lot of things that clients need to take into account which cannot be described by WSDLs (e.g. timing related or language specific considerations which require careful thinking through).
I don't buy this idea that code should be generated automatically without a human involved (at least as a reviewer).
I also don't buy the idea that clients should validate their requests before sending to the server. The client's code should trust itself. I object to any idea of code (or any entity) not trusting itself. That is a flawed trust model.
It "somehow works" isn't engineering.
But it sure is fast.
So very much like an LLM accessing multiple pieces of functionality across different tools and API endpoints (if you want to imagine it that way).
While it is seemingly very knowledgeable, it is rather stupid. It gets duped by nefarious actors, or has elementary classes of bugs that put the crew into awkward positions.
Most professional software engineers might previously have looked at these scenarios as implausible, given that the "failure model" of current software is quite blunt, and especially given how far into the future the series took place.
Now we see that computational tasks are becoming less predictable and less straightforward, with cascading failures instead of blunt, direct failures. Interacting with an LLM when it starts to hallucinate might be compared to talking with a person in psychosis.
So you get things like this in the Star Trek universe: https://www.youtube.com/watch?v=kUJh7id0lK4
Which make a lot more sense, become a lot more plausible and a lot more relatable with our current implementations of AI/LLMs.
I wanted to add that in Star Trek they always talk in technobabble, things like "Computer, create a matrix from a historic person who was knowledgeable in a specialized surgery field", and then the Hologram room creates that avatar's approximation, with the programming and simulated/hallucinated expertise.
The holodeck is a special kind of weird because sooo many accidents happen due to sloppy coding: the ship's AI creates flawed programs that later hurt the crew members when safety protocols fail or are ignored/bypassed - which we now see mirrored in the rising field of red-team prompt engineering.
Additionally, in Star Trek instead of coding on tablets, they usually just show analytics data or debug views of what the ship's computer created. The crew never actually code on a computer, and if they do they primarily just "vibe code" it by saying absurd things like "Computer, analyze the enemy ship's frequency and create a phasing shield emitter to block their phasers" (or something like that) and the computer generates those programs on the fly.
The cool part that I liked the most is when Voyager's neural packs (think of them as the AI-to-system control adapters) actually got sick with a biological virus because they were essentially made out of brain matter.
I liked that part too. I hadn't paid much attention before, but that was a fun revelation that the computer is run by a bunch of brain-tissue pouches. The LLM's "guts", so to speak, are pretty much a collection of brain-tissue clumps semantically, with weights and connections, as opposed to some database of logical assertions like the expert systems people envisioned in the 1980s.
The author even later says that MCP supports JSON Schema, but also claims "you can't generate type-safe clients". Which is plainly untrue, there exist plenty of JSON Schema code generators.
Claude will happily cast your int into a 2023 Toyota Yaris and keep on hallucinating things.
> Cast an integer into the type of a 2023 Toyota Yaris using Javascript
(GPT-4o mini)
> To cast an integer into the type of a 2023 Toyota Yaris in JavaScript, you would typically create a class or a constructor function that represents the Toyota Yaris. Then, you can create an instance of that class using the integer value. Here's an example of how you might do this:
Claude Code validated the response against the schema and did not pass the response to the LLM.
It works in this instance. On this run. It is not guaranteed to work next time. There is an error percentage here that makes it _INEVITABLE_ that eventually, with enough executions, the validation will pass when it should fail.
It will choose not to pass this to the validator, at some point in the future. It will create its own validator, at some point in the future. It will simply pretend like it did any of the above, at some point in the future.
This might be fine for your B2B use case. It is not fine for underlying infrastructure for a financial firm or communications.
Can you guarantee it will validate it every time? Can you guarantee the way MCPs/tool calling are implemented (which is already an incredible joke that only python-brained developers would inflict upon the world) will always go through the validation layer? Are you even sure which part of Claude handles this validation? Sure, it didn't cast an int into a Toyota Yaris. Will it cast "70Y074" into one? Maybe a 2022 one. What if there are embedded parsing rules in a string - will it respect them every time? What if you use it outside of Claude Code, but just ask nicely through the API - can you guarantee this validation still works? Or that they won't break it next week?
The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
Yes, to the extent you can guarantee the behavior of third-party software, you can (which you can't really guarantee no matter what spec the software supposedly implements, so the gaps aren't an MCP issue). "The app enforces schema compliance before handing the results to the LLM" is deterministic behavior in the traditional app that provides the toolchain - the interface between tools (and the user) and the LLM - not non-deterministic behavior driven by the LLM. Hence, "before handing the results to the LLM".
> The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
The toolchain is parsing, validating, and mapping the data into the format preferred by the chosen model's prompt template; the LLM has nothing to do with that, because it by definition has to happen before the LLM can see the data.
You aren't trusting the LLM.
The LLM has everything to do with that. The LLM is literally choosing to do that. I don't know why this point keeps getting missed or side-stepped.
It WILL, at some point in the future and given enough executions, as a matter of statistical certainty, simply not do that, or pretend to do it, or do something totally different.
No, the LLM doesn't control on a case-by-case basis what the toolchain does between the LLM putting a tool call request in an output message and the toolchain calling the LLM afterwards.
If the toolchain is programmed to always validate tool responses against the JSON schema provided by MCP server before mapping into the LLM prompt template and calling the LLM again to handle the response, that is going to happen 100% of the time. The LLM doesn't choose it. It CAN'T because the only way it even knows that the data has come back from the tool call is that the toolchain has already done whatever it is programmed to do, ending with mapping the response into a prompt and calling the LLM again.
Even before MCP, or even models specifically trained for tool calling with vendor-provided templates (but after the ReAct architecture was described), it was like a weekend project to implement a basic framework supporting tool calling around a local or remote LLM. I don't think you need to do that to understand how silly the claim is that the LLM controls what the toolchain does with each response and might make it skip validation, but doing it will certainly give you a visceral understanding of how silly it is.
The pieces here are:
* Claude Code, a Node (Javascript) application that talks to MCP server(s) and the Claude API
* The MCP server, which exposes some tools through stdin or HTTP
* The Claude API, which is more structured than "text in, text out".
* The Claude LLM behind the API, which generates a response to a given prompt
Claude Code is a Node application. CC is configured in JSON with a list of MCP servers. When CC starts up, CC's JavaScript initialises each server and, as part of that, gets a list of callable functions.
When CC calls the LLM API with a user's request, it's not just "here are the user's words, do it". There are multiple slots in the request object, one of which is a "tools" block: a list of the tools that can be called. Inside the API, I imagine this is packaged into a prefix context string like "you have access to the following tools: tool(args) ...". The LLM API probably has a bunch of prompts it runs through (figure out what type of request the user has made, maybe using different prompts to make different types of plan, etc.) and somewhere along the way the LLM might respond with a request to call a tool.
The LLM API call then returns the tool call request to CC, in a structured "tool_use" block separate from the freetext "hey good news, you asked a question and got this response". The structured block means "the LLM wants to call this tool."
CC's JS then calls the server with the tool request and gets the response. It validates the response (e.g., JSON schemas) and then calls the LLM API again bundling up the success/failure of the tool call into a structured "tool_result" block. If it validated and was successful, the LLM gets to see the MCP server's response. If it failed to validate, the LLM gets to see that it failed and what the error message was (so the LLM can try again in a different way).
The idea is that if a tool call is supposed to return a CarMakeModel string ("Toyota Tercel") and instead returns an int (42), JSON Schemas can catch this. The client validates the server's response against the schema, and calls the LLM API with a structured "tool_result" block reporting the failure rather than the raw invalid payload.
So the LLM isn't choosing to call the validator; it's the deterministic JavaScript that is Claude Code that chooses to call the validator. There are plenty of ways for this to go wrong: the client (Claude Code) has to validate; int vs string isn't the same as "is a valid timestamp/CarMakeModel/etc"; if you helpfully put the thing that failed into the error message ("Expect string, got integer (42)") then the LLM gets 42 and might choose to interpret that as a CarMakeModel if it's having a particularly bad day; the LLM might say "well, that didn't work, but let's assume the answer was Toyota Tercel, a common car make and model", ... We're reaching here, yet these are possible.
But the basic flow has validation done in deterministic code and hiding the MCP server's invalid responses from the LLM. The LLM can't choose not to validate. You seemed to be saying that the LLM could choose not to validate, and your interlocutor was saying that was not the case.
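Condensed into code, the loop looks something like the sketch below. `callModel` and `callTool` are hypothetical stand-ins for the Claude API and an MCP server call - this is a sketch of the flow described above, not Claude Code's actual source:

```typescript
import Ajv, { Schema } from "ajv";

type ModelReply = { toolUse?: { name: string; args: object }; text?: string };

const ajv = new Ajv();

// The deterministic toolchain loop. Validation sits between the tool and the
// model; the model cannot skip it, because it only learns a result exists
// when this code calls it again with the (already validated) tool_result.
async function runTurn(
  messages: object[],
  schemas: Record<string, Schema>,
  callModel: (msgs: object[]) => Promise<ModelReply>,
  callTool: (name: string, args: object) => Promise<unknown>,
): Promise<string | undefined> {
  const reply = await callModel(messages);
  if (!reply.toolUse) return reply.text; // no tool requested; we're done

  const raw = await callTool(reply.toolUse.name, reply.toolUse.args);
  const validate = ajv.compile(schemas[reply.toolUse.name]);

  const toolResult = validate(raw)
    ? { type: "tool_result", ok: true, content: raw }
    : { type: "tool_result", ok: false, error: ajv.errorsText(validate.errors) };

  return runTurn([...messages, toolResult], schemas, callModel, callTool);
}
```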
I hope this helps!
No, they're literally just skipping an entire step in how LLMs actually "use" MCP.
MCP is just a standard, largely for humans. LLMs do not give a singular fuck about it. Some might be fine-tuned for it to decrease erroneous output, but at the end of the day it's just system prompts.
And respectfully, your example misunderstands what is going on:
>* The Claude API, which is more structured than "text in, text out".
>* The Claude LLM behind the API, which generates a response to a given prompt
No. That's not what "this" is. LLMs use MCP to discover tools they can call, aka function/tool calling. MCP is just an agreed-upon format; it doesn't do anything magical. It's just a way of aligning the structure across companies, teams, and people.
There is not an "LLM behind the API"; while a specific tool might implement its overall feature set using LLMs, that's totally irrelevant to what's being discussed and the principal point of contention.
Which is this: an LLM interacting with other tools via MCP still needs system prompts or fine tuning to do so. Both of those things are not predictable or deterministic. They will fail at some point in the future. That is indisputable. It is a matter of statistical certainty.
It's not up for debate. And an agreed upon standard between humans that ultimately just acts as convention is not going to change that.
It is GRAVELY concerning that so many people are trying to use technical jargon they are clearly ill-equipped to use. The magic rules all.
No, you are literally misunderstanding the entire control flow of how an LLM toolchain uses both the model and any external tools (whether specified via MCP or not, but the focus of the conversation is MCP).
> MCP is just a standard, largely for humans.
The standard is for humans implementing both tools and the toolchains that call them.
> LLM's do not give a singular fuck about it.
Correct. LLM toolchains - which, if they can connect to tools via MCP, are also MCP clients - care about it. LLMs don't care about it because the toolchain is the thing that actually calls both the LLM and the tools. And that's true whether the toolchain is a desktop frontend with a local, in-process llama.cpp backend for running the LLM, or the Claude Desktop app with a remote connection to the Anthropic API for calling the LLM, or whatever.
> Some might be fine tuned for it to decrease erroneous output,
No, they aren't. Most models that are used to call tools now are specially trained for tool calling, with a well-defined format for requesting tool calls from the toolchain and receiving results back from it (though this isn't necessary for tool calling to work; people were using the ReAct pattern in toolchains to do it with regular chat models, without any training or prespecified prompt/response format for tool calls, just by having the toolchain inject tool-related instructions into the prompt and read LLM responses to see if the model was asking for tool calls). None of them that exist now are fine-tuned for MCP, nor do they need to be, because they literally never see it. The toolchain reads LLM responses, identifies tool call requests, takes any that map to tools defined via MCP and routes them down the channel (HTTP or subprocess stdio) specified by the MCP server, and does the reverse with responses from the MCP server, validating responses and then mapping them into a prompt template that specifies where tool responses go and how they are formatted. It does the same thing (minus the MCP parts) for tools that aren't specified by MCP (frontends might have their own built-in tools, or other mechanisms for custom tools that predate MCP support). The LLM doesn't see any difference between MCP tools, other tools, or a human reading the message with the tool request and manually creating a response that goes directly back.
> LLM's use MCP to discover tools they can call,
No, they don't. LLM frontends, which are traditional deterministic programs, use MCP to do that, and to find schemas for what should be sent to and expected from the tools. LLMs don’t see the MCP specs, and get information from the toolchain in prompts in formats that are model-specific and unrelated to MCP that tell them what tools they can request calls be made to and what they can expect back.
> an LLM interacting with other tools via MCP still needs system prompts or fine tuning to do so. Both of those things are not predictable or deterministic. They will fail at some point in the future. That is indisputable.
That's not, contrary to your description, a point of contention.
The point of contention is that the validation of data returned by an MCP server against the schema provided by the server is not predictable or deterministic. Confusing these two issues can only happen if you think the model does something with each response that controls whether or not the toolchain validates it, which is impossible, because the toolchain does whatever validation it is programmed to do before the model sees the data. The model has no way to know there is a response until that happens.
Now, can the model make requests that don't fit the toolchain's expectations, due to unpredictable model behavior? Sure. Can the model do dumb things with the post-validation response data after the toolchain has validated it, mapped it into the model's prompt template, and called the model with that prompt, for the same reason? Abso-fucking-lutely.
Can the model do anything to tell the toolchain not to validate response data for a tool call that the toolchain decided to make on behalf of the model, if the toolchain is programmed to validate the response data against the schema provided by the tool server? No, it can't. It can't even know that the tool was provided by an MCP server and that that might be an issue, nor can it know that the toolchain made the request, nor can it know that the toolchain received a response until the toolchain has done what it is programmed to do with the response, through the point of populating the prompt template and calling the model with the resulting prompt, by which point any validation it was programmed to do has been done and is an immutable part of history.
You are REALLY, REALLY misunderstanding how this works. Like severely.
You think MCP is being used for some other purpose despite the one it was explicitly designed for... which is just weird and silly.
>Confusing these two issues can only happen if you think the model does something with each response that controls whether or not the toolchain validates it
No, you're still just arguing against something no one is arguing, for the sake of pretending that MCP does something it literally cannot do, or that it fundamentally fixes how LLMs operate.
I promise you if you read this a month from now with a fresh pair of eyes you will see your mistake.
That will depend on what MCP client you are using and how they've handled it.
How does the AI bypass the MCP layer to make the request? The assumption is (as I understand it) the AI says “I want to make MCP request XYZ with data ABC” and it sends that off to the MCP interface which does the heavy lifting.
If the MCP interface is doing the schema checks, and tossing errors as appropriate, how is the AI routing around this interface to bypass the schema enforcement?
It doesn't. I don't know why the other commenters are pretending this step does not happen.
There is a prompt that basically tells the LLM to use the generated manifest/configuration files. The LLM still has to not hallucinate in order to properly call the tools with JRPC and properly follow MCP protocol. It then also has to make sense of the structured prompts that define the tools in the MCP manifest/configuration file.
It's system prompts all the way down. Here's a good read on some of the underlying/supporting concepts: https://huggingface.co/docs/hugs/en/guides/function-calling
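In other words, by the time the model sees them, MCP's tool definitions are just text in a prompt. A hedged sketch - the exact wording and format are toolchain- and model-specific, and the tool here is hypothetical:

```typescript
// The toolchain renders discovered MCP tools into plain text for the model.
const tools = [
  {
    name: "get_weather",
    description: "Current weather for a city",
    inputSchema: { type: "object", properties: { city: { type: "string" } } },
  },
];

const systemPrompt = [
  "You may call a tool by replying with JSON of the form",
  '{"tool": "<name>", "arguments": { ... }}. Available tools:',
  ...tools.map(t => `- ${t.name}: ${t.description} ${JSON.stringify(t.inputSchema)}`),
].join("\n");
```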
Why this fact is seemingly being lost in this thread, I have no idea, but I don't have anything nice to say about it so I won't :). Other than we're all clearly quite screwed, of course.
MCP is to make things standard for humans, with expected formats. The LLMs really couldn't give a shit and don't have anything super special about how they interact with MCP configuration files or the protocol (other than some additional fine-tuning, again, to make it less likely to get the wrong output).
No, there isn't. The model doesn't see any difference between MCP-supplied tools, tools built into the toolchain, and tools supplied by any other method. The prompt simply provides tool names, arguments, and response types to the model. The toolchain, a conventional deterministic program, reads the model response, finds things that match the model's defined format for tool calls, parses out the call names and arguments, looks up its own internal list of tools to find matching names and see if they are internal, MCP-supplied, or other tools, routes the calls appropriately, gathers responses, does any validation it is designed to do, then maps the validated results into where the model's prompt template specifies tool results should go, and calls the model again with a new message appended to the previous conversation context containing the tool results.
The MCP interface (Claude Code in this case) is doing the schema checks. Claude Code will refuse to provide the result to the LLM if it does not pass the schema check, and the LLM has no control over that.
Which is completely fucking irrelevant to what everyone else is discussing.
> Which is completely fucking irrelevant to what everyone else is discussing.
Not sure what you think is going on, but that is literally the question this subthread is debating, starting with an exchange in which the salient claims were:
From: https://news.ycombinator.com/item?id=44849695
> Claude Code validated the response against the schema and did not pass the response to the LLM.
From: https://news.ycombinator.com/item?id=44850894
> This time.
> Can you guarantee it will validate it every time ?
I can't guarantee that behavior will remain the same, any more than for any other software. But all this happens before the LLM is even involved.
> The whole point of it is, whichever LLM you're using is already too dumb to not trip when lacing its own shoes. Why you'd trust it to reliably and properly parse input badly described by a terrible format is beyond me.
You are describing why MCP supports JSON Schema. It requires parsing & validating the input using deterministic software, not LLMs.
No. It is not. You are still misunderstanding how this works. It is "choosing" to pass this to a validator or some other tool, _for now_. As a matter of pure statistics, it will simply not do this at some point in the future on some run.
It is inevitable.
Or write a simple MCP server and a client that uses it. FastMCP is easy: https://gofastmcp.com/getting-started/quickstart
You are quite wrong. The LLM "chooses" to use a tool, but the input (provided by the LLM) is validated with JSON Schema by the server, and the output is validated by the client (Claude Code). The output is not provided back to the LLM if it does not comply with the JSON Schema, instead an error is surfaced.
I think the others are trying to point out that statistically speaking, in at least one run the LLM might do something other than choose to use the correct tool. i.e 1 out of (say) 1 million runs it might do something else
The question is whether, having observed Claude Code validating a tool response before handing the response back to the LLM, you can count on that validation on future calls - not whether you can count on the LLM calling a tool in a similar situation.
>The LLM "chooses" to use a tool
Take a minute to just repeat this a few times.
LLMs cannot decide to skip this validation. They can only decide not to call the tool.
So is your criticism that MCP doesn't specify if and when tools are called? If so then you are essentially asking for a massive expansion of MCP's scope to turn it into an orchestration or workflow platform.
No, it's not. The validation happens at the frontend before the LLM sees the response. There is no way for the LLM to choose anything about what happens.
The cool thing about having coded a basic ReAct pattern implementation (before MCP, or even models trained on any specific prompt format for tool calls, was a thing, but none of that impacts the basic pattern) is that it gives a pretty visceral understanding of what is going on here, and all that's changed since is per model standardization of prompt and response patterns on the frontend<->LLM side and, with MCP, of the protocol for interacting on the frontend<->tool side.
It is absolutely possible to do this, and to generate client code which complies with ISO-8601 in JS/TS. Large amounts of financial services would not work if this was not the case.
See the C# support for ISO-8601 strings: https://learn.microsoft.com/en-us/dotnet/standard/base-types...
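And on the JS/TS side of that claim, a minimal sketch using only built-ins (the helper name is mine):

```typescript
// ECMAScript specifies ISO-8601 parsing for Date, and toISOString always
// emits ISO-8601 (UTC, millisecond precision).
function parseIso(s: string): Date {
  const d = new Date(s);
  // Guard: new Date("garbage") yields an Invalid Date rather than throwing,
  // and non-ISO strings parse in implementation-defined ways.
  if (Number.isNaN(d.getTime())) throw new Error(`Not a parseable timestamp: ${s}`);
  return d;
}

console.log(parseIso("2024-06-01T12:30:00Z").toISOString()); // 2024-06-01T12:30:00.000Z
```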
`DateTime` is not an ISO-8601 type. It can _parse_ an ISO-8601 formatted string.
And even past that, there are Windows-specific idiosyncrasies with how the `DateTime` class implements the parsing of these strings and how it stores the resulting value.
This is exactly the point: a string is just a data interchange format in the context of a DateTime, and C# provides (as far as I can tell) a complete way of accessing the ISO-8601 specification on the language object. It also supports type-safe generation of clients and client object (or struct) generation from the ISO-8601 string format.
> And even past that, there are Windows-specific idiosyncrasies with how the `DateTime` class implements the parsing of these strings and how it stores the resulting value.
Not really. The Windows statements in the article (and I use this on Linux for financial-services software) relate to automated settings of the preferences for generated strings. All of these may be set within the code itself.
That was based on decades of experience in .NET and Windows. Not the article ;).
Related but distinct from serialization.
The merchants of complexity are disappointed. It turns out that even machines don't care for 'machine-readable' formats; even the machines prefer human-readable formats.
The only entities on this planet who appreciate so-called 'machine-readability' are bureaucrats; and they like it for the same reason that they like enterprise acronyms... Literally the opposite of readability.
Already happening.
https://www.infosecurity-magazine.com/news/atlassian-ai-agen...
LLMs are basically automating PEBKAC
this is like saying "HTTP doesn't do json validation", which, well, yeah.
May have changed, but unlikely. I worked with medical telemetry as a young man and it was impressed upon me thoroughly how important parsing timestamps correctly was. I have a faint memory, possibly false, of this being the first time I wrote unit tests (and without the benefit of a test framework).
We even accounted for lack of NTP by recalculating times off of the timestamps in their message headers.
And the reasons I was given were incident review as well as malpractice cases. A drug administered three seconds before a heart attack starts is a very different situation than one administered eight seconds after the patient crashed. We saw recently with the British Post Office how lives can be ruined by bad data, and in medical data a minute is a world of difference.
I also work in healthcare, and we've seen HL7v2 messages with impossible timestamps. (E.g., in the spring-forward gap.)
As RPC mechanisms go, HTTP is notable for how few of the classic blunders they made in 1.0 of the spec. Clock skew correction is just my favorite. Technically it exists for cache directives, but it’s invaluable for coordination across machines. There are reasons HTTP 2.0 waited decades to happen. It just mostly worked.
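A rough sketch of that trick (the midpoint assumption is the usual hand-wave):

```typescript
// HTTP's Date response header (there for cache math) doubles as a crude
// server-time probe for correcting clock skew between machines.
async function estimateSkewMs(url: string): Promise<number> {
  const before = Date.now();
  const res = await fetch(url, { method: "HEAD" });
  const after = Date.now();
  const serverTime = new Date(res.headers.get("date") ?? "").getTime();
  // Assume the server stamped Date roughly mid-request.
  return serverTime - (before + after) / 2;
}
```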
If we can get an internal, sensitive-data-handling agent to ingest a crafted prompt, either via direct prompt injection against a more abstract “parent” agent, or by tainting an input file/URL it’s told to process, we can plant what I have internally coined an “unfolding injection.”
The injection works like a parasitic goal, it doesn’t just trick one agent, it rewrites the downstream intent. As the orchestrator routes tasks to other agents, each one treats the tainted instructions as legitimate and works toward fulfilling them.
Because many orchestrations re-summarize, re-plan, or synthesize goals between steps, the malicious instructions can actually gain fidelity as they propagate. By the time they reach a sensitive action (exfiltration, privilege escalation, external calls), there’s no trace of the original “weird” wording, just a confidently stated, fully-integrated sub-goal.
It’s essentially a supply-chain attack on the orchestration layer: you compromise one node in the agent network, and the rest “help” you without realizing it. Without explicit provenance tracking and policy enforcement between agents, this kind of unfolding injection is almost trivial to pull off, and we've been able to compromise entire environments based on the information the agentic system provided us, or just gave us either a bind or reverse shell in the case it has cli access and ability to figure out its own network constraints.
SSRF has been making a HUGE return in agentic systems, and I'm sad DEF CON and Black Hat didn't really have many talks on this subject this year, because it is a currently evolving security domain and an entirely new method of exploitation. The entire point of agentic systems is non-determinism, but that also makes them a security nightmare. As a researcher, though, this is basically a gold mine of all sorts of new vulnerabilities we'll be seeing. If you work as a bug-bounty hunter and see a new listing for an AI company, I can almost assuredly say you can get a pretty massive payout just by exploiting the innate trust between agents and the internal tools they are leveraging. Even if you don't have the architecture docs of the agentic system, you can likely prompt-inject the initial task enough to taint the downstream agents: have them list out the orchestration flow by creatively adjusting your prompt for different types of orchestration, account for how the company might be doing prompt engineering on each agent's persona and task, have that reported back to the parent agent, and exploit the limited input validation between them.
When desktop OSes came out, hardware resources were scarce, so all the desktop OSes (DOS, Windows, MacOS) forgot all the lessons from Unix: multi-user support, preemptive multitasking, etc. 10 years later PC hardware was faster than workstations from the 90s, yet we're still stuck with OSes riddled with limitations that stopped making sense in the 80s.
When smartphones came out there was this gold rush and hardware resources were scarce so OSes (iOS, Android) again forgot all the lessons. 10 years later mobile hardware was faster than desktop hardware from the 00s. We're still stuck with mistakes from the 00s.
AI basically does the same thing. It's all led by very bright 20 and 30 year olds who weren't even born when Windows was first released.
Our field is doomed under a Cascade of Attention-Deficit Teenagers: https://www.jwz.org/doc/cadt.html (copy paste the link).
It's all gold rushes, and nobody does Dutch urban infrastructure design over decades. Which makes sense, as this is all driven by the US, where long-term planning is anathema.
Of course this keeps happening.
To the JS mindset, an LLM that can be shown to be useful 80% of the time is fine, and the remaining 20% can be resolved once we're being paid for the rest, Pareto principle be damned.
Mostly, no. Whether it's the client sending (statically) bad data or the server returning (statically) bad data, schema validation on the receiving end (assuming the bad data somehow gets past the toolchain on the sending end) should reject it before it reaches the custom code of the MCP server or MCP client.
For arguments that are the right type but wrong because of the state of the universe, yes, the server receiving it should send a useful error message back to the client. But that's a different issue.
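Concretely, a minimal sketch of what "reject it before it gets to the custom code" looks like with the widely used jsonschema package; the tool schema here is made up for illustration:

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

# Hypothetical inputSchema for a tool, as an MCP server would advertise it.
GET_VITALS = {
    "type": "object",
    "properties": {
        "patient_id": {"type": "string"},
        "taken_at": {"type": "string", "format": "date-time"},
    },
    "required": ["patient_id", "taken_at"],
    "additionalProperties": False,
}

validator = Draft202012Validator(GET_VITALS)

def call_tool(args: dict) -> dict:
    errors = list(validator.iter_errors(args))
    if errors:
        # Fail cleanly with a machine-readable error instead of letting
        # the model improvise around malformed input.
        raise ValueError("; ".join(e.message for e in errors))
    return {"ok": True}

call_tool({"patient_id": "1234", "taken_at": 1723581000})  # raises: wrong type
```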
At some point we have to decide as a community of engineers that we have to stop building tools that are little more than loaded shotguns pointed at our own feet.
GIEMGO: garbage in, even more garbage out.
The stuff about type validation is incorrect. You don't need client-side validation. You shouldn't be using APIs you don't trust as tools, and you can always instruct the LLM to convert its output to a different format.
MCP is not the issue. The issue is that people are using the wrong tools or their prompts are bad.
If you don't like the format of an MCP tool and don't want to give formatting instructions to the LLM, you can always create your own MCP service which outputs data in the correct format. You don't need the coercion to happen on the client side.
CORBA did pretty much everything wrong, which makes it a great anti-example. Automatic client generation? Fuck that.
Ironically, it's achieved this - but that's an indictment of USB-C, not an accomplishment of MCP. Just like USB-C, MCP is a nigh-universal connector with very poorly enforced standards for what actually goes across it. MCP's inconsistent JSON parsing and lack of protocol standardization are closely analogous to USB-C's proliferation of cable types (https://en.wikipedia.org/wiki/USB-C#Cable_types); the superficial interoperability is a very leaky abstraction over a much more complicated reality, which IMO is worse than just having explicitly different APIs/protocols.
Previously, you could reasonably expect a USB-C port on an Apple Silicon desktop or laptop to be USB4 40Gbps Thunderbolt, capable of anything and everything you might want to use it for.
Now, some of them are USB3 10Gbps. Which ones? Gotta look at the specs or tiny icons, I guess?
Apple could have chosen to have the self-documenting USB-A ports to signify the 10Gbps limitation of some of these ports (conveniently, USB-A is limited to exactly 10Gbps, making it perfect for the use-case of having a few extra "low-speed" ports at very little manufacturing cost), but instead, they've decided to further dilute the USB-C brand. Pure innovation!
With the end user likely still having to use USB-C to USB-A adapters anyway, because the majority of thumb drives, keyboards, and mice still require a USB-A port, even the ones that use USB-C on the keyboard/mouse end. (But, of course, that's all irrelevant, because you can always spend 2x+ as much for a USB-C version of any of these devices; and the fact that the USB-C variants are less common or inferior to USB-A is of course irrelevant when hype and fanaticism are more important than utility and usability.)
Unfortunately, no one understood SOAP back then.
(Additional context: Maintaining a legacy SOAP system. I have nothing good to say about SOAP and it should serve as a role model for no one)
It doesn't take very long for people to start romanticizing things as soon as they're not in vogue. Even when the painfulness is still fresh in memory, people lament over how stupid new stuff is. Well I'm not a fan of schemaless JSON APIs (I'm one of those weird people that likes protobufs and capnp much more) but I will take 50 years of schemaless JSON API work over a month of dealing with SOAP again.
"xml is like violence, if it's not working just use more!"
No.
SOAP uses that, but SOAP involves a whole lot of spec about how you do that, and that's even before (as the article seems to) treating SOAP as meaning SOAP plus the set of WS-* standards built around it.
Unfortunately as usual when a new technology cycle comes, everything gets thrown away, including the good parts.
And I actually like XML-based technologies. XML Schema is still unparalleled in its ability to compose and verify the format of multiple document types. But man, SOAP was such a beast for no real reason.
Instead of a simple spec for remote calls, it turned into a spec that described everything and nothing at the same time. SOAP supported all kinds of transport protocols (SOAP over email? Sure!), RPC with remote handles (like CORBA), regular RPC, self-describing RPC (UDDI!), etc. And nothing worked out of the box, because the nitty-gritty details of authentication, caching, HTTP response code interoperability and other "boring" stuff were just left as an exercise to the reader.
Part of this is the nature of XML. There's a million ways to do things. Should some data be parsed as an attribute of the tag, or should it be another tag? Perhaps the data should be in the body between the tags? HTML, which shares XML's SGML lineage, has this problem; e.g., you can seriously specify <font face="Arial">text</font> rather than have the font as a property of the wrapping tag. There's a million ways to specify everything and anything, and that's why it makes a terrible data interchange format. The reader and writer must have the exact same schema in mind, and there's no way to have a default when there's simply no one correct way to do things in XML. So everything had to be very, very precisely specified, to the point that it added huge amounts of work where a non-XML format with decent defaults would not.
This became a huge problem for SOAP, and it's why I hate it. Every implementation had different default ways of handling even the simplest data structures passing between them, and they were never compatible unless you took weeks to specify the schema down to a fine-grained level.
In general, XML is problematic due to the lack of clear canonical ways of doing pretty much anything. You might say "but I can specify it with a schema," and to that I say: my problem with XML is that you need a schema for even the simplest use case in the first place.
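To make the "million ways" point concrete, here are three equally plausible XML encodings a peer might legitimately send for the same record, next to the one obvious JSON shape (all hypothetical, of course):

```python
# Without an agreed schema, any of these is a "valid" way to say the same thing:
xml_variants = [
    '<user name="alice" age="30"/>',
    "<user><name>alice</name><age>30</age></user>",
    '<user name="alice"><age unit="years">30</age></user>',
]

# JSON's defaults effectively force one shape, so reader and writer
# agree without weeks of schema negotiation:
json_form = '{"name": "alice", "age": 30}'
```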
But parts of XML infrastructure were awesome. I could define a schema for the data types, and have my IDE auto-complete and validate the XML documents as I typed them. I could also validate the input/output data and provide meaningful errors.
And yeah, I also worked with XML and got burned many times by small incompatibilities that always happen due to its inherent complexity. If XML were just a _bit_ simpler, it could have worked so much better.
Generally it worked very well when both ends were written in the same programming language and was horseshit if they weren’t. No wonder Microsoft liked SOAP so much.
IBM thought they were good at lockin, until Bill Gates came along.
I've been on the other side of high-feature serialization protocols, and even at large tech companies, something like migrating to gRPC is a multi-year slog that can even fail a couple of times because it asks so much of you.
MCP, at its core, is a standardization of a JSON API contract, so you don't have to do as much post-training to generate various tool calling style tokens for your LLM.
I think you meant that is why JSON won instead of XML?
Not just XML, but a lot of other serialization formats and standards, like SOAP, protobuf in many cases, yaml, REST, etc.
People say REST won, but tell me how many places actually implement REST or just use it as a stand-in term for casual JSON blobs to HTTP URLs?
Now, YAML has quite a few shortcomings compared to JSON (if you don't believe me, look at its handling of the string no, discussed on HN), so, at least to me, it's obvious why JSON won.
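For anyone who hasn't seen the "Norway problem": YAML 1.1 parsers such as PyYAML resolve a bare no (along with yes/on/off) as a boolean, so a country code silently becomes False:

```python
import yaml  # PyYAML follows YAML 1.1 scalar resolution

print(yaml.safe_load("country: no"))    # {'country': False} -- surprise!
print(yaml.safe_load("country: 'no'"))  # {'country': 'no'}  -- quoting fixes it
```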
SOAP, don't get me started on that; it's worth less than XML. Protobuf is more efficient but less portable, etc.
That's backwards reasoning. XML was too complicated, so they decided on a simpler JSON.
And the JSON ecosystem's complexity and size are now rivaling the specs of the good old XML-infused times.
Didn't get that job; one of the interviewers asked me to write concurrent code and didn't like my answer, but his had a race condition in it and I was unsuccessful in convincing him he was wrong. He was relying on preemption not occurring on a certain instruction (or on multiprocessing not happening). During my tenure at the job I did take, the real flaws in the Java Memory Model came out, and his answer became very wrong and mine only slightly.
So it baked in core assumptions that the network is transparent, reliable, and symmetric: you could create an object on one machine, pass a reference to it to another machine, and everything was supposed to just work.
Which is not what happens in the real world, with timeouts, retries, congested networks, and crashing computers.
Oh, and the CORBA C++ bindings had been designed before the STL was standardized, so they are a crawling horror; other languages' bindings were better.
On a more general note, I see in many critical comments here what I perceive to be a category error. Using JSON to pass data between web client and server, even in more complex web apps, is not the same thing as supporting two-way communications between autonomous software entities that are tasked to do something, perhaps something critical. There could be millions of these exchanges in some arbitrarily short time period, so any possibility of error is multiplied accordingly, and the effect of any error can cascade if the system does not fail early. I really don't believe this is a case where "worse is better." To use an analogy: yes, everyday English is a versatile language that works great for most use cases; but when you really need to nail things down, with no tolerance for ambiguity, you get legalese or some other jargon. Or CORBA, or gRPC, etc.
If only that were true. Litigation happens every single day over the meanings of contracts and laws that were drafted by well-trained and experienced attorneys.
Comparatively, programming languages are very constrained. The environments in which they are interpreted and executed are far better understood than any human courtroom.
Your point is an interesting one but it’s painting with too broad a brush.
- SOAP - interop required supporting document-style or RPC-style messaging between systems, or a combination; XML and schemas are also horribly verbose.
- CORBA - the libraries and frameworks were complex; modern languages at the time avoided them in favor of simpler standards (e.g. Java's Jini).
- gRPC - designed for speed, not readability; requires mappings.
It's telling that these days REST and JSON (via req/resp, webhooks, or even streaming) are the modern backbone of RPC. The above standards are either shoved aside or, in gRPC's case, used only where extreme throughput is needed.
Since REST and JSON are the plat du jour, MCP probably aligns with that design paradigm rather than the dated legacy protocols.
No, they're the medium of the web.
The author is specifically addressing enterprise integration into business workflows - not showing stuff in a browser.
It seems to be a game of catch-up for most things AI. That said, my school of thought is that certain technologies are just too big to be figured out early on (web frameworks, blockchain, ...); the gap starts to shrink eventually. With AI, we'll just have to keep sharing ideas and caution like you have here. Such very interesting times we live in.
This is really obvious when they talk about tracing and monitoring, which seem to be the main points of criticism anyway.
They bemoan that they can't trace across MCP calls, assuming somehow there would be a person administering all the MCPs. Of course each system has tracing in whatever fashion fits that system. They are just not the same system, nor owned by the same people, let alone the same companies.
Same with monitoring cost. Oh, you can't know who racked up the LLM costs? Of course you can; these systems are already in place and there are a million ways to do this. It has nothing to do with MCP.
Reading this, I think it's rather a blessing to start fresh and without the "learnings" of 40 years of failed protocols or whatever.
1. Lessons.
2. Fairly sure all of Google is built on top of protobuf.
You'd still need basically the entire existing MCP spec to cover the use cases if it replaced JSON-RPC with Swagger or protobuf, plus additional material to cover the gaps and complications that that switch would involve.
It's amusing to watch people refer to MCP as a set of tools, or a framework, or an SDK you can invoke, or something or other, across a wide range of forums. It's just a standard. A convention. Calling it a protocol is a stretch as well. There's no meat to it, really.
If you just used REST APIs, you'd need to create little "tools" (say, another executable) locally that the LLM can invoke to call those APIs. MCP standardizes how those tools should act and their overall lifecycle model.
The references to it being like USB are also quite frankly absurd and delusional.
But that's the caliber of developer we're dealing with today.
I agree that Swagger leaves a lot unplanned. I disagree about the local use case, because (1) we could just run local HTTP servers easily, and (2) I frankly assume the future of MCP is mostly remote.
Returning to JSON-RPC: it's a poorly executed RPC protocol. Here is an excellent Hacker News thread on it, but the TL;DR is that parsing JSON is expensive and complex; we have tons of tools (e.g. load balancers) that make up modern services, and making those tools parse JSON is very expensive. Many people in the thread below mention alternative ways to implement JSON-RPC, but those depend on new clients.
https://news.ycombinator.com/item?id=34211796
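Part of the infrastructure complaint is visible just from the envelope: every call is a POST to one endpoint, so a load balancer has to deserialize the body to learn what's being invoked. A representative request, shown as Python data (tools/call and its params shape are from the MCP spec; the tool itself is made up):

```python
import json

# Everything a router might want to dispatch on (the method, the target
# tool) lives inside the body, not in the URL or a header.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Oslo"}},
}
body = json.dumps(request)
# An HTTP load balancer sees only "POST /mcp"; it must parse `body`
# to route by method, unlike REST where the verb and path carry intent.
```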
MCP doesn't support streaming partial results; I know this because I wish it did. You can approximate streaming responses by using progress notifications. If you want something like the LLM's partial response streaming, you'll have to extend MCP with custom capability flags. It's totally possible to extend it in this way, but then it's non-standard.
Perhaps you are alluding to the fact that it's bidirectional protocol (by spec at least).
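For reference, the progress-notification shape the spec does give you looks roughly like this (field names are from memory of the 2025 spec revisions, so double-check before relying on them):

```python
# Sent by the server while a long tool call runs; the client correlates
# it via the progressToken the caller supplied in the original request.
progress_notification = {
    "jsonrpc": "2.0",
    "method": "notifications/progress",
    "params": {
        "progressToken": "abc-123",
        "progress": 50,
        "total": 100,
        "message": "halfway through the export",  # added in a later revision
    },
}
```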
Sure, they might still find themselves in highly regulated industries where risk avoidance trumps innovation every day, all day.
MCP is for _the web_; it started with stdio only because Anthropic was learning lessons from building Claude Code.
Author also seems to expect that the result from MCP tool usage will feed directly to an LLM. This is preposterous and a recipe for disaster. Obviously you'd validate the structured response against a schema, check for harmful content, etc. etc.
> Author also seems to expect that the result from MCP tool usage will feed directly to an LLM
Isn't this exactly what MCP is for? Most tools I've come across are to feed context from other sources directly to the LLM. I believe this is the most common use-case for the protocol.
The greater problem is industry misunderstanding and misalignment with what agents are and where they are headed.
Web platforms of the world believe agents will be embedded in networked distributed infrastructure. So we should ship an MCP platform in our service mesh for all of the agents running in containers to connect to.
I think this is wrong, and continues to be butchered as the web pushes a hard narrative that we need to enable web-native agents & their sdks/frameworks that deploy agents as conventional server applications. These are not agents nor the early evolutionary form of them.
Frontier labs will be the only providers of the actual agentic harnesses. And we are rapidly moving to computer-use agents. MCP servers were intended to serve as single-instance deployments for single harnesses, i.e., a single MCP server on my desktop for my Claude Desktop.
> In financial services, this means a trading AI could misinterpret numerical types and execute trades with the wrong decimal precision.
If you are letting an LLM execute trades with no guardrails then it is a ticking time bomb no matter what protocol you use for the tool calls.
> When an AI tool expects an ISO-8601 timestamp but receives a Unix epoch, the model might hallucinate dates rather than failing cleanly.
If your process breaks because of a hallucinated date -- don't use an LLM for it.
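"Failing cleanly" is cheap to do at the boundary, for what it's worth. A sketch of a normalizer that accepts either representation and refuses everything else (the function name and sanity bound are my own):

```python
from datetime import datetime, timezone

def parse_timestamp(value) -> datetime:
    """Accept an ISO-8601 string or a Unix epoch; raise on anything else.

    The point is to fail loudly *before* the model sees the value,
    rather than letting it guess what 1723581000 means.
    """
    if isinstance(value, (int, float)):
        if not (0 < value < 4102444800):  # sanity bound: before year 2100
            raise ValueError(f"implausible epoch: {value!r}")
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        return datetime.fromisoformat(value)  # raises ValueError if malformed
    raise TypeError(f"unsupported timestamp type: {type(value).__name__}")

print(parse_timestamp("2025-08-13T19:50:00+00:00"))
print(parse_timestamp(1723581000))
```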
Also Erlang uses RPCs for pretty much all "synchronous" interactions but it's pretty minimal in terms of ceremony. Seems pretty reliable.
So this is a serious question, because hand-rolling "40 years" of best practices seems hard: what should we be using for RPC?
Multics vs Unix, XML-based SOAP vs JSON-based REST APIs, XHTML's failure, JavaScript itself... I could keep going.
So I’ve resigned myself to admitting that we are doomed to reimplement the “good enough” every time, and continue to apply bandaid after bandaid to gradually fix problems after we rediscover them, slowly.
https://en.m.wikipedia.org/wiki/Worse_is_better
It's been confirmed over and over since then. And I say that as someone who naturally gravitates towards "better" solutions.
In other words, many technical problems flow rather predictably from decision-making boundaries that don't internalize the externalities.
Ever heard someone say "if you care about X, run for office"? The same applies to technology. If one cares about good designs, one must promote organizational and societal structures that actually have a fighting chance at bringing those about.
The days of nerds and hackers not caring about broader dynamics and structures are long gone. Sitting back and letting the business folks have control is fine if you want them to optimize for the existing incentives. But if you want to change the rules of the game, you gotta jump in at the deep end.
The world we could have lived in... working web forms validations, working microdata...
Point-by-point for the article's gripes:
- distributed tracing/telemetry - open discussion at https://github.com/modelcontextprotocol/modelcontextprotocol...
- structured tool annotation for parallelizability/side-effects/idempotence - this actually already exists at https://modelcontextprotocol.io/specification/2025-06-18/sch... but it's not well documented in https://modelcontextprotocol.io/specification/2025-06-18/ser... - someone should contribute to improving this! (See the sketch after this list.)
- a standardized way in which the costs associated with an MCP tool call can be communicated to the MCP Client and reported to central tracking - nothing here I see, but it's a really good idea!
- serialization issues e.g. "the server might report a date in a format unexpected by the client" - this isn't wrong, but since the consumer of most tool responses is itself an LLM, there's a fair amount of mitigation here. And in theory an MCP Client can use an LLM to detect under-specified/ambiguous tool specifications, and could surface these issues to the integrator.
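On the annotations point above: a tool declaration carrying those hints looks roughly like this. The hint names are from the linked 2025-06-18 schema as I remember it, so verify against the schema itself; the tool is invented for illustration:

```python
# A tool as returned from tools/list, with behavior hints attached.
tool = {
    "name": "delete_record",
    "description": "Delete a record by id.",
    "inputSchema": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
        "required": ["id"],
    },
    "annotations": {
        "readOnlyHint": False,    # mutates state
        "destructiveHint": True,  # ...irreversibly
        "idempotentHint": True,   # safe to retry with the same id
        "openWorldHint": False,   # doesn't reach beyond its own backend
    },
}
```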
Now, I can't speak to the speed at which Maintainers and Core Maintainers are keeping up with the community's momentum - but I think it's meaningful that the community has momentum for evolving the specification!
I see this post in a highly positive light: MCP shows promise because you can iterate on these kinds of structured annotations, in the context of a community that is actively developing their MCP servers. Legacy protocols aren't engaging with these problems in the same way.
https://github.com/modelcontextprotocol/modelcontextprotocol...
The Python Bindings PR is here modelcontextprotocol/rust-sdk#172
The Typescript Bindings PR is here modelcontextprotocol/rust-sdk#183
MCP Bench at https://github.com/unimcp/mcpbench
MCP started from an accessibility direction, which is why it’s catching on.
Headaches are inevitable, but don’t network effects often dominate technological superiority?
MCP is not a protocol. It doesn't protocolize anything of use. It's just "here's some symbols, do with them whatever you want," leaving it there but then advertising that as a feature of its universality. It provides about as much of a protocol as TCP does, just rebuilt on top of five OSI layers, again.
It's not a security issue; it's an ontological issue.
That being said, MCP as a protocol has a fairly simple niche: provide context that can be fed to a model to perform some task. MCP covers the discovery process around presenting those tools and resources to an agent in a standardized manner. And it includes several other aspects that are useful in this niche, things like "sampling" and "elicitations." Is it perfect? Not at all. But it's a step in the right direction.
The crowd saying "just point it at an OpenAPI service" does not seem to fully understand the current problem space. Can many LLMs extract meaning from un-curated API response messages? Sure. But they are also burning up context holding junk that isn't needed. Part of MCP is the acknowledgement that general API responses aren't the right way to feed the model the context it needs. MCP is supposed to take a concrete task, perform all the activities needed to gather the info or effect the change, then generate clean context meant for the LLM. If you design an OpenAPI service around those same goals, then it could easily be added to an agent. You'd still need to figure out all the other aspects, but you'd be close. But at that point you aren't pointing an agent at a random API; you're pointing it at a purpose-made API. And then you have to wonder: why not something like MCP that's designed for that purpose from the start?
I'll close by saying there are an enormous number of MCP servers out there that are poorly written, thin wrappers on general APIs, or bad in some other way. I attribute a lot of this to the rise of AI coding agents enabling people with poor comprehension of the space to crank out this... noise.
There are also great examples of MCP Servers to be found. They are the ones that have thoughtful designs, leverage the spec fully, and provide nice clean context for the Agent to feed to the LLM.
I can envision a future where we can simply point an agent at a series of OpenAPI services and the agent uses its models to self-assemble what we consider the MCP server today. Basically it would curate access to the APIs into a set of focused tools and the code needed to generate the final context. That's not quite where we are today. It's likely not far off though.
Otherwise the larger picture is that MCP is a land grab for building an ecosystem around integrations to get access to data. Your LLM agent is not valuable if it can't access things for you... and from a market perspective, enterprise already pays a lot for this stuff, and yes, MCP is really not thought out for enterprise at all... At least, thankfully, they added stateless connections to the spec...
- Electron disregards 40 years of best deployment practices,
- Web disregards 40 years of best GUI practices,
- Fast CPUs and lots of RAM disregard 40 years of best software optimization techniques,
there are probably many more examples.
Windows 10 is easier to use than Windows 95.
OS X is easier to use than Mac... whatever they named their old versions.
It goes on and on. I can have 50 browser tabs open at the same time, each one hosting a highly complicated app, ranging from media playback to chat rooms to custom statistical calculators. I don't need to install anything for any of these apps, I just type in a short string in my url bar. And they all just work, at the same time.
Things are in fact better now.
The supercomputers in our pockets (that used to be telephones, but don't work well for that anymore) will let us run the programs that one of two companies allow us to run, which will run most apps... as long as the hardware is as recent as our laptops/desktops.
Yes, we're very advanced. In the past 20 years, we have achieved... the same thing we had 20 years ago... only with more hardware requirements, programming languages, and frameworks. Today you can do anything... as long as it's on a web page, on recent hardware (and God help you if you haven't updated your software in the past month)
Things are also getting better now that Intel is dying. I mean, the new Apple silicon chips are astoundingly fast and energy efficient, an M1 from 5 years ago is still going strong and probably won't truly need replacing for another 2. Similar for Ryzen chips from 5 years ago!
Things have changed a lot in 20 years. In 2005 we didn't consume all of our video/audio media online. We didn't have social media, just blogs and RSS readers. YouTube had just been released. TikTok and Twitter didn't exist, and Facebook was still a college-only site. Hypermedia today is very rich and demands a lot of resources. But at the same time, most work over the past 10 years has gone into native apps (on mobile particularly, but also PCs), not web sites. Most people don't use the web browser as much.
Actually, MCP uses a normative TypeScript schema (and, from that, an autogenerated JSON Schema) for the protocol itself, and the individual tool calls also are specified with JSON Schema.
> Type validation happens at runtime, if at all.
That's not a consequence of MCP "opting for schemaless JSON" (which it factually does not); for tool calls, it's a consequence of MCP being a discovery protocol where the tools, and thus the applicable schemas, are discovered at runtime.
If you are using MCP as a way to wire up highly static components, you can do discovery against the servers once they are wired up, statically build the clients around the defined types, and build your toolchain to raise errors if the discovery responses change in the future. But that's not really the world MCP is built for. Yes, that means that the toolchain, if it is concerned about schema enforcement, needs to use and apply the relevant schemas at runtime. So, um, do that?
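"So, um, do that" might look like the following: discovery once at wiring time, strict validation on every call thereafter. The session object is hypothetical (stand-in for whatever your MCP client SDK provides); the tools/list result and inputSchema field are from the spec:

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

def build_validators(session) -> dict:
    """Run discovery once and pin each tool's advertised schema."""
    return {
        tool["name"]: Draft202012Validator(tool["inputSchema"])
        for tool in session.list_tools()  # stands in for a tools/list request
    }

def call_tool(session, validators: dict, name: str, args: dict):
    if name not in validators:
        raise KeyError(f"undiscovered tool: {name}")  # schema drift = hard error
    errors = list(validators[name].iter_errors(args))
    if errors:
        raise ValueError(f"{name}: " + "; ".join(e.message for e in errors))
    return session.call_tool(name, args)
```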
The ISO8601 v Unix epoch example seems very weak to me. I'd certainly expect any model to be capable of distinguishing between these things, so, it doesn't seem like a big deal that either one would be allowed in a JSON.
Honestly, my view that nothing of value ever gets published on medium, is strongly reinforced here.
But why did the designers make that choice when they had any of half a dozen other RPC protocols to choose from?
> The ISO8601 v Unix epoch example seems very weak to me. I'd certainly expect any model to be capable of distinguishing between these things
What about the medical records issue? How is the model to distinguish a weight in kgs from one in pounds?
Wouldn't medical records actually be better in JSON, where the field could expressly have a "kg" or "lb" suffix within the value itself, or even in the name of the field, like "weight-in-kg" or "weight-in-lb"? This is actually the beauty of JSON compared to other formats where these things may end up as just a unitless integer.
The biggest problem with medical data would probably remain the human factor, where regardless of the format used by the machines and by MCP, the underlying data may already be incorrect or not coded properly, so, if anything, AI would likely have a better chance of interpreting the data correctly than the API provider blindly mislabelling unitless data.
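A sketch of the unit-tagged shape being suggested, plus the trivial normalization it enables (the field names and conversion factor are illustrative):

```python
POUNDS_PER_KG = 2.20462

def weight_in_kg(record: dict) -> float:
    """Normalize an explicitly unit-tagged weight to kilograms.

    Refuses unitless values outright; a bare 154 is exactly the
    ambiguity the explicit tag exists to prevent.
    """
    value, unit = record["value"], record.get("unit")
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return float(value) / POUNDS_PER_KG
    raise ValueError(f"unknown or missing unit: {unit!r}")

print(weight_in_kg({"value": 154, "unit": "lb"}))  # ~69.9 kg
print(weight_in_kg({"value": 70, "unit": "kg"}))   # 70.0
```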
On that note: some of these "best practices" arguably haven't worked out. "Be conservative in what you send, liberal in what you receive" has turned even decent protocols into a dumpster fire, so why keep the charade going?
Failed protocols such as TCP adopted Postel's law as a guiding principle, and we all know how that worked out!
WSDL is just pure nonsense. The idea that software would need to decide on its own which API endpoints it needs is just profoundly misguided... Literally nobody and nothing ever reads the WSDL definitions; it's a poor man's documentation, at best.
LLMs only reinforce the idea that WSDL is a dumb idea because it turns out that even the machines don't care for your 'machine-friendly' format and actually prefer human-friendly formats.
Once you have an MCP tool working with a specific JSON API, it will keep working unless the server makes breaking changes to the API while in production, which is terrible practice. But anyway, if you use a server, it means you trust the server. Client-side validation is dumb; like people who need to put tape over their mouths because they don't trust themselves to follow through on their diet plans.
WSDLs being available from the servers allows (a) clients to validate the requests they make before sending them to the server, and (b) developers (or in principle even AI) with access to the server to create a client without needing further out-of-band specifications.
In theory. In reality, Java could talk to Java, M$ stuff could talk to other M$ stuff, and pretty much everyone else was left out in the cold. Consistent cross-language interop never actually happened despite the claims that it would.
I don't buy this idea that code should be generated automatically without a human involved (at least as a reviewer).
I also don't buy the idea that clients should validate their requests before sending to the server. The client's code should trust itself. I object to any idea of code (or any entity) not trusting itself. That is a flawed trust model.