jackfranklyn 2 days ago [-]
We've been exposing tools via MCP and the biggest lesson so far: the tool description is basically a meta tag. It's the only thing the model reads before deciding whether to call your tool.
Two things that surprised us: (1) being explicit about what the tool doesn't do matters as much as what it does - vague descriptions get hallucinated calls constantly, and (2) inline examples in the description beat external documentation every time. The agent won't browse to your docs page.
The schema side matters too - clean parameter names, sensible defaults, clear required vs optional. It's basically UX design for machines rather than humans. Different models do have different calling patterns (Claude is more conservative, will ask before guessing; others just fire and hope) so your descriptions need to work for both styles.
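Concretely, a tool definition that follows those points might look like this (a hypothetical MCP-style schema; the tool name, fields, and wording are invented for illustration, not from any real server):

```python
# Hypothetical MCP-style tool definition: explicit scope, a negative
# constraint, an inline example, and a schema with clean names,
# a sensible default, and a clear required/optional split.
search_orders_tool = {
    "name": "search_orders",
    "description": (
        "Search customer orders by email or order ID. "
        "Does NOT create, modify, or refund orders. "
        'Example: search_orders(customer_email="a@example.com", limit=5) '
        "returns the 5 most recent orders for that customer."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_email": {"type": "string", "description": "Exact email to match"},
            "order_id": {"type": "string", "description": "Exact order ID, e.g. ORD-1042"},
            "limit": {"type": "integer", "default": 10, "description": "Max results"},
        },
        "required": [],  # either customer_email or order_id; validated server-side
    },
}
```

The "Does NOT" sentence and the inline example are doing most of the work here: they give the model the boundary and the calling pattern in one read.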
zahlman 2 days ago [-]
> inline examples in the description beat external documentation every time. The agent won't browse to your docs page.
That seems... surprising, and if necessary something that could easily be corrected on the harness side.
> The schema side matters too - clean parameter names, sensible defaults, clear required vs optional. It's basically UX design for machines rather than humans.
I don't follow. Wouldn't you do all those things to design for humans anyway?
dmpyatyi 1 day ago [-]
*Clean parameter names, sensible defaults, clear required vs optional. It's basically UX design for machines rather than humans.*
But these are the same points you should follow when writing human-readable docs (as zahlman said above), aren't they?
MidasTools 22 hours ago [-]
Experimented with this a lot while building agent tooling.
The short answer: yes, description wording measurably matters, and it's more sensitive than most people expect.
A few things we've found that move the needle:
1. Concrete use-case beats generic label. 'Send a transactional email after payment confirmation' outperforms 'Send email' by a wide margin across models — the agent doesn't have to infer when to use it.
2. Negative constraints help. Explicitly saying 'Don't use this for bulk sends or newsletters' actually reduces hallucinated misuse, because the model has a clear mental model of the boundary.
3. Schema field names are silent prompts. A field named 'recipient_email' gets filled correctly far more than 're' or even 'to'. The model is pattern-matching on names it's seen in similar contexts.
4. Example I/O beats description prose. If you include one example input and what the output looks like, agents are significantly more likely to call the tool correctly on the first attempt.
For cross-model consistency: Claude tends to do better with structured JSON schema + examples. GPT-4 responds well to prose descriptions with analogies. Gemini seems to weight field name patterns heavily.
The broader pattern: treat your tool description like a micro few-shot prompt, not an API doc. Optimize for the model's context, not for human readers.
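To make points 1-4 concrete, here is a hypothetical before/after for the same email tool (the names, fields, and example I/O are all invented for illustration):

```python
# Generic label: the model has to guess when, and how, to use this.
generic_tool = {
    "name": "send",
    "description": "Send email",
    "parameters": {"to": {"type": "string"}, "msg": {"type": "string"}},
}

# Micro few-shot style: concrete use case, negative constraint,
# descriptive field names, and one inline input/output example.
specific_tool = {
    "name": "send_transactional_email",
    "description": (
        "Send a single transactional email after an event such as payment "
        "confirmation. Don't use this for bulk sends or newsletters. "
        'Example input: {"recipient_email": "a@example.com", '
        '"subject": "Receipt #1042", "body": "..."} -> '
        'output: {"status": "queued", "message_id": "msg_abc123"}'
    ),
    "parameters": {
        "recipient_email": {"type": "string"},
        "subject": {"type": "string"},
        "body": {"type": "string"},
    },
}
```

Same underlying API in both cases; only the selection surface differs.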
MidasTools 2 days ago [-]
From building in this space: agents choose tools based on how well they're described in context, not on brand recognition or marketing.
Practically: the agent reads your docs, README, or API description and decides if it can use your tool to solve the current problem. So the question is really "will an AI understand my tool well enough to use it correctly?"
What helps:
- Clear, literal API documentation (not marketing copy)
- Explicit input/output examples with edge cases
- A `capabilities.md` or similar that describes what the tool does and doesn't do
The irony: the skills that make tools understandable to AI (precision, literalness, examples) are the opposite of what makes them legible to humans (narrative, benefits, stories).
fenix1851 1 day ago [-]
Is there some additional tool/service/instrument that can measure this?
I mean, how do I check that my documentation changes are even working the right way?
vincentvandeth 1 day ago [-]
I run a multi-agent orchestration system where each terminal has access to skill templates. The orchestrator (T0) picks which skill to assign based on the task — so I've spent months tuning how skill descriptions affect agent behavior.
What I found: the description is the entire selection surface. The agent doesn't read your code, doesn't check your tests, doesn't browse your docs. It reads the description and decides in one pass.
Three things that actually moved the needle:
Negative boundaries work better than positive claims. "Generates reports from structured receipts. Does NOT execute code, modify files, or make API calls" gets called correctly way more often than "A powerful report generation tool."
Trigger words matter more than you'd think. I maintain explicit trigger lists per skill — specific phrases that should activate it. Without those, the agent pattern-matches on vibes and gets it wrong ~30% of the time. With explicit triggers, that drops to under 5%.
Schema is the real interface. Clean parameter names with sensible defaults beat elaborate descriptions. If your tool takes query: string vs search_query_input_text: string, the first one gets called more reliably across models.
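A toy sketch of the trigger-list idea (the skills and phrases are invented; a real system would presumably maintain these per skill template):

```python
# Toy trigger-phrase matcher: each skill lists explicit phrases that
# should activate it, instead of letting the agent pattern-match on vibes.
SKILL_TRIGGERS = {
    "report_generator": ["generate report", "expense summary", "monthly totals"],
    "receipt_parser": ["parse receipt", "extract line items", "scan invoice"],
}

def match_skills(task: str) -> list[str]:
    """Return skills whose trigger phrases appear in the task text."""
    task_lower = task.lower()
    return [
        skill
        for skill, phrases in SKILL_TRIGGERS.items()
        if any(phrase in task_lower for phrase in phrases)
    ]
```

Substring matching is crude, of course; the point is that the trigger list lives outside the model, so a miss is debuggable instead of mysterious.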
But here's the thing the "agent economy" framing gets wrong: you don't want fully autonomous tool selection. An agent choosing freely between 50 tools is like giving a junior developer admin access to everything — it'll work sometimes and break spectacularly other times. What works better is constraining the agent's scope upfront. Give it 3-5 relevant skills for the task, not your entire toolkit. Or build workflow skills that chain multiple tools in a fixed sequence — the agent handles the content, the workflow handles the routing.
The uncomfortable truth: you're not optimizing for "discovery" in the human sense. There's no brand loyalty, no trust built over time. Every single invocation is a cold start where the model reads your description and decides. That's actually freeing — it means the best-described tool wins, regardless of who built it.
wolftickets 1 day ago [-]
One thing I’ve noticed is that as my context grows, performance often degrades. So how are you battling your agents being exposed to too many descriptions? I see how this works in curated agents where you’re tending it like a garden, but not when we’re looking for organic discovery of how to accomplish a task. It feels like order matters a lot there.
vincentvandeth 1 day ago [-]
Context bloat is a real problem — and yes, order matters more than most people realize. Descriptions near the top of the tool list get preferentially selected, especially in long contexts where attention degrades.
Two things I do to fight this:
First, skill scoping per task. Instead of exposing all 20+ skills to every agent, each terminal only sees the 3-5 skills relevant to its current dispatch. The orchestrator decides which skills to load before the agent even starts. Less noise, better selection accuracy.
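A minimal sketch of what that scoping step could look like (skill names and the keyword heuristic are invented; a real orchestrator might rank relevance with an LLM or embeddings instead):

```python
# Per-task skill scoping: the orchestrator picks a small relevant subset
# before the worker agent ever sees the tool list.
ALL_SKILLS = {
    "report_generator": {"keywords": {"report", "summary"}},
    "receipt_parser": {"keywords": {"receipt", "invoice"}},
    "email_sender": {"keywords": {"email", "notify"}},
    "code_reviewer": {"keywords": {"review", "diff"}},
}

def scope_skills(task: str, max_skills: int = 5) -> list[str]:
    """Rank skills by keyword overlap with the task; return the top few."""
    words = set(task.lower().split())
    ranked = sorted(
        ALL_SKILLS,
        key=lambda s: len(ALL_SKILLS[s]["keywords"] & words),
        reverse=True,
    )
    # Keep only skills with at least one match, capped at max_skills.
    return [s for s in ranked if ALL_SKILLS[s]["keywords"] & words][:max_skills]
```

The cap matters as much as the ranking: the worker agent never has to reason about the skills that didn't make the cut.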
Second, context rotation to keep context rot from setting in. When an agent's context fills up, the system automatically writes a structured handover, clears the window, and resumes in a fresh context. This is critical because a degraded context doesn't just pick worse tools — it starts ignoring instructions entirely. A fresh context with a good handover outperforms a bloated one every time.
I'm actually testing automatic refresh at 60-70% usage right now — not waiting until the window is nearly full, but rotating early to prevent context rot before it starts. Early results suggest that's the sweet spot: late enough that you've done meaningful work, early enough that the handover quality is still high.
The organic discovery problem you're describing is essentially unsolvable with a flat tool list. The more tools you add, the worse selection gets — it's not linear degradation, it's closer to exponential once you pass ~15-20 tools in context. The only path I've found is hierarchical: a routing layer that narrows the set before the agent sees it.
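The rotation decision itself is simple to sketch (the threshold and handover fields here are illustrative, not the actual system's):

```python
# Toy model of the early-rotation decision: rotate at 60-70% of the
# window rather than waiting for the context to fill.
def should_rotate(tokens_used: int, window: int, threshold: float = 0.65) -> bool:
    """True once context usage crosses the early-rotation threshold."""
    return tokens_used / window >= threshold

def make_handover(task: str, progress: list[str]) -> dict:
    """Structured handover a fresh context resumes from (fields invented)."""
    return {
        "task": task,
        "progress": progress,
        "resume_hint": progress[-1] if progress else "",
    }
```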
vincentvandeth 26 minutes ago [-]
Update: I ended up building this into a full closed-loop pipeline. A PreToolUse hook detects context pressure at 65%, the agent writes a structured handover (task state, files, progress), tmux clears the session, and a rotator script injects the continuation into the fresh session.
The key insight from testing: rotating at 60-70% — before quality degrades — matters more than the rotation mechanism itself. At 80%+ auto-compact kicks in and races with any cleanup you try to do.
Wrote it up as a Show HN if anyone's curious: https://news.ycombinator.com/item?id=47152204
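In rough Python, the loop described above might look like this (only the PreToolUse hook name and the 65% threshold come from the comment; the callbacks are invented stand-ins for the handover writer, the tmux clear, and the rotator script):

```python
# Sketch of the closed-loop rotation pipeline: detect pressure, write a
# handover, clear the session, inject the continuation.
def pre_tool_use_hook(ctx_used, ctx_window, write_handover, clear_session, inject):
    if ctx_used / ctx_window < 0.65:
        return "continue"
    handover = write_handover()   # task state, files, progress
    clear_session()               # e.g. tmux clears the pane
    inject(handover)              # rotator injects the continuation
    return "rotated"
```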
CRIPIX seems to be a new and unusual concept. I came across it recently and noticed it’s available on Amazon. The description mentions something called the Information Sovereign Anomaly and frames the work more like a technological and cognitive investigation than a traditional book. What caught my attention is that it appears to question current AI and computational assumptions rather than promote them. Has anyone here heard about it or looked into it?
kellkell 1 day ago [-]
The "Sovereign Anomaly" Concept (2025-2026): Recent literature, such as the 2025 book CRIPIX 1: The Information Sovereign Anomaly, explores scenarios where a "superintelligent AI" encounters code it cannot process, labelling it an "out-of-model anomaly" and suggesting that owning information sovereignty allows entities to "bend reality".
dmpyatyi 1 day ago [-]
bruh
alexandroskyr 2 days ago [-]
Curious if anyone has seen differences in how models handle conflicting tool descriptions — e.g., two tools with overlapping capabilities where the boundary isn't clear. In my experience that's where most bad tool calls come from, not from missing descriptions but from ambiguous overlap between tools.
dmpyatyi 1 day ago [-]
That's actually interesting, thanks!
I wrote this post because of exactly those corner cases. If I'm building something agents would use, how do I understand which tool they'd actually choose?
For example, say you're building an API provider for image generation. There are thousands of them on the internet.
I wonder if there's a tool that would basically simulate choosing between your product/service and your competitors'.
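A sketch of what such a harness could look like (everything here is invented; the actual model call is stubbed as a `choose` callback you would wire up to Claude, Gemini, or whichever model you want to test):

```python
# Simulated tool choice: present competing descriptions, ask a model to
# pick one, and tally choices over many scenarios.
from collections import Counter

def build_choice_prompt(task: str, tools: dict[str, str]) -> str:
    """Format a task plus competing tool descriptions into one prompt."""
    lines = [f"Task: {task}", "Pick exactly one tool by name:"]
    for name, desc in tools.items():
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)

def run_simulation(tasks, tools, choose):
    """choose: prompt -> chosen tool name (in practice, an LLM call)."""
    tally = Counter()
    for task in tasks:
        tally[choose(build_choice_prompt(task, tools))] += 1
    return tally
```

Running the same description set against several models would give you exactly the per-model comparison you're describing.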
al_borland 1 day ago [-]
From the agent’s point of view, this sounds like a terrible idea. I look forward to reading about the unintended consequences.
JacobArthurs 2 days ago [-]
Tool description quality matters way more than people expect.
In my experience with MCP servers, the biggest win is specificity about when not to use the tool. Agents pick confidently when there's a clear boundary, not a vague capability statement.
yodsanklai 2 days ago [-]
Not an expert, but I think they will primarily use the tools that are used in the training data, so it can be difficult to have them use your shiny new tool. Also good luck trying to have them use your own version of a standard unix tool with different conventions.
dmpyatyi 1 day ago [-]
But new models are popping up every few months, which means they're retrained every couple of months.
I don't know if there's a correlation between what an LLM would choose now and how your product should look to most likely end up in an LLM's data set.
In that YC video I mentioned in the post body they discuss a tool called Resend, something like an email gateway for receiving/sending mail. What's interesting: there are a lot of tools like that, but LLMs choose shiny new Resend every time.
Seems like there's something more to it than just being on the internet for a long time :)
DANmode 2 days ago [-]
The marketing industry is currently calling SEO for chatbots “GEO”.
I hope it doesn’t stick.
fenix1851 1 day ago [-]
I think the thing you mentioned is more about reverse-engineering the web-search tool call to understand how models formulate their responses.
The tool I haven't seen: "custdev for agents". So we could simulate the choosing process for them in thousands of different scenarios, and then compare how tasty a product looks to Claude, Gemini, or any other LLM.
Correct me if I'm wrong :)