#select_tools
Current status: still seems promising, but it's extremely dependent on the system prompt. We already are - #1357369057334136882 message - so that's not a dealbreaker, but now it's just even more important.
What I'm finding though is it's hard to come up with one system prompt that makes all models happy. That's still the goal, but yeah. So far I've been able to get a system prompt that makes OpenAI and Gemini happy, but not Claude. And I can craft a system prompt that makes Claude happy, but not Gemini.
The march continues...
Worth noting the issue with Claude when Gemini+OpenAI are happy is that it doesn't call returnToUser - so the same issue as on main. For some reason it's just not keen to call it. The system prompt needs to REALLY focus on it (like "look at returnToUser and work backward")
Still have things to try. Just pulling up the curtain a bit!
One of the breakthroughs was listing the return type for each tool in the selectTools description.
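To make the return-type idea concrete, here is a rough sketch of how a selectTools description could list each tool with what it returns. The `toolInfo` type, the wording of the description, and the tool names in `main` are assumptions for illustration, not what the PR actually does.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toolInfo is a hypothetical record of one selectable tool.
type toolInfo struct {
	Name       string // e.g. "Container_withExec"
	ReturnType string // e.g. "Container"
}

// selectToolsDescription renders a description that lists every tool
// together with the type it returns, sorted by name for stable output.
func selectToolsDescription(tools []toolInfo) string {
	sorted := append([]toolInfo(nil), tools...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	b.WriteString("Select tools to make them available. Available tools:\n")
	for _, t := range sorted {
		fmt.Fprintf(&b, "- %s (returns %s)\n", t.Name, t.ReturnType)
	}
	return b.String()
}

func main() {
	fmt.Print(selectToolsDescription([]toolInfo{
		{Name: "Container_withExec", ReturnType: "Container"},
		{Name: "Container_stdout", ReturnType: "String"},
	}))
}
```

Seeing the return type up front presumably lets the model plan a chain of calls that ends in the type it has to hand back.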
So in terms of dependency on system prompt, so far this is a regression right? Since so far only Gemini has needed one
Not really. Claude needs one right now too and is pretty broken since the introduction of the return tool. I really don't think we can get by without one at this point.
(see linked message)
I wonder if mcp.run has the same issue. They also claim a dynamic/programmable mcp server, but seem to have gotten it to work over MCP
Maybe worth looking at which tools they expose
For sure they don't rely on system prompt since they got it to work over MCP
maybe they use the instructions API?
(but I would be surprised just from how chaotic everything is haha)
that's the thing - the instructions API is explicitly for this sort of thing
so I don't think we should think of it as "MCP doesn't support system prompts"
yeah but do any clients actually implement it?
if they don't, that's their problem, and more reflective of how new and incomplete the whole field is
Goose implements it
a few quick observations too:
- I noticed in a trace that return wasn't always selected (seemed like it had to be explicitly selected in selectTools). I think probably that should be only for object functions, not meta-tools
that's already the case, they don't have to select that one, some just do that anyway yeah
ah, but maybe remove it from the list then (or maybe it's not even in the list and they hallucinate it)
yeah it's a hallucination lol
models say the darndest things - at one point Gemini (when poorly prompted) was calling selectTools(["Workspace_research", "Workspace_research", "Workspace_research"]) when told to "research and record three findings"
the models were all also ignoring the required self arg like 50% of the time 😬 - which I've worked around by having it default to the most recent object of the desired type
also self is now named Foo, to match Foo_bar, which I thought would help. it didn't, but I decided to keep it that way anyway. (Gemini sometimes mangles words that are special words in Python, like return -> return_, hence the switch to returnToUser)
i actually kind of like this trick though - it means you can hop between objects somewhat safely without extra calls to switch selections etc.
but, it's a bit of a crutch, I would much prefer they remember object IDs and use them accurately
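A minimal sketch of that fallback, assuming objects get IDs like `Container#2` numbered per type. The `objectStore` type and its methods are made up for illustration and are not the real implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// objectStore is a hypothetical registry that numbers objects per type.
type objectStore struct {
	counts map[string]int // how many objects of each type exist so far
}

func newObjectStore() *objectStore {
	return &objectStore{counts: map[string]int{}}
}

// add registers a new object and returns its ID, e.g. "Container#2".
func (s *objectStore) add(typeName string) string {
	s.counts[typeName]++
	return fmt.Sprintf("%s#%d", typeName, s.counts[typeName])
}

// resolveSelf picks the object a call operates on. An explicit ID wins
// (it is not validated in this sketch); if the model omitted it, fall
// back to the most recent object of the desired type.
func (s *objectStore) resolveSelf(typeName, explicitID string) (string, error) {
	if explicitID != "" {
		return explicitID, nil
	}
	n := s.counts[typeName]
	if n == 0 {
		return "", errors.New("no " + typeName + " object exists yet")
	}
	return fmt.Sprintf("%s#%d", typeName, n), nil
}

func main() {
	s := newObjectStore()
	s.add("Container")
	s.add("Container")
	id, _ := s.resolveSelf("Container", "")
	fmt.Println(id)
}
```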
From last time I looked into it, their tools aren't nearly as dynamic. It removes the pain of managing the MCP server config .json, by putting their single stateful entrypoint in the middle (tied to your account where you install servlets, i.e. tools, from their marketplace). But in the end those tools/servlets are already designed to be exposed over MCP; they don't have a sophisticated stateful object/API/function calling/output returning scheme like we do, so they shouldn't need a system prompt at all.
ah I see. I was wondering if they added an indirection in how llms consumed tools, so that different llm sessions could get different tools.
update: PR nearly done, updated the description with eval results + more details https://github.com/dagger/dagger/pull/10134
Testing this now with the quickstart on different models
Currently hitting an issue where the actual operations that modify my environment are showing as cached and my saved container is unmodified
https://v3.dagger.cloud/kpenfound/traces/3af5f19a4d48beb8cbeb55a819adb8ad
Just to set expectations: this is with qwen and afaik the baseline there is already the floor. Would be really nice if we can get it to work but it hasn't been a priority for this PR - I've been focusing on the usual providers first
I guess it amounts to doing a lot more prompting, I see you've got a hefty one there. Some of it is out of date (I see "Select Container#1")
Haha yeah I'll sign the waiver that says qwen might not work properly. My goal is to find what amount of prompting gets different sizes of qwen to kind of work. I was surprised with this one because in the trace it looks like it actually did the right thing
no idea what's going on here lol - I set ?debug on my local Cloud UI which shows literally all spans (even passthrough) and it's not like there's a hidden call or anything. You can see a revealed passthrough span at the bottom
like why's it saying "I have created ..." when it didn't even take any action 
The repeated selectTools might be something we can system prompt away
ok, I wasn't sure if that was a quirk of the new API or not. Agreed with the repeated selects. Do you know where the tool calls at the end are actually happening?
it seems like it failed to call withExec a couple times and then just returned an older Container
ah, thank you, that explains it
the cached statuses are just because the actual effects are cached, like at the buildkit level. so that part worked fine I think?
hmm it also bailed early because it called save which is normally a hint to our loop that it's done (and it can break). in this case it looks like it was treating it more like a checkpoint. it might have kept going if it didn't call it so early
i added that break there because otherwise I saw qwen just repeatedly calling return (which is now save) with the same value
without debug, this is what was confusing me. It looks like it did the right thing and all was good
yeah I also saw the repeated return on 0.18.3
i'll add some prompting to say only use save when you're done for-reals
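To illustrate why an early save cuts the run short, here is a toy version of the loop. The `toolCall` type and `runLoop` are hypothetical; the real loop talks to a model rather than replaying a fixed list.

```go
package main

import "fmt"

// toolCall is a hypothetical record of one tool call made by the model.
type toolCall struct {
	Name string
}

// runLoop replays the model's tool calls in order and stops at the first
// save, treating it as the signal that the model is done.
func runLoop(calls []toolCall) (executed []string, saved bool) {
	for _, c := range calls {
		executed = append(executed, c.Name)
		if c.Name == "save" {
			return executed, true
		}
	}
	return executed, false
}

func main() {
	// The model uses save as a checkpoint, so the later withExec never runs.
	executed, saved := runLoop([]toolCall{
		{Name: "withExec"}, {Name: "save"}, {Name: "withExec"},
	})
	fmt.Println(executed, saved)
}
```

The break is what stops a model from calling save forever with the same value, at the cost of ending the run when save is used as a checkpoint.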
ah, what happened was it made the same failed go build call from earlier, against Container#1 instead of the newly returned Container#2. and since it was the same thing, the span got deduped
yeah maybe you have to write it differently
i'll copy some parts of the default system prompt into the user prompt
i'm not too familiar with ollama but i'd expect it to at least apply stronger weight to a system prompt still
so in my mind that would just make things worse, but 
yeah who knows, all of the "openAI compatible APIs" so far have taken the "compatible" part pretty loosely
I've been doing a bunch of testing with local models on this branch. The issues I'm hitting aren't specific to this change, so I can start a new thread if that's better
The challenge I'm trying to overcome is that it's currently impossible for a local model (like qwen2.5-coder:14b on ollama) to complete the quickstart - even with toy-workspace instead of container.
To set expectations, I recreated the quickstart using openai's agent sdk and the same model has no problems at all completing the "make a curl clone" prompt.
I guess the question is: how urgent is this problem?
@lapis narwhal suggested trying 32b but I haven't yet - did you try it by any chance?
Yeah I've mostly tried 32b with dagger, even more customized versions of it from huggingface
fwiw I used qwen 32b a lot on 17-llm.X with success on single-object, so it's still a tool calling complexity issue
hmm interesting
i'm trying to avoid the idea that smaller models will need us to maintain an entirely different scheme 
partly because if that is the answer, we'd really need to justify the existence of the more complex one (e.g. it makes it better at more sophisticated tasks, somehow larger models prefer the more sophisticated scheme, or the more sophisticated scheme saves you a ton in token cost). but maybe they're low ish maintenance once you figure it out
I even got 3b (three) to work with the openai agents sdk, but not consistently
like this part is interesting - it seems to think it called those tools, instead of just making them available
there might be a simple fix for that if so
right now selectTools just replies with "ok" lol
it could either:
- reply with "You now have the %s tool available."
- reply with a list of tool schemas - which might double as a workaround for MCP clients that don't support the tools-changed notification (or it's racy etc)
maybe those will help - i'll try 2) first since the secondary effect would be neato (compatibility with more clients)
here's the current version of the prompt I was trying with qwen 32b and the exact quickstart we have today that uses container https://github.com/kpenfound/agents/blob/67d0f355e6191b0392ad98fcca8c83eae4b82c8b/quickstart/prompts/qwen2.5-coder.md?plain=1
and the models I've been trying
hf.co/unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF:Q4_K_M
hf.co/unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF:Q8_0
qwen2.5-coder:32b
qwen2.5-coder:14b
if you push something like this i'll rebuild and try it
here's the openai version btw https://gist.github.com/kpenfound/c0227e0e308ed3161fcde27e3f289d60
lmk how it goes
...gotta make a zed extension first lol
you shouldn't need to, the mcp configs are hackable
couldn't find it in their docs
{
  "context_servers": {
    "dagger": {
      "command": {
        "path": "dagger",
        "args": ["mcp", "-m", "~/src/dagger"],
        "env": {
          "ANTHROPIC_API_KEY": "REDACTED",
          "NO_COLOR": "true"
        }
      },
      "settings": {}
    }
  }
}
in your zed config
ah cool - i just told an agent to do it for me and let it rip, since we might want it eventually lol
if it gets stuck ill give that a go instead
Yooooo that did it https://v3.dagger.cloud/kpenfound/traces/ea99e098ec75c34ca93ade59e1990aeb
This was qwen2.5-coder:14b. It got confused with immutable objects at first but then it got there
HOT DAMN!
can't figure out how to get zed to actually use a MCP server 
i configured it but it never runs it
lol turns out this confuses Goose
one step forward, one step back
maybe because it thinks it replaced the full tool set, including selectTools?:
( O)> gimme a go container
I'll help you create a Go container environment. Let me use Dagger to set this up.
─── selectTools | dagger ──────────────────────────
tools:
-
container
Hmm, I see that I can create a container but I need some additional tools to properly set it up with Go. Let me search for any available extensions that might help.
─── search_available_extensions | platform ──────────────────────────
I apologize, but I notice that I currently don't have access to the full set of container tools needed to pull a Go image or configure it. While I can create a base container, I don't have access to the methods needed to:
1. Pull a Go image
2. Configure it with necessary settings
3. Install dependencies
Would you like me to:
1. Create just a basic scratch container (though it wouldn't be very useful without additional configuration)
2. Wait for additional container tools to be enabled
3. Try a different approach to help you with your Go development needs
when I change it back to ok it figures it out just fine
currently it's responding with {"tools":...} - maybe that should be {"addedTools":...}
oh yep that fixed it
gonna try disabling the mcp tools changed notification and see if this fools goose into using them 
lol yep
it took "tools":... too literally
(very interesting that it respected it that hard, even...)
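For reference, a sketch of what the `addedTools` reply could look like when built in Go. Only the `addedTools` key comes from the discussion above; the `toolSchema` fields and the example description are assumptions (a real schema would also carry the input schema).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolSchema is a trimmed-down, hypothetical tool schema.
type toolSchema struct {
	Name        string `json:"name"`
	Description string `json:"description"`
}

// selectToolsResult uses "addedTools" rather than "tools" so the client
// does not read the reply as a replacement for its whole tool set.
type selectToolsResult struct {
	AddedTools []toolSchema `json:"addedTools"`
}

// selectToolsReply renders the reply text for the tools just selected.
func selectToolsReply(selected []toolSchema) (string, error) {
	out, err := json.Marshal(selectToolsResult{AddedTools: selected})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	reply, err := selectToolsReply([]toolSchema{
		{Name: "container", Description: "Creates a scratch container"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(reply)
}
```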
4.1?
( O)> gimme a go container
I'll help you create a Go container using Dagger. Let me select the container tool and set it up.
─── selectTools | dagger ──────────────────────────
tools:
-
container
─── think | dagger ──────────────────────────
thought: ...
─── container | dagger ──────────────────────────
Execution failed: RPC error: code=-32602, message=tool 'container' not found: tool not found
Gathering computational momentum...
2025-04-17T21:53:43.433047Z ERROR goose::agents::agent: Error: Request failed: Request failed with status: 400 Bad Request. Message: messages.5: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_01RdGLghXzbXdfK6FmE5QXTt. Each `tool_use` block must have a corresponding `tool_result` block in the next message.
at crates/goose/src/agents/agent.rs:530
it does indeed - it only failed because I also disabled the bit that actually wires up the tool on the server side.
what makes you think it's also not wired up client side?
i guess it'd prolly fail earlier if that was the case?
it's in the core/mcpserver.go file in the MCP server handling path 
yeah, afaict this is showing the client went ahead and called a nonexistent tool just based on being told it has them in the selectTools response
i just can't tell from the output whether it actually tried to call it
this line:
─── container | dagger ──────────────────────────
and then I think that error must come from the mark3labs package
the error matching would be enough for sure
love it, text generation all the way down
```
2025-04-17T21:53:43.433047Z ERROR goose::agents::agent: Error: Request failed: Request failed with status: 400 Bad Request. Message: messages.5: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_01RdGLghXzbXdfK6FmE5QXTt. Each `tool_use` block must have a corresponding `tool_result` block in the next message.
```
this thing plagues me, and apparently many others
i don't understand how they haven't fixed it
ah that should be fixed on my branch
it's actually our bad
we're just raising an error internally instead of appending an errored tool response
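Roughly what that fix amounts to, as a sketch: every tool call gets exactly one result appended, and a failure becomes an errored result instead of aborting the turn. The `toolUse` and `toolResult` types and `demoExec` are hypothetical stand-ins, not the types used on the branch.

```go
package main

import (
	"errors"
	"fmt"
)

// toolUse and toolResult are hypothetical stand-ins for the provider's
// tool_use and tool_result message blocks.
type toolUse struct{ ID, Name string }

type toolResult struct {
	ToolUseID string
	Content   string
	IsError   bool
}

// runTools executes each call and appends exactly one result per call.
// A failed call becomes an errored result rather than an early return,
// so no tool_use is left without a matching tool_result.
func runTools(calls []toolUse, exec func(toolUse) (string, error)) []toolResult {
	results := make([]toolResult, 0, len(calls))
	for _, c := range calls {
		out, err := exec(c)
		if err != nil {
			results = append(results, toolResult{ToolUseID: c.ID, Content: err.Error(), IsError: true})
			continue
		}
		results = append(results, toolResult{ToolUseID: c.ID, Content: out})
	}
	return results
}

// demoExec is a toy executor that fails for the "container" tool.
func demoExec(c toolUse) (string, error) {
	if c.Name == "container" {
		return "", errors.New("tool 'container' not found")
	}
	return "ok", nil
}

func main() {
	calls := []toolUse{{ID: "toolu_1", Name: "selectTools"}, {ID: "toolu_2", Name: "container"}}
	for _, r := range runTools(calls, demoExec) {
		fmt.Printf("%s error=%v %s\n", r.ToolUseID, r.IsError, r.Content)
	}
}
```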
there are similar issues on claude code and goose too
ahhhh interesting that makes sense
apparently there are situations where Anthropic's API itself returns responses that trip that, too
and yeah, i just switched back to claude with non-error example i got working on open ai and it keeps working
update: went back to calling it select_tools (previously load_tools) (previously selectTools). Gemini showed markedly worse evals with load_tools, yielding an incredibly opaque FinishReason(10) in the majority of runs, which means "the agent generated a malformed function call".
It's incredibly frustrating to troubleshoot this error since there isn't any other info returned besides what you get the agent to tell you (and likely hallucinate). But going on what it says it'll do before it blows up, it seems like with the name as load_tools the model was prone to jump right to using tools without loading them first. Renaming back to select_tools makes a dramatic difference.
Just an example of the butterfly effect you can have from tiny changes, and why evals are critical. This took hours to figure out, after many stabs in the dark 😵‍💫
Still need someone to approve this PR - CI is green
no API breaking changes?
Nothing breaking - I changed historyJSON to return JSON instead of String but kept module compatibility
merged - ty!
can corroborate that the load_tools felt worse as well, glad to hear the evals backed up the vibes.
the goal was to prevent models from loading the same tools over and over, since they stick around, but oh well
select_tools
there was some point on the mcp demo branch where we had the tool name change btw, but the descriptions/prompts weren't updated... idk if you noticed that on your end and fixed it for the evals
yeah, i fixed that and it made a big difference for some models, but Gemini still struggled a lot with FinishReason(10)
on 18.4 qwen is selecting tools nicely, which is awesome. It's tripping over outputs:
🤖 {"completed": "Container#3"}
What is the tool it needs to do that correctly so I can prompt it? save?
yeah save is how it can return outputs, but it should be getting prompted on that already via the system prompt 
I'm somewhat skeptical of ollama's openai compat api and system prompts... but that's just my gut and haven't tested anything. I'll guide towards save in my user prompt