#last listed tool carries more weight
This is the sort of thing that can make a new approach feel totally invalidated one day and then totally validated another day, without knowing why. In any other context you might think to list the most important thing first. Lots of voodoo black magic type things like this with AI - evals are super important. 😅
Worth noting that this eval doesn't even add a ton of tools - just a few for the Testspace type
Thank you @fickle sky ! This is exactly something I wanted to try and yeah the context window size makes sense.
What’s the score without the system prompt tool at all?
SUCCESS RATE: 0/20 (0%)
Gemini needs a lot of hand-holding
It just replies with this for all of them:
🤖 💬 I can only select a Testspace to start working. Which Testspace ID would you like me to select?
even though it has a single tool, selectTestspace, whose description lists a single ID to select
Was this eval test with Gemini only?
Yeah - it's the only one really capable of running 20 in parallel like that, and is pretty inept out of the box, so it's good for testing the prompts + tool calling scheme. I throw the evals at the other models too but only after getting Gemini solid
there's an explore function that wanders around and runs various evals against various models, backing off as necessary to deal with rate limits
Should we add the system prompt tool for all models or just for Gemini?
dagger-dev -m github.com/vito/daggerverse/botsbuildingbots call --model gemini-2.0-flash explore
(that --model is for the one doing the exploring, it calls other models)
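For anyone curious what "backing off as necessary" could look like, here is a minimal Python sketch of retrying a rate-limited call with exponential backoff and jitter. It is not the actual explore implementation: `RateLimited` and `call_with_backoff` are hypothetical names made up for illustration.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error standing in for a provider's rate-limit response."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            # Give up once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay * 2^attempt, plus random jitter so that
            # parallel eval runs do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The `sleep` parameter is only there so the delay can be swapped out in tests.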
I'm leaning towards just applying it for all, since it's not just Gemini that benefits, it's just the one that needs it the most out of the box, among the big AI providers. Qwen and I think Claude 3.5 Sonnet benefit too. Having per-model behavior seems like a bit of a headache
FYI @fickle sky if you do anything model-specific, it will be lost by llm.withModel() (I added a FIXME)
oh like if you start as Claude and swap to Gemini it doesn't pick up the Gemini-only system prompt, because it wasn't Gemini at init time?
right
ah ok. well, good news is that's going away anyway
will keep in mind for the future
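To illustrate the init-time problem described above, here is a rough Python sketch. It does not reflect how llm.withModel() is actually implemented: the classes, `MODEL_PROMPTS`, and the prompt text are all hypothetical. The first class picks the model-specific prompt once at creation, so swapping models keeps the stale prompt. The second derives it from the current model on every access.

```python
from dataclasses import dataclass, replace

# Hypothetical per-model system prompts, for illustration only.
MODEL_PROMPTS = {"gemini": "Always call a tool; never ask the user for an ID."}


def prompt_for(model: str) -> str:
    """Return the model-specific prompt, or an empty string if there is none."""
    return MODEL_PROMPTS.get(model, "")


@dataclass(frozen=True)
class EagerLLM:
    model: str
    system_prompt: str

    @classmethod
    def create(cls, model: str) -> "EagerLLM":
        # The prompt is chosen once, based on the model at init time.
        return cls(model=model, system_prompt=prompt_for(model))

    def with_model(self, model: str) -> "EagerLLM":
        # Only the model changes; the init-time prompt is carried over as-is.
        return replace(self, model=model)


@dataclass(frozen=True)
class LazyLLM:
    model: str

    @property
    def system_prompt(self) -> str:
        # Derived from whichever model is current, so swaps are picked up.
        return prompt_for(self.model)

    def with_model(self, model: str) -> "LazyLLM":
        return replace(self, model=model)
```

Starting as "claude" and swapping to "gemini" leaves `EagerLLM` with an empty prompt, while `LazyLLM` returns the Gemini-only one.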
Moving the system prompt into a tool isn't really panning out. Predictably, the model doesn't apply enough weight to it. Putting it at the end helped, but that resulted in pushing other important details out of that seemingly very narrow window of importance. I'm hesitant to say "context window" here because Gemini claims to have a much larger one - the issue appears to be weighting.
For example, with the system prompt exposed as a tool the model would sometimes forget to call the returnToUser tool until I moved the returnToUser tool near the end. But when I did that, it pushed the selectFoo tools out of the "window," so the model stopped recognizing the object IDs listed in their descriptions and would ask me to provide one. All of these problems go away when a real system prompt is used instead. We can't have our whole scheme competing to be the last tool.
So, I think we need to stop worrying and love the system prompt. And for MCP, we should use the "instructions" API and make sure that MCP clients support it. It's the existing documented way to inject instructions at the system prompt level, so any MCP clients that don't support it are arguably just broken/incomplete, and should be fixed through lobbying or PRs. We won't be the only ones struggling with this.
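One way to measure this position effect in an eval, rather than discovering it by accident, is to rerun the same eval once per tool with that tool moved to the end of the list. A minimal Python sketch follows. The helper is hypothetical and not part of the existing eval code, and `systemPrompt` is a made-up name for the system-prompt-as-tool.

```python
def orderings_with_each_tool_last(tools):
    """Yield (last_tool, ordering) pairs, one per tool.

    Each ordering moves one tool to the end and keeps the others
    in their original relative order.
    """
    for i, tool in enumerate(tools):
        yield tool, tools[:i] + tools[i + 1:] + [tool]


# "systemPrompt" is a hypothetical tool name used for illustration.
TOOLS = ["selectTestspace", "returnToUser", "systemPrompt"]

for last, ordering in orderings_with_each_tool_last(TOOLS):
    print(last, ordering)
```

Comparing success rates across those orderings would show how much the result depends on which tool is listed last.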
(Thanks to @onyx pumice and/or @brazen stirrup for finding that.)
It seems like Goose already supports it, so that's a good sign: https://github.com/block/goose/blob/fdf82c369c6c2f46e12e6512ffdd9eeb6e1716ad/crates/goose/src/prompts/system.md?plain=1#L25
Tibor gets all the credit ❤️
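For reference, here is a rough Python sketch of how a client could honor the MCP "instructions" field discussed above. In the MCP spec the server's `initialize` result may carry an optional `instructions` string, which the client may add to the system prompt. The response contents, server name, and `build_system_prompt` helper below are hypothetical and not taken from any real client.

```python
# Hypothetical initialize response from an MCP server. The field names follow
# the MCP spec's initialize result; the values are made up for illustration.
INITIALIZE_RESPONSE = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "protocolVersion": "2024-11-05",
        "capabilities": {"tools": {}},
        "serverInfo": {"name": "example-server", "version": "0.0.1"},
        "instructions": "Select a Testspace with selectTestspace before doing anything else.",
    },
}


def build_system_prompt(base_prompt: str, initialize_response: dict) -> str:
    """Append the server's instructions, if any, to the client's base system prompt."""
    instructions = initialize_response.get("result", {}).get("instructions")
    if not instructions:
        # The field is optional, so fall back to the base prompt unchanged.
        return base_prompt
    return f"{base_prompt}\n\n{instructions}"


print(build_system_prompt("You are a helpful assistant.", INITIALIZE_RESPONSE))
```

A client that ignores the field still works, but the model never sees the server's guidance at the system prompt level, which is the weighting problem described earlier in the thread.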