last listed tool carries more weight | Dagger

fickle sky Apr 3, 2025, 3:02 PM

#

This is the sort of thing that can make a new approach feel totally invalidated one day and then totally validated another day, without knowing why. In any other context you might think to list the most important thing first. Lots of voodoo black magic type things like this with AI - evals are super important. 😅

#

Worth noting that this eval doesn't even add a ton of tools - just a few for the Testspace type

onyx pumice Apr 3, 2025, 3:04 PM

#

Thank you @fickle sky! This is exactly what I wanted to try, and yeah, the context window size makes sense.

#

What’s the score without the system prompt tool at all?

fickle sky Apr 3, 2025, 3:08 PM

#

SUCCESS RATE: 0/20 (0%)

Gemini needs a lot of hand-holding

#

https://v3.dagger.cloud/dagger/traces/3cc55bd06d1ca8303d7861634b170c22

#

It just replies with this for all of them:

🤖 💬 I can only select a Testspace to start working. Which Testspace ID would you like me to select?

#

even though it has a single tool, selectTestspace, whose description lists a single ID to select

onyx pumice Apr 3, 2025, 3:13 PM

#

Was this eval run with Gemini only?

fickle sky Apr 3, 2025, 3:14 PM

#

Yeah - it's the only one really capable of running 20 in parallel like that, and is pretty inept out of the box, so it's good for testing the prompts + tool calling scheme. I throw the evals at the other models too but only after getting Gemini solid

#

There's an explore function that wanders around and runs various evals against various models, backing off as necessary to deal with rate limits.
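A minimal sketch of the "backing off as necessary" idea, assuming exponential backoff with jitter. This is not the actual explore implementation; `RateLimited` and `call_with_backoff` are hypothetical names made up for illustration.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a model client raises on a rate-limit response."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # The delay doubles on each attempt; jitter spreads out parallel callers.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the retry schedule can be checked without actually waiting.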

onyx pumice Apr 3, 2025, 3:16 PM

#

Should we add the system prompt tool for all models or just for Gemini?

fickle sky Apr 3, 2025, 3:16 PM

#

dagger-dev -m github.com/vito/daggerverse/botsbuildingbots call --model gemini-2.0-flash explore

(that --model is for the one doing the exploring, it calls other models)

fickle sky Apr 3, 2025, 3:20 PM

#

onyx pumice Should we add the system prompt tool for all models or just for Gemini?

I'm leaning towards just applying it for all models. It's not only Gemini that benefits; Gemini is just the one that needs it most out of the box among the big AI providers. Qwen and, I think, Claude 3.5 Sonnet benefit too. Having per-model behavior seems like a bit of a headache.

unreal salmon Apr 3, 2025, 3:47 PM

#

FYI @fickle sky if you do anything model-specific, it will be lost by llm.withModel() (I added a FIXME)

fickle sky Apr 3, 2025, 3:49 PM

#

unreal salmon FYI @fickle sky if you do anything model-specific, it will be lost by ...

oh like if you start as Claude and swap to Gemini it doesn't pick up the Gemini-only system prompt, because it wasn't Gemini at init time?

unreal salmon Apr 3, 2025, 3:49 PM

#

right
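A minimal sketch of the pitfall described above, assuming the model-specific prompt is chosen once at init time. This is not Dagger's code: the `Llm` class, `MODEL_PROMPTS`, and both `with_model` variants are hypothetical and only illustrate the shape of the bug and one possible fix.

```python
from dataclasses import dataclass, replace

# Hypothetical per-model prompts; the real wording is not in this thread.
MODEL_PROMPTS = {"gemini": "Always call a tool before replying."}


def prompt_for(model: str) -> str:
    """Return the prompt for the first matching model family, or ""."""
    return next((p for k, p in MODEL_PROMPTS.items() if model.startswith(k)), "")


@dataclass(frozen=True)
class Llm:
    model: str
    system_prompt: str = ""

    @classmethod
    def create(cls, model):
        # The model-specific prompt is chosen once, at init time.
        return cls(model, prompt_for(model))

    def with_model_stale(self, model):
        # Pitfall: swaps the model but keeps the prompt picked at init.
        return replace(self, model=model)

    def with_model(self, model):
        # One possible fix: re-derive model-specific state on every swap.
        return replace(self, model=model, system_prompt=prompt_for(model))
```

Starting as Claude and swapping to Gemini via `with_model_stale` leaves the system prompt empty, which is the situation described in the thread.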

fickle sky Apr 3, 2025, 3:49 PM

#

ah ok. well, good news is that's going away anyway

unreal salmon Apr 3, 2025, 3:49 PM

#

https://github.com/dagger/dagger/issues/10039

fickle sky Apr 3, 2025, 3:49 PM

#

will keep in mind for the future

fickle sky Apr 4, 2025, 6:48 PM

#

Moving the system prompt into a tool isn't really panning out. Predictably, the model doesn't apply enough weight to it. Putting it at the end helped, but that resulted in pushing other important details out of that seemingly very narrow window of importance. I'm hesitant to say "context window" here because Gemini claims to have a much larger one - the issue appears to be weighting.

For example, with the system prompt exposed as a tool, the model would sometimes forget to call the returnToUser tool until I moved it near the end of the tool list. But doing that pushed the selectFoo tools out of the "window," so the model stopped recognizing the object IDs listed in their descriptions and would ask me to provide one. All of these problems go away when a real system prompt is used instead. We can't have our whole scheme competing to be the last tool.

So, I think we need to stop worrying and love the system prompt. And for MCP, we should use the "instructions" API and make sure that MCP clients support it. It's the existing documented way to inject instructions at the system prompt level, so any MCP clients that don't support it are arguably just broken/incomplete and should be fixed through lobbying or PRs. We won't be the only ones struggling with this.

(Thanks to @onyx pumice and/or @brazen stirrup for finding that.)

It seems like Goose already supports it, so that's a good sign: https://github.com/block/goose/blob/fdf82c369c6c2f46e12e6512ffdd9eeb6e1716ad/crates/goose/src/prompts/system.md?plain=1#L25
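A rough sketch of how the MCP "instructions" mechanism fits together, assuming the optional `instructions` field on the server's `initialize` result. The server name, version, and instruction text are made up, and `build_system_prompt` is a hypothetical client-side helper, not Goose's or Dagger's actual code.

```python
import json

# Shape of an MCP `initialize` result; `instructions` is an optional field.
# The server name, version, and instruction text below are made up.
initialize_response = json.loads("""
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "protocolVersion": "2024-11-05",
    "capabilities": {"tools": {}},
    "serverInfo": {"name": "example-server", "version": "0.0.1"},
    "instructions": "Select a Testspace before calling any other tool."
  }
}
""")


def build_system_prompt(base_prompt, init_response):
    """Append server instructions to the client's system prompt, if present."""
    instructions = init_response.get("result", {}).get("instructions")
    if not instructions:
        return base_prompt  # server sent no instructions; leave the prompt as-is
    return f"{base_prompt}\n\n{instructions}"
```

A client that skips this step never shows the server's guidance to the model, which is the "broken/incomplete" case described above.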

brazen stirrup Apr 4, 2025, 6:49 PM

#

fickle sky Moving the system prompt into a tool isn't really panning out. Predictably, the ...

Tibor gets all the credit ❤️
