# Cannot decode `Directory#1`
there are 2 bugs in one bug. First is that the LLM is not great at typechecking: it sometimes passes a HelloDagger#1 where a dagger.Directory is expected: https://v3.dagger.cloud/tiborvass/traces/c218cb47a4d1c0fe518206ea305f957d
Thankfully I won't be hitting this since i'll be autoselecting it with @young solstice 's PR.
The other bug is that when it does choose a directory, it uses the string Directory#1 as-is instead of looking up its base64 ID. Where is the correspondence between the LLM-friendly numbers and the dagger IDs?
I'm testing on main.
https://dagger.cloud/tiborvass/traces/5bef0311f17844dd789f50488a038719
Cannot decode Directory#1
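To make the two failure modes concrete, here is a minimal sketch of what a handle-to-ID table could look like. It is not Dagger's actual implementation: `HandleRegistry`, `UnknownHandleError`, `register`, `resolve` and the placeholder ID are all hypothetical names invented for illustration. It only assumes what the messages above describe, namely that the model sees short handles such as `HelloDagger#1` while the engine needs an opaque base64 ID.

```python
import base64


class UnknownHandleError(LookupError):
    """Raised when a handle such as 'Directory#1' was never issued."""


class HandleRegistry:
    """Hypothetical table mapping LLM-friendly handles to opaque IDs."""

    def __init__(self):
        self._ids = {}       # handle -> opaque ID string
        self._counters = {}  # type name -> last number issued

    def register(self, type_name, object_id):
        # Issue the next handle for this type, e.g. "Directory#1".
        number = self._counters.get(type_name, 0) + 1
        self._counters[type_name] = number
        handle = f"{type_name}#{number}"
        self._ids[handle] = object_id
        return handle

    def resolve(self, handle, expected_type):
        # Second bug in the thread: the handle must be looked up,
        # never forwarded as-is in place of the real ID.
        if handle not in self._ids:
            raise UnknownHandleError(f"cannot decode {handle}: never created")
        # First bug in the thread: the handle's type must match the argument.
        type_name = handle.partition("#")[0]
        if type_name != expected_type:
            raise TypeError(f"{handle} is a {type_name}, expected {expected_type}")
        return self._ids[handle]


registry = HandleRegistry()
fake_id = base64.b64encode(b"placeholder-hello-id").decode()  # not a real Dagger ID
hello = registry.register("HelloDagger", fake_id)
```

With a table like this, a hallucinated `Directory#1` fails in `resolve` with a lookup error, and passing `HelloDagger#1` where a `Directory` is expected fails with a type error, so the two bugs would surface as two distinct messages.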
what LLM impl are you using, btw? im a lil surprised by this error but maybe we just have different preferred providers
4o
i never hit it with claude 3.5
with claude 3.5 it is asking "which directory?" instead of trying the default, which tbh is the actual bug. https://v3.dagger.cloud/tiborvass/traces/c24a5d944b9ec91265331e8cf496bea8
So now i'm starting to think that 4o is hallucinating a Directory#1.
How can I check all the variables in a running shell?
once i've opened a prompt, i go back to the shell and then do $agent | env | inputs | name
i am almost entirely certain there is a less stupid way than that though lmao
TIL $agent
it used to be .llm
Now, i just repro'd the same behavior as 4o on claude 3.5 (it's trying to pass Directory#1 but it was never created).
$agent | env | inputs | name returns just hello
may want to base your repro + fixes on https://github.com/dagger/dagger/pull/10134
it's a large messy pr atm, sorry - but the tool calling scheme is very different, so I'm curious if any of the stuff I've already done addresses the issue in any way
sure! will try now
the evals are in-repo now, so you can do stuff like this:
# run a few attempts of an eval, analyze the results, suggest a new system prompt
dagger-dev -m modules/evaluator call --model claude-3-7-sonnet-latest --docs ./core/llm_docs.md --initial-prompt ./core/llm_dagger_prompt.md evaluate --model gpt-4.1 --name BuildMulti
# or to run a single eval
dagger-dev -m modules/evaluator/evals call build-multi report
i had procrastinated cloning them or learning how to use them in their original location, and now you have rewarded my procrastination
Hey Alex, do we agree that this PR relies heavily on the system prompt to fix all those edge cases?
I lost track with all the ideas
So, in order to have MCP - PROMPT parity, we need to re-focus on the system prompt eval / idea from @tawdry veldt.
Because none of the fixes from this branch will currently carry over to MCP. cc @young solstice
Sorry Alex, got sidetracked, will get back into testing your branch. And I think Guillaume is referring to handling prompts via Instructions in MCP. If we start relying heavily on the system prompt again, we'll need to prioritize making that work with MCP. /cc @young solstice
I'll try to add an eval and run it with your instructions above
maybe a hot take but we should keep focus on a minimal mcp demo unless merging selectTools is hyper-imminent, and if it is we should acknowledge we're gonna have some serious logical merge conflicts to iron out
right now the intuitive prompt is hitting the Directory#1 issue, so i'm trying to fix that.
this PR adds a system prompt, but that's not the central goal, it's also a revamped tool calling scheme altogether (no more 'current selection', instead there's a growing set of tools selected by the model)
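For readers following along, here is a rough sketch of the "growing set of tools" idea as described in the message above. It is only an illustration under assumptions: `ToolSession`, `select_tools` and `call` are hypothetical names, and the real PR may work quite differently.

```python
class ToolSession:
    """Hypothetical session where the model opts in to tools over time."""

    def __init__(self, catalog):
        self._catalog = dict(catalog)  # every tool that exists: name -> callable
        self._selected = {}            # tools the model has selected so far

    def select_tools(self, names):
        # Reject unknown names up front so the selection is all-or-nothing.
        missing = [name for name in names if name not in self._catalog]
        if missing:
            raise LookupError(f"unknown tools: {missing}")
        # The selected set only grows; there is no single 'current selection'.
        for name in names:
            self._selected[name] = self._catalog[name]
        return sorted(self._selected)

    def call(self, name, *args):
        # Only tools that were selected earlier can be called.
        if name not in self._selected:
            raise LookupError(f"tool {name} has not been selected")
        return self._selected[name](*args)


session = ToolSession({"upper": str.upper, "length": len})
first = session.select_tools(["upper"])
second = session.select_tools(["length"])
```

After the second call both tools stay available, which is the difference from a scheme where selecting one object replaces the previous selection.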
I have a throwaway commit in the PR that also bubbles up the system prompt through MCP instructions but I haven't tested it yet. Goose respects it in theory
you're right though - i'm mostly looking at this trying to prevent redundant work but if this is a blocker for MCP maybe just go forward - at worst you'll find a different solution and I'll see how to adjust my branch if needed @tawdry veldt
i don't think i'm far from merging, but there's a lot of clean-up to do (as in splitting up the PR potentially)
good news, the problem seems to go away with your branch! Will need to do more tests with MCP but it looks promising
oh sweet. what does the eval/repro look like?
Will send it to you later, but basically it's the QuickStart example, and the prompt for now is just “build dagger hello”