#multi-object prompt/metaphor engineering
(took a liberty with thread title here)
Yeah sure!
I'm a bit worried that our metaphor isn't coming across perfectly to the model. I was trying a lot of subtle prompt engineering on Friday to suss out whether it truly understands that it can hop between tools and treat tool call results as restorable checkpoints, and that there's a "current object/state/context" that its tool calls are always relative to. I wasn't having much luck at all.
For example, despite already being in the context of a Container and being clearly told so (via yelling at it and teaching it on-the-fly), it would always call Container.from chained from the current Container instead of going back to a previous checkpoint or starting from scratch.
I even tried things like naming tools after their object: Container_xxh3-abcdef_from.
Having a few evals/scenarios (and agreeing on the ones we care about) would go a long way.
But this was all just with that toy example:
- give me a container for PHP dev
- install an editor (it chooses vim or nano)
- instead of installing (vim or nano), go back to before you installed it and install (whatever) instead
If it really understood what was going on, I figure it should call _selectTools or whatever to swap back to the result of step 1 and install the new editor. And ideally step 3 would require minimal hints to figure that out, but that's just a bonus
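To pin down what "swap back to the result of step 1" means, here is a rough Go sketch of the checkpoint model. Everything in it (State, Session, Exec, Select, the s0/s1 IDs) is made up for illustration and is not Dagger's API; it only assumes that every call derives a new immutable state and that older states stay addressable.

```go
package main

import "fmt"

// State is an immutable snapshot, a hypothetical stand-in for Container@xxh3:...
type State struct {
	ID     string
	Layers []string
}

// Session tracks every state produced so far plus the currently selected one.
type Session struct {
	states  map[string]State
	current string
}

func NewSession() *Session {
	root := State{ID: "s0"}
	return &Session{states: map[string]State{"s0": root}, current: "s0"}
}

// Exec never mutates the current state; it derives a new one and selects it.
func (s *Session) Exec(cmd string) string {
	prev := s.states[s.current]
	layers := append(append([]string{}, prev.Layers...), cmd)
	next := State{ID: fmt.Sprintf("s%d", len(s.states)), Layers: layers}
	s.states[next.ID] = next
	s.current = next.ID
	return next.ID
}

// Select switches the current context back to any earlier state.
func (s *Session) Select(id string) error {
	if _, ok := s.states[id]; !ok {
		return fmt.Errorf("object not found: %s", id)
	}
	s.current = id
	return nil
}

// Current returns the state that tool calls are relative to right now.
func (s *Session) Current() State { return s.states[s.current] }

func main() {
	s := NewSession()
	php := s.Exec("apt-get install -y php") // step 1: PHP dev container
	s.Exec("apt-get install -y vim")        // step 2: install an editor
	if err := s.Select(php); err != nil {   // step 3: go back to the step 1 result
		panic(err)
	}
	s.Exec("apt-get install -y nano")
	fmt.Println(s.Current().Layers) // no vim layer in the final state
}
```

The behavior I kept seeing corresponds to skipping the Select call and running Exec on top of the vim state.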
I tried playing around with the tool call results to be more explicit, like Switched context to Container@xxh3:abcdef instead of just Container@xxh3:abcdef, but it seemed really resistant to using the result - only if I explicitly told it "look at the tool call result"
Alternatively, we could do something akin to what you were suggesting: updating variables in-place as the result changes. My only hesitation there is that you could easily end up with something like go=$(container) turning into a Directory or some other type depending on which APIs the model calls, which would make variables in the shell change in confusing ways.
But, there may still be something to that idea. Maybe variables are the wrong metaphor, and really they're just stateful slots or something abstract that an AI might grok more easily than a human.
I'll stop wording now.
@shrewd cedar maybe/kinda related, but are you able to repro this? I also had this problem: #agents message
fixed on llm but need to ship
still a few more meetings...
@shrewd cedar the tool descriptions themselves go straight into the context so they are high leverage
Back
@shrewd cedar if not too late, I'm around to talk live whenever
(can also wait if you're in the zone)
@astral star btw - I'm thinking that Loop() actually serves a purpose and shouldn't be deprecated in favor of Sync
return dag.Llm().
	SetContainer("ctr", dag.Container()).
	WithPrompt("give me a container for PHP development").
	Loop().
	WithPrompt("now install an editor").
	Container()
without Loop we just send both prompts immediately, instead of doing A and then B, which has very different behavior
if I have to use Sync(ctx) that means I have to "break the chain"
low conviction, mostly sparking convo, it's been in the back of my head
or should WithPrompt imply a sync before each one?
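A toy model of the difference, in case it helps the discussion. This is not how the engine implements it; toyLLM, its fields and Result are hypothetical, and the only assumption is that prompts queue up lazily until something forces a turn.

```go
package main

import "fmt"

// toyLLM is a hypothetical model of a lazy prompt chain, not Dagger's LLM type.
type toyLLM struct {
	pending []string   // prompts queued since the last flush
	turns   [][]string // prompts delivered together in one model turn
}

// WithPrompt queues a prompt without sending anything.
func (l toyLLM) WithPrompt(p string) toyLLM {
	l.pending = append(append([]string{}, l.pending...), p)
	return l
}

// Loop flushes the queued prompts as a single turn and keeps the chain going.
func (l toyLLM) Loop() toyLLM {
	if len(l.pending) == 0 {
		return l
	}
	l.turns = append(append([][]string{}, l.turns...), l.pending)
	l.pending = nil
	return l
}

// Result forces evaluation at the end of the chain (assumed to flush once).
func (l toyLLM) Result() [][]string { return l.Loop().turns }

func main() {
	batched := toyLLM{}.WithPrompt("A").WithPrompt("B").Result()
	stepped := toyLLM{}.WithPrompt("A").Loop().WithPrompt("B").Result()
	fmt.Println(len(batched), len(stepped)) // turns without Loop vs. with Loop
}
```

If WithPrompt implied a sync, it would amount to calling Loop at the top of every WithPrompt in this model.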
let me know if you need any data points for model behavior here.
This is llama3.2 (small and relatively dumb, 3b)
single object without chaining works fine: llm | with-directory $source | with-prompt "you have a directory. what are its contents?" | last-reply (proceeds to do directory.entries(/) and show the result)
chaining works fine:
llm | with-directory $source | with-prompt "you have a directory. what are the contents of /README.md?" | last-reply 12.2s
:adult: you have a directory. what are the contents of /README.md?
(proceeds to show contents of the file)
multi object fails hard:
> does my app test and build?
Based on the tool call response, it appears that there is an error in your app testing and building process.

The error message "object not found: HelloDagger_id" suggests that there was a problem trying to identify or locate a file or resource named "HelloDagger-id".

Additionally, the hash xxh3:e17a70fccb0ec71a appears to be an unknown object in your environment.

Therefore, I would recommend checking the following:
...
found a pretty sweet pattern for writing + testing evals, will push to a branch for now
https://github.com/dagger/dagger/pull/9882
usage: dagger run go test
on this branch I re-enabled the multi-object tools at all times, instead of having them opt-in when you set a var, since the UndoSingle example consistently uses _undo, which seems nice
feel free to push more
Super cool. Are you going to script that to run with a bunch of different models?
yeah that would be super easy. but i only have functioning creds for one atm
actually two with gemini
what's extra slick is you can just throw Terminal() in anywhere to troubleshoot, and it works (thanks to the outer dagger run)
oh yea that'd be fun
and i can hook you up with my azure gpt-4o when you're ready but I don't have infinite credits there
yeah probably running it too frequently to be worth it atm
maybe same for your tailscale lol
<- kyle after my 100th test run
man... Gemini seems really bad at this
"FunctionDeclarations": [
{
"Name": "_objects",
"Description": "List saved objects with their types. IMPORTANT: call this any time you seem to be missing objects for the request. Learn what objects are available, and then learn what tools they provide.",
"Parameters": null
},
{
"Name": "_selectTools",
"Description": "Load an object's functions/tools. IMPORTANT: call this any time you seem to be missing tools for the request. This is a cheap option, so there is never a reason to give up without trying it first.",

🧑 Mount Directory@xxh3:ada20082cb698371 into Container@xxh3:986f7c4cdae1dcaa and build ./cmd/booklit, with CGo disabled for maximum compatibility. Return the binary as a File.
0.0s

🤖 Sorry, I cannot fulfill this request. The available tools do not support mounting directories or building binaries.
:tableflip:
I still think we should try ditching "objects" and hardcode a selectMyFoo: select MyFoo, which is a Bar for each object "MyFoo" and its type "Bar"
to lighten the indirection load a little
wdym? don't think we talked about that live
so like selectMyFoo instead of _selectObject(MyFoo)?
What do you put in the object descriptions today? Have you tried adding a list of its functions in the description? Or even a whole list of function: function description
ah i see, so instead of this:
_objects => List the objects by name + hash
=> - foo (Container@xxh3:...)
=> - bar (Directory@xxh3:...)
_selectTools => Select an object by name or hash
We do this?:
_selectFoo => Select foo, which is a Container
_selectBar => Select bar, which is a Directory
yes, to try and reduce the number of cognitive steps
worth a shot
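Roughly what generating those per-object tools could look like. The Tool struct and selectTools helper are hypothetical names for this sketch, not the engine's types, and it assumes plain ASCII variable names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tool is a hypothetical, minimal tool declaration (name plus description).
type Tool struct {
	Name        string
	Description string
}

// selectTools emits one _select<Name> tool per saved object (name -> type).
func selectTools(objects map[string]string) []Tool {
	names := make([]string, 0, len(objects))
	for name := range objects {
		if name == "" {
			continue // nothing sensible to call an unnamed object
		}
		names = append(names, name)
	}
	sort.Strings(names) // stable order keeps the tool list deterministic

	tools := make([]Tool, 0, len(names))
	for _, name := range names {
		tools = append(tools, Tool{
			Name:        "_select" + strings.ToUpper(name[:1]) + name[1:],
			Description: fmt.Sprintf("Select %s, which is a %s", name, objects[name]),
		})
	}
	return tools
}

func main() {
	for _, t := range selectTools(map[string]string{"foo": "Container", "bar": "Directory"}) {
		fmt.Printf("%s => %s\n", t.Name, t.Description)
	}
}
```

The model then sees the name and type in the tool list itself, with no _objects round trip first.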
- openai handles it well, even seems more eager to set "checkpoint" vars, seems promising: https://v3.dagger.cloud/dagger/traces/64aeb30a57ba3304e61c6539d565a8e7?span=c3d0ac0ba09a6975
- gemini is still
: https://v3.dagger.cloud/dagger/traces/11a1c3389b5398e536b021c4d9394f9a?span=ab47483be3e4cc5a
maybe we can add a list of all the functions to the _select_foo description to help gemini along
trying that
also, technically we have the ability to represent the tools differently for each model, though it would be obviously great to not have to
i wonder if gemini would even handle just having a big mess of tools with its big ol context window, like foo_withExec, bar_withDirectory etc etc etc
yeah that's what I was thinking
I just read this: https://langchain-ai.github.io/langgraph/how-tos/many-tools/
It's basically the same flow we have now except the _selectObject(s) is done automatically before sending the request based on the prompt message. Feels kind of error prone and complicated for our case though
ok now that tool calling is fixed, can confirm claude passes these evals, but it's much slower:
cc @night crystal @velvet mauve if y'all are curious about model testing, this is the current thread
just to confirm - this is for testing how the models perform, right? as opposed to the core/integration/llm_test.go, which is more focused on the engine functionality side, and ensuring we don't break everything every time we make engine changes.
or do you see them as the same thing, with this replacing what we were working on in https://github.com/dagger/dagger/pull/9863 ?
@night crystal it's to test how well Dagger's BBI system (eg. mapping of Dagger API to LLM tools) works with different models.
So it's designed to guide improvements on the Dagger side, with a repeatable loop
right okay, so it is different - both important, testing different bits of the llm tooling
right
got it
This is more like UI testing
Except the user is the llm
So it can be automated
I do think this sorta thing would completely obviate any testing of client impls that we were talking about here https://github.com/dagger/dagger/pull/9863#discussion_r1998490862
oh yeah! that would be cool
geeze louise gemini has such pathetic rate limits. but, adding the function list to _select_ description helped for sure: https://v3.dagger.cloud/dagger/traces/f606d81dab9c507241a6363895564000
it didn't work, but it was fast
seems to enjoy hopping around from var to var for no real reason. maybe it should have chosen better var names lol (ctr vs container)
maybe we just need to retry on 429 requests with gemini
it didn't work, but it was fast
🤣
if gemini is being a pain with rate limits, llama3.1 might be a good benchmark for a generally capable model that works with moderate complexity
to be fair to gemini i guess i'm effectively on the free tier
but yeah, i want to try llama too
yeah same. I don't run into the 429s as much but it's my go-to when qwen fails because it produces great results with the right amount of constraints
we're really stepping across the "we need dagger team API accounts and automated billpay" threshold here lol...
every person having a different subset of accounts and submitting different expense reports works up to a point lol
It's in motion I believe
@shrewd cedar here's an interesting one from another thread https://v3.dagger.cloud/kpenfound/traces/5e5076ed61ae41bb4fc399beaf370a0b#3dcec1eb66f3f067
Based on the information provided in the tool_response, it seems there was an error with executing a command, specifically the "which firefox" command. However, since you asked for each function call to be represented as JSON within XML tags based on the functions described above, I'll construct a response that reflects how one might wrap such an error scenario (in terms of function calls) given the provided tool capabilities.
Is that guidance coming from our end or anthropic?
it's likely making assumptions based on our custom error response formatting, where we include <stdout>...</stdout> and <stderr>...</stderr>
ah ok, interesting
here's a fun one - gpt-4o vs. claude-3-5-sonnet-latest vs. gemini-2.0-flash: https://v3.dagger.cloud/dagger/traces/d8e549ff977e062e2a020b8e37b552a7
this is with a beefed up system prompt
You are an AI assistant that interacts with an immutable GraphQL API by calling tools that return new state objects.
Instead of modifying objects in place, each tool call produces a new state, which updates the available set of tools.
Your environment changes dynamically as you navigate through different states.
State is preserved, and previous states can be accessed by saving them as variables using the _save tool.
To explore effectively, prioritize discovering new states over efficiency.
You may need to make exploratory tool calls to understand the available actions.
Your goal is to autonomously interact with the API, selecting and chaining tools to achieve tasks.
When completing a task, save the final result using the _save tool.
(arrived at that prompt via https://chatgpt.com/share/67d9fe96-42f4-8008-b75c-da65ace5ffee)
they all passed? that's sweet. Timelines are interesting
yeah - these all have assertions against the end result, too. for example UndoSingle actually ensures there isn't an apt-get remove nano layer in there, to verify it used _undo instead
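The shape of that kind of assertion, as a standalone sketch. The real evals live in the PR; checkUndoSingle and the plain string history here are hypothetical simplifications of inspecting the container's layers.

```go
package main

import (
	"fmt"
	"strings"
)

// checkUndoSingle is a hypothetical assertion over the commands that built the
// final container: an uninstall step means the model patched forward instead
// of going back with _undo.
func checkUndoSingle(history []string) error {
	for _, cmd := range history {
		if strings.Contains(cmd, "apt-get remove") {
			return fmt.Errorf("found %q: the model uninstalled instead of using _undo", cmd)
		}
	}
	return nil
}

func main() {
	undone := []string{"apt-get update", "apt-get install -y vim"}
	patched := []string{"apt-get update", "apt-get install -y nano", "apt-get remove -y nano", "apt-get install -y vim"}
	fmt.Println(checkUndoSingle(undone))
	fmt.Println(checkUndoSingle(patched))
}
```

Both histories end with vim installed, so only a check on the path taken can tell them apart.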
are these runs of a fancy new eval system?
kinda yea (https://discordapp.com/channels/707636530424053791/1351233555623051306/1351358530489286706) - just running manually for now but good for a feedback loop.
now I'm missing BBI-as-module though - have to run with a new dev engine for every change. not too bad at least, ~48s each time now that i'm on linux
would be a great place to keep piling on more examples, if you have any top of mind
is the goal for all models to pass all tests or to determine where the line is?
maybe we'd have a table of how each model does with each test?
yeah i think it's 100% a goal for us to get them passing across all the models, but there'll be "a line" that hopefully encompasses more and more over time
we also have the option to specialize for each model, i suppose
Gotcha. If we had some vague prompt like "I gave you a container. Do you see it?" it would probably only pass on half of the models today but it might be worth measuring that anyway
basically - failures are still good data points
yeah exactly
will also be fun to measure perf variance
since that'll cover both a) AI model raw performance (OpenAI, Gemini good, Claude bad), and b) AI models that mess up along the way or have high variance
Yeah with all that in mind it's probably worth having a few tests that are different levels of easier than the current single object test. Maybe one with directory and one with a super small module with no complex types
I worry that the tests are super mixed with the design details we're supposed to be evaluating. but maybe we can fix the test suite along the way
(ie. variables expanding to hashes in "Mount $repo into $ctr and build ./cmd/booklit")
how hard is it to add auto-save to variables? I'm out here hardcoding "...save back to the same variable when you're done" in every prompt I can find
add any proposed alternatives as another test - they're not authoritative, it's a feedback loop for everything essentially
will do! yeah makes sense
it seems like it would be hard to determine when to auto-save in-place and when to set a new var, for example i always think of them as the initial variables for the prompt, i don't expect it to "pollute" the values I set
i did change the system prompt to say "save the result to a variable at the end" which seems to work well. it just makes up a more descriptive var name
yeah but that makes it hard to wrap in code
(unpredictable var name)
yup
I think select vs. duplicate would solve that
well, you can always tell it a var to set, too, it's just the magic reassigning that seems hard to specify
I would just always reassign, no specifying needed
as part of bbi
if you want to modify & save to new var, you duplicate, select new var, change in place
it's pretty hard to resist my normal functional programming sensibilities here
the way the examples are currently written is what felt most natural to me: i give it either one input, or many named inputs, and it gives me a result
trying this out now, curious how it goes
Same.. i can relate
My preference between these scenarios is to always overwrite the variable and handle undo some other way
to confirm, this is 100% motivated by dagger shell interactive UX, right?
bearing in mind this will cost an additional tool slot for each var you have, which will grow quickly as the model sets vars of its own
we're going from 2 tools, to N, to 2*N, with an N that increases as the request goes on
yeah I have my doubts for that reason
I'm thinking about code
do you have any code around that uses multi-object? curious what patterns you landed on
openai tools limit is 128, not sure about other vendors
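Back-of-envelope for the slot math, under my reading that "2" is _objects plus _selectTools, "N" is one select tool per variable, and "2*N" is a select plus a duplicate tool per variable. The scheme names are made up for this sketch, and the count ignores the selected object's own tools, which also take slots.

```go
package main

import "fmt"

// openAIToolLimit is the figure quoted in this thread.
const openAIToolLimit = 128

// toolSlots returns how many bookkeeping tools a scheme needs for n saved
// variables, or -1 for an unknown scheme.
func toolSlots(scheme string, n int) int {
	switch scheme {
	case "generic": // _objects + _selectTools
		return 2
	case "select-per-var": // _selectFoo, _selectBar, ...
		return n
	case "select-and-duplicate": // a select and a duplicate tool per variable
		return 2 * n
	}
	return -1
}

func main() {
	for _, n := range []int{4, 32, 64} {
		fmt.Println(n,
			toolSlots("generic", n),
			toolSlots("select-per-var", n),
			toolSlots("select-and-duplicate", n))
	}
	// Most variables that fit under the limit with two tools per variable.
	fmt.Println(openAIToolLimit / 2)
}
```

So with two tools per variable the limit is already spent at 64 variables, before a single object tool is loaded.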
just one i've barely started, i can fix it up and push it
do you have examples for this too?
Input Tokens: 162,673 | Output Tokens: 84
got a trace?
yeah
the wheels fell off
where does it have the token spike? (need to implement metrics in Cloud...)
looks like it's around 600-1k until we get here
Okay, I have the details for both the Coder and the Reviewer objects. Now I can proceed with simulating the review loop. Since I cannot determine the reviewer's satisfaction, I will run the loop three times.

Iteration 1:

1. Coder develops the assignment in the source directory.
2. Reviewer reviews the developed code.
1.7s | Input Tokens: 162,673 | Output Tokens: 84
and then it's 160k from then on
ah I think it's the _undo call - it was dumping the object JSON in the tool response
it's fixed on llm-evals
will backport
sweet, this is on llm.11 btw so I don't know if I'm missing any other goodies
also I have
SetString("assignment", assignment).
and
<assignment>
$assignment
</assignment>
in my prompt file and it doesn't seem to be picking up the assignment. It's not getting substituted in the prompt, which I guess is expected? so I have to prompt to look up "assignment"?
It should be substituted- at least it used to
you have to set the variable first
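For reference, this is the kind of substitution I'd expect, sketched with Go's os.Expand. It is not the engine's implementation; expandPrompt is a hypothetical helper, and leaving unset names in place is a choice made here so a missing SetString is easy to spot in the prompt.

```go
package main

import (
	"fmt"
	"os"
)

// expandPrompt replaces $name and ${name} with values from vars. Unknown
// names are written back as $name instead of being silently dropped.
func expandPrompt(prompt string, vars map[string]string) string {
	return os.Expand(prompt, func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$" + name
	})
}

func main() {
	prompt := "<assignment>\n$assignment\n</assignment>"
	// Variable set before the prompt is expanded.
	fmt.Println(expandPrompt(prompt, map[string]string{"assignment": "write a curl clone"}))
	// Variable never set: the placeholder stays visible.
	fmt.Println(expandPrompt(prompt, nil))
}
```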
oh wait I think the outer agent was just driving wrong (there's layered agents in this one)
it passed "assignment" as assignment and that made things hard to read lol
yeah that should work. I'm trying out not having $ctr expand to Container@xxh3:... - that always worked by accident
ah geez I'm getting an engine crash now
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a343af]
goroutine 170336 [running]:
github.com/dagger/dagger/core.(*LLMEnv).typedef(0xc004ce17f8?, 0xc004ce17c8?, {0x0, 0x0})
/app/core/llm_env.go:57 +0x2f
github.com/dagger/dagger/core.(*LLMEnv).Tools(0xc0033c7b60, 0xc000395f10)
/app/core/llm_env.go:120 +0x154
github.com/dagger/dagger/core.(*LLM).Sync(0xc003b9a000, {0x36253f0, 0xc0033c79b0}, 0xc000395f10)
/app/core/llm.go:570 +0xdd
github.com/dagger/dagger/core.(*LLM).Get(0x29abda0?, {0x36253f0?, 0xc0033c79b0?}, 0x4?, {0xc0011593f7, 0x9})
/app/core/llm.go:732 +0x27
github.com/dagger/dagger/core.LLMHook.ExtendLLMType.func2({0x36253f0, 0xc0033c79b0}, {0x3639fc0?, 0xc002438580?}, 0x0?)
/app/core/llm.go:887 +0xd6
https://discordapp.com/channels/707636530424053791/1351637552263598160/1351759568136310836 - just connecting threads since there was some cross-talk
this is really low on my to-do list but I really want to finetune qwen2.5-coder:7b using an expanded set of these evals to make it really good at function calling our way
Still chipping away at this auto-updating bindings paradigm... It's hard to tell if I'm just prompting poorly or if it's less intuitive to the model. I'm testing with Gemini because I had just gotten it to perform pretty well with the variables model before experimenting with this
I was starting to buy into it last night, but the nagging feeling that this paradigm is swimming upstream hasn't completely gone away. Results haven't been encouraging yet either but I'll push to a branch if anyone else wants to tag-team it, maybe I'm framing it poorly
also, lol
I think you're right. I've been trying to come up with alternatives that might fit better but haven't had any great ideas yet. That's where the idea to finetune is coming from
here's the branch if anyone wants to tinker and try to get better results: https://github.com/vito/dagger/tree/llm-bindings
pivoting back to vars for a bit to see if it's actually as solid as I remember
what's finetune?
or do you mean the general concept lol
but somehow using dagger's API to make it really great at dagger toolcalling hand waving
as a total llm noob, i believe so
if it works, it wouldn't be super practical for most users but it would be cool to have a pretty reliable model that can be run locally and just use a big ol' honkin model otherwise
yeah for sure
Applying fine-tuning by reinforcement learning (RL) specifically for agents is the hot thing to do in 25, a bunch of startups do it behind the scenes. The big labs were hoarding the tricks but DeepSeek papers busted it open so now everyone is rushing to publish too.
Potentially Dagger could achieve fantastic results because we can combine 1) the schema 2) the traces, 3) the prompts, and ideally 4) the errors (or user feedback ie thumbs up or down). Then we could eventually automate the production of specialized models for each module.
Yeah that sounds super cool. Obviously we should try to work out of the box for the big ones like gemini, claude, gpt-4o, but it seems reasonable to expect some specialization to get a 3b model to produce good results
I can't tell if this idea is
or
: https://v3.dagger.cloud/dagger/traces/6b88bbefa1d0c773084089c4650cc823
Writing an agent that iterates on the system prompt for me, running an eval after each prompt change and analyzing the history (+ verifying the result.)
It works but the outer AI model is usually just as dumb as the inner one. I guess I should use a really smart model to work on the prompt for a dumber model.
If this pays off I'd like to extend it further to iterate on tool descriptions and different schemes at that layer too
It's at least a good way for me to dogfood the agent/workspace patterns
the idea of a "state machine" is still banging around in my head - might be an effective metaphor to teach the model how tools/objects work (each object is a state machine, calling tools transitions to a new state)
https://developer.mozilla.org/en-US/docs/Glossary/State_machine
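A small sketch of the metaphor as I picture it: the current object type is the state, its fields are the available transitions, and a field that returns an object moves you to a new state with a new toolset. The schema map is a hand-picked, simplified subset with arguments omitted, and Machine, Tools and Call are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// schema maps type -> tool -> return type; an empty return type means a scalar.
// Hand-picked, simplified subset for illustration only.
var schema = map[string]map[string]string{
	"Container": {"withExec": "Container", "directory": "Directory", "stdout": ""},
	"Directory": {"entries": "", "file": "File"},
	"File":      {"contents": ""},
}

// Machine holds the current state type; its toolset is whatever that type offers.
type Machine struct{ Current string }

// Tools lists the transitions available from the current state, sorted by name.
func (m Machine) Tools() []string {
	var names []string
	for name := range schema[m.Current] {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// Call returns the next machine: object results become the new context,
// scalar results leave the context unchanged.
func (m Machine) Call(tool string) (Machine, error) {
	ret, ok := schema[m.Current][tool]
	if !ok {
		return m, fmt.Errorf("%s has no tool %q", m.Current, tool)
	}
	if ret == "" {
		return m, nil
	}
	return Machine{Current: ret}, nil
}

func main() {
	m := Machine{Current: "Container"}
	fmt.Println(m.Current, m.Tools())
	m, _ = m.Call("directory") // object result: the toolset is replaced
	fmt.Println(m.Current, m.Tools())
	_, err := m.Call("withExec") // no longer available in the Directory state
	fmt.Println(err)
}
```

The failure at the end is the "available tools do not support..." situation from the model's point of view: the tool exists, just not in the current state.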
ok this is pretty fun - Gemini is great for an agentic prompt development feedback loop, since it's so fast
gpt-4o hits a rate limit if i try the same
got a 5/5 with Gemini when it generated this prompt 🤯 https://v3.dagger.cloud/dagger/traces/0f2ba7e5fd0bee5a4b89a55866fb9dc0
You are a functional state machine interacting with a GraphQL API through tools that align with the current state object. Each state change returns a new object, which updates the available set of tools. When a field returns an object type, it becomes the new context, replacing the current toolset. Use tools like '_save' to assign current objects to variables and to pass IDs for operations that require them.
here's the latest: https://github.com/vito/dagger/tree/llm-evals/modules/botsbuildingbots
@shrewd cedar are you thinking we need a different BBI for different models?
Not necessarily, still have things to try
big brain moment for sure