#multi-object prompt/metaphor engineering

shrewd cedar
#

(took a liberty with thread title here)

Yeah sure!

I'm a bit worried that our metaphor isn't coming across perfectly to the model. I tried a lot of subtle prompt engineering on Friday to suss out whether it truly understands that it can hop between tools and treat tool call results as restorable checkpoints, and that there's a "current object/state/context" that its tool calls are always relative to. I wasn't having much luck at all.

For example, despite already being in the context of a Container and being clearly told so (via yelling at it and teaching it on-the-fly), it would always call Container.from by chaining from the current Container instead of going back to a previous checkpoint or starting from scratch.

I even tried things like naming tools after their object: Container_xxh3-abcdef_from.

Having a few evals/scenarios (and agreeing on the ones we care about) would go a long way.

#

But this was all just with that toy example:

  1. give me a container for PHP dev
  2. install an editor (it chooses vim or nano)
  3. instead of installing (vim or nano), go back to before you installed it and install (whatever) instead

If it really understood what was going on, I figure it should call _selectTools or whatever to swap back to the result of step 1 and install the new editor. And ideally step 3 would require minimal hints to figure that out, but that's just a bonus
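A minimal sketch of the behavior expected in step 3, assuming a toy immutable Container where every call returns a new value and earlier results stay usable as checkpoints. The Container type and WithExec below are stand-ins written for illustration, not the real Dagger API:

```go
package main

import "fmt"

// Container is a toy immutable value: each call returns a new Container
// and leaves the receiver untouched. It is a stand-in for illustration only.
type Container struct {
	layers []string
}

// WithExec returns a copy of c with one more layer appended.
func (c Container) WithExec(cmd string) Container {
	next := make([]string, len(c.layers), len(c.layers)+1)
	copy(next, c.layers)
	return Container{layers: append(next, cmd)}
}

func main() {
	// Step 1: a container for PHP dev. Keep the result as a checkpoint.
	step1 := Container{}.WithExec("apt-get install -y php")

	// Step 2: install an editor on top of step 1.
	step2 := step1.WithExec("apt-get install -y nano")

	// Step 3: go back to the step 1 checkpoint instead of chaining from step 2.
	step3 := step1.WithExec("apt-get install -y vim")

	fmt.Println("step 2:", step2.layers)
	fmt.Println("step 3:", step3.layers)
}
```

The point of the sketch is that step 3 branches from step1, so the nano layer never appears in the final result. That is what swapping back to an earlier tool call result should achieve.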

I tried making the tool call results more explicit, like "Switched context to Container@xxh3:abcdef" instead of just "Container@xxh3:abcdef", but it seemed really resistant to using the result. It would only use it if I explicitly told it "look at the tool call result".

#

Alternatively we could do something akin to what you were suggesting: updating variables in-place as the result changes. My only hesitation there is that you could easily end up with something like go=$(container) turning into a Directory or some other type depending on which APIs the model calls, which would make variables in the shell change in confusing ways.

But, there may still be something to that idea. Maybe variables are the wrong metaphor, and really they're just stateful slots or something abstract that an AI might grok more easily than a human.

I'll stop wording now.

astral star
#

@shrewd cedar maybe/kinda related, but are you able to repro this? I also had this problem: #agents message

shrewd cedar
astral star
#

still a few more meetings...

#

@shrewd cedar the tool descriptions themselves go straight into the context so they are high leverage

astral star
#

Back

astral star
#

@shrewd cedar if not too late, I'm around to talk live whenever

#

(can also wait if you're in the zone)

shrewd cedar
#

am around 👍

#

dev audio?

shrewd cedar
#

@astral star btw - I'm thinking that Loop() actually serves a purpose and shouldn't be deprecated in favor of Sync

    return dag.Llm().
        SetContainer("ctr", dag.Container()).
        WithPrompt("give me a container for PHP development").
        Loop().
        WithPrompt("now install an editor").
        Container()

without Loop we just send both prompts immediately, instead of doing A and then B, which has very different behavior

if I have to use Sync(ctx) that means I have to "break the chain"

low conviction, mostly sparking convo, it's been in the back of my head

#

or should WithPrompt imply a sync before each one?
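To make the difference concrete, here is a toy model of the two chains. The Chain type is hypothetical and only models "queue prompts, then flush them to the model as one turn"; it is not the Dagger LLM API and makes no claim about how Loop or Sync are implemented:

```go
package main

import "fmt"

// Chain is a toy stand-in for a lazy prompt builder (hypothetical).
type Chain struct {
	pending []string   // prompts queued since the last flush
	turns   [][]string // each flush sends one batch of prompts to the model
}

// WithPrompt queues a prompt without sending anything yet.
func (c Chain) WithPrompt(p string) Chain {
	c.pending = append(append([]string{}, c.pending...), p)
	return c
}

// Loop flushes the queued prompts as one turn, so that later prompts
// are sent only after this turn has completed.
func (c Chain) Loop() Chain {
	if len(c.pending) == 0 {
		return c
	}
	c.turns = append(append([][]string{}, c.turns...), c.pending)
	c.pending = nil
	return c
}

func main() {
	a := "give me a container for PHP development"
	b := "now install an editor"

	together := Chain{}.WithPrompt(a).WithPrompt(b).Loop()
	inOrder := Chain{}.WithPrompt(a).Loop().WithPrompt(b).Loop()

	fmt.Println(len(together.turns), "turn(s) without the intermediate Loop")
	fmt.Println(len(inOrder.turns), "turn(s) with it")
}
```

In this model, having WithPrompt imply a sync would amount to calling Loop at the start of every WithPrompt, which keeps the chain unbroken at the cost of making batching impossible.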

remote linden
#

let me know if you need any data points for model behavior here.

This is llama3.2 (small and relatively dumb, 3b)
single object without chaining works fine: llm | with-directory $source | with-prompt "you have a directory. what are its contents?" | last-reply (proceeds to do directory.entries(/) and show the result)

chaining works fine:

llm | with-directory $source | with-prompt "you have a directory. what are the contents of /README.md?" | last-reply 12.2s
│:adult: you have a directory. what are the contents of /README.md?

(proceeds to show contents of the file)

multi object fails hard:

> does my app test and build?

Based on the tool call response, it appears that there is an error in your app testing and building process.
│ ┃
│ ┃ The error message "object not found: HelloDagger_id" suggests that there was a problem trying to identify or locate a file or resource named "HelloDagger-id".
│ ┃
│ ┃ Additionally, the hash  xxh3:e17a70fccb0ec71a  appears to be an unknown object in your environment.
│ ┃
│ ┃ Therefore, I would recommend checking the following:
...
shrewd cedar
#

found a pretty sweet pattern for writing + testing evals, will push to a branch for now

#

on this branch I re-enabled the multi-object tools at all times, instead of having them opt-in when you set a var, since the UndoSingle example consistently uses _undo, which seems nice

#

feel free to push more

remote linden
#

Super cool. Are you going to script that to run with a bunch of different models?

shrewd cedar
#

yeah that would be super easy. but i only have functioning creds for one atm 😛

#

actually two with gemini

#

what's extra slick is you can just throw Terminal() in anywhere to troubleshoot, and it works (thanks to the outer dagger run)

remote linden
#

awesome

#

if you need access to my tailscale machine lmk!

shrewd cedar
#

oh yea that'd be fun

remote linden
#

and i can hook you up with my azure gpt-4o when you're ready but I don't have infinite credits there

shrewd cedar
#

yeah probably running it too frequently to be worth it atm

#

maybe same for your tailscale lol

remote linden
#

that one should be fine

#

it's pretty underutilized right now

shrewd cedar
#

thisisfine <- kyle after my 100th test run

shrewd cedar
#

man... Gemini seems really bad at this

#
      "FunctionDeclarations": [
        {
          "Name": "_objects",
          "Description": "List saved objects with their types. IMPORTANT: call this any time you seem to be missing objects for the request. Learn what objects are available, and then learn what tools they provide.",
          "Parameters": null
        },
        {
          "Name": "_selectTools",
          "Description": "Load an object's functions/tools. IMPORTANT: call this any time you seem to be missing tools for the request. This is a cheap option, so there is never a reason to give up without trying it first.",
│ │🧑 Mount Directory@xxh3:ada20082cb698371 into Container@xxh3:986f7c4cdae1dcaa and build ./cmd/booklit, with CGo disabled for maximum compatibility. Return the
│ │ ┃ binary as a File.
│ │ ┃ 0.0s
│ │
│ │🤖 Sorry, I cannot fulfill this request. The available tools do not support mounting directories or building binaries.

:tableflip:

astral star
#

I still think we should try ditching "objects" and hardcode a selectMyFoo: select MyFoo, which is a Bar for each object "MyFoo" and its type "Bar"

#

to lighten the indirection load a little

shrewd cedar
#

wdym? don't think we talked about that live

astral star
#

No I forgot about that today 🙂 But just came back to me

#

Severance moment

remote linden
#

so like selectMyFoo instead of _selectObject(MyFoo)?

#

What do you put in the object descriptions today? Have you tried adding a list of its functions in the description? or even a whole list of function: function description

shrewd cedar
astral star
shrewd cedar
#

worth a shot

shrewd cedar
#

maybe we can add a list of all the functions to the _select_foo description to help gemini along :thinkspin: trying that

#

also, technically we have the ability to represent the tools differently for each model, though it would be obviously great to not have to

#

i wonder if gemini would even handle just having a big mess of tools with its big ol context window, like foo_withExec, bar_withDirectory etc etc etc

remote linden
shrewd cedar
#

cc @night crystal @velvet mauve if y'all are curious about model testing, this is the current thread

night crystal
#

just to confirm - this is for testing how the models perform, right? as opposed to core/integration/llm_test.go, which is more focused on the engine functionality side and on ensuring we don't break everything every time we make engine changes.

or do you see them as the same thing, with this replacing what we were working on in https://github.com/dagger/dagger/pull/9863 ?

astral star
#

@night crystal it's to test how well Dagger's BBI system (e.g. the mapping of the Dagger API to LLM tools) works with different models.

#

So it's designed to guide improvements on the Dagger side, with a repeatable loop

night crystal
#

right okay, so it is different - both important, testing different bits of the llm tooling

astral star
#

right

night crystal
#

got it 🎉

astral star
#

This is more like UI testing 🙂

#

Except the user is the llm

#

So it can be automated

night crystal
#

chefkiss big brain moment for sure

#

it's neat, i like it 😄

night crystal
#

oh yeah! that would be cool 😄

shrewd cedar
#

seems to enjoy hopping around from var to var for no real reason. maybe it should have chosen better var names lol (ctr vs container)

#

maybe we just need to retry on 429 requests with gemini

remote linden
#

it didn't work, but it was fast

🤣

#

if gemini is being a pain with rate limits, llama3.1 might be a good benchmark for a generally capable model that works with moderate complexity

shrewd cedar
#

to be fair to gemini i guess i'm effectively on the free tier

#

but yeah, i want to try llama too

remote linden
#

yeah same. I don't run into the 429s as much but it's my go-to when qwen fails because it produces great results with the right amount of constraints

velvet mauve
#

every person having a different subset of accounts and submitting different expense reports works up to a point lol

astral star
#

It's in motion I believe

remote linden
#

@shrewd cedar here's an interesting one from another thread https://v3.dagger.cloud/kpenfound/traces/5e5076ed61ae41bb4fc399beaf370a0b#3dcec1eb66f3f067

Based on the information provided in the tool_response, it seems there was an error with executing a command, specifically the “which firefox” command. However, since you asked for each function call to be represented as JSON within XML tags based on the functions described above, I’ll construct a response that reflects how one might wrap such an error scenario (in terms of function calls) given the provided tool capabilities.

Is that guidance coming from our end or anthropic?

shrewd cedar
remote linden
#

ah ok, interesting

shrewd cedar
#

this is with a beefed up system prompt

#

You are an AI assistant that interacts with an immutable GraphQL API by calling tools that return new state objects.
Instead of modifying objects in place, each tool call produces a new state, which updates the available set of tools.
Your environment changes dynamically as you navigate through different states.

State is preserved, and previous states can be accessed by saving them as variables using the _save tool.
To explore effectively, prioritize discovering new states over efficiency.
You may need to make exploratory tool calls to understand the available actions.

Your goal is to autonomously interact with the API, selecting and chaining tools to achieve tasks.
When completing a task, save the final result using the _save tool.
(arrived at that prompt via https://chatgpt.com/share/67d9fe96-42f4-8008-b75c-da65ace5ffee)

remote linden
#

they all passed? that's sweet. Timelines are interesting

shrewd cedar
#

yeah - these all have assertions against the end result, too. for example UndoSingle actually ensures there isn't an apt-get remove nano layer in there, to verify it used _undo instead
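As an illustration of that kind of end-result assertion, here is a hypothetical check in the spirit of UndoSingle. It assumes the eval can obtain the list of commands that produced each layer of the final container; the function name and input shape are made up for this sketch and are not taken from the branch:

```go
package main

import (
	"fmt"
	"strings"
)

// checkUndoSingle fails if the final container shows the editor being
// removed after the fact instead of the install being undone.
func checkUndoSingle(layers []string) error {
	for _, cmd := range layers {
		if strings.Contains(cmd, "apt-get remove") {
			return fmt.Errorf("found %q: model removed the package instead of using _undo", cmd)
		}
	}
	return nil
}

func main() {
	undone := []string{"apt-get install -y php", "apt-get install -y vim"}
	removed := []string{
		"apt-get install -y php",
		"apt-get install -y nano",
		"apt-get remove -y nano",
		"apt-get install -y vim",
	}

	fmt.Println("undo path:", checkUndoSingle(undone))
	fmt.Println("remove path:", checkUndoSingle(removed))
}
```

Both paths end with vim installed, so only an assertion on the history (not on the final filesystem) can tell them apart.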

astral star
#

are these runs of a fancy new eval system? 🙂

shrewd cedar
#

would be a great place to keep piling on more examples, if you have any top of mind

remote linden
#

is the goal for all models to pass all tests or to determine where the line is?

#

maybe we'd have a table of how each model does with each test? ✅ ❌
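A small sketch of how such a table could be rendered from collected results. The renderTable helper is hypothetical, and the model names and pass/fail values in main are placeholders, not measurements:

```go
package main

import (
	"fmt"
	"strings"
)

// renderTable prints one row per eval and one column per model.
// results[eval][model] is true when that model passed that eval;
// missing entries count as failures.
func renderTable(evals, models []string, results map[string]map[string]bool) string {
	var b strings.Builder
	b.WriteString("eval")
	for _, m := range models {
		b.WriteString("\t" + m)
	}
	b.WriteString("\n")
	for _, e := range evals {
		b.WriteString(e)
		for _, m := range models {
			mark := "FAIL"
			if results[e][m] {
				mark = "PASS"
			}
			b.WriteString("\t" + mark)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	// Placeholder data only: these are not real eval results.
	evals := []string{"UndoSingle", "BuildMulti"}
	models := []string{"model-a", "model-b"}
	results := map[string]map[string]bool{
		"UndoSingle": {"model-a": true, "model-b": true},
		"BuildMulti": {"model-a": true},
	}
	fmt.Print(renderTable(evals, models, results))
}
```

Recording failures explicitly, rather than only listing passes, keeps the "line" between models visible as it moves over time.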

shrewd cedar
#

yeah i think it's 100% a goal for us to get them passing across all the models, but there'll be "a line" that hopefully encompasses more and more over time

#

we also have the option to specialize for each model, i suppose

remote linden
#

Gotcha. If we had some vague prompt like "I gave you a container. Do you see it?" it would probably only pass on half of the models today but it might be worth measuring that anyway

#

basically - failures are still good data points

shrewd cedar
#

yeah exactly

#

will also be fun to measure perf variance

#

since that'll cover both a) AI model raw performance (OpenAI, Gemini good, Claude bad), and b) AI models that mess up along the way or have high variance

remote linden
#

Yeah with all that in mind it's probably worth having a few tests that are different levels of easier than the current single object test. Maybe one with directory and one with a super small module with no complex types

astral star
#

I worry that the tests are super mixed with the design details we're supposed to be evaluating. but maybe we can fix the test suite along the way

#

(i.e. variables expanding to hashes in "Mount $repo into $ctr and build ./cmd/booklit")

#

how hard is it to add auto-save to variables? I'm out here hardcoding "...save back to the same variable when you're done" in every prompt I can find

shrewd cedar
#

i did change the system prompt to say "save the result to a variable at the end" which seems to work well. it just makes up a more descriptive var name

astral star
#

(unpredictable var name)

shrewd cedar
#

yup

astral star
shrewd cedar
#

well, you can always tell it a var to set, too, it's just the magic reassigning that seems hard to specify

astral star
#

I would just always reassign, no specifying needed

#

as part of bbi

#

if you want to modify & save to new var, you duplicate, select new var, change in place

shrewd cedar
#

it's pretty hard to resist my normal functional programming sensibilities here

#

the way the examples are currently written is what felt most natural to me: i give it either one input, or many named inputs, and it gives me a result

shrewd cedar
remote linden
#

My preference between these scenarios is to always overwrite the variable and handle undo some other way

shrewd cedar
#

to confirm, this is 100% motivated by dagger shell interactive UX, right?

shrewd cedar
#

we're going from 2 tools, to N, to 2*N, with an N that increases as the request goes on

remote linden
#

yeah I have my doubts for that reason

remote linden
shrewd cedar
#

do you have any code around that uses multi-object? curious what patterns you landed on

#

openai tools limit is 128, not sure about other vendors

remote linden
#

just one i've barely started, i can fix it up and push it

shrewd cedar
remote linden
shrewd cedar
#

got a trace?

remote linden
#

yeah

#

the wheels fell off

shrewd cedar
#

where does it have the token spike? (need to implement metrics in Cloud...)

remote linden
#

looks like it's around 600-1k until we get here

Okay, I have the details for both the  Coder  and the  Reviewer  objects. Now I can proceed with simulating the review loop. Since I cannot determine the reviewer's satisfaction, I
│ ┃ will run the loop three times.
│ ┃
│ ┃ Iteration 1:
│ ┃
│ ┃ 1. Coder develops the assignment in the source directory.
│ ┃ 2. Reviewer reviews the developed code.
│ ┃ 1.7s ◆ Input Tokens: 162,673 ◆ Output Tokens: 84

and then it's 160k from then on

shrewd cedar
#

ah I think it's the _undo call - it was dumping the object JSON in the tool response
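To show why that matters for token counts, here is a sketch contrasting a tool response that serializes the whole object with one that returns only a short reference. The Object type and both helpers are hypothetical; the Type@digest format follows the Container@xxh3:abcdef references seen earlier in the thread:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Object is a hypothetical handle with a potentially large serialized body.
type Object struct {
	Type   string            `json:"type"`
	Digest string            `json:"digest"`
	Fields map[string]string `json:"fields"` // stands in for a big object payload
}

// verboseResult returns the whole object as JSON. A result like this stays
// in the message history and is re-sent to the model on every later turn.
func verboseResult(o Object) (string, error) {
	b, err := json.Marshal(o)
	return string(b), err
}

// compactResult returns only a short reference such as "Container@xxh3:abcdef".
func compactResult(o Object) string {
	return o.Type + "@" + o.Digest
}

func main() {
	o := Object{Type: "Container", Digest: "xxh3:abcdef", Fields: map[string]string{}}
	for i := 0; i < 1000; i++ {
		o.Fields[fmt.Sprintf("field%d", i)] = "some serialized value"
	}

	full, err := verboseResult(o)
	if err != nil {
		panic(err)
	}
	fmt.Println("verbose bytes:", len(full))
	fmt.Println("compact:", compactResult(o))
}
```

Because the history is replayed on each request, one oversized tool response raises the input token count for every turn that follows, which matches the jump from around 1k to 160k described above.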

#

it's fixed on llm-evals

#

will backport

remote linden
#

sweet, this is on llm.11 btw so I don't know if I'm missing any other goodies

#

also I have
SetString("assignment", assignment).
and

<assignment>
$assignment
</assignment>

in my prompt file and it doesn't seem to be picking up the assignment. It's not getting substituted in the prompt, which I guess is expected? so I have to prompt to look up "assignment"?

astral star
#

you have to set the variable first
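As a plain Go illustration of why the binding has to exist before the prompt is expanded, here is a sketch using os.Expand from the standard library. This is not how Dagger substitutes prompt variables; expandPrompt is a hypothetical helper that only demonstrates the general semantics of $name substitution:

```go
package main

import (
	"fmt"
	"os"
)

// expandPrompt substitutes $name and ${name} references with values from vars.
// Names that are not set expand to the empty string.
func expandPrompt(prompt string, vars map[string]string) string {
	return os.Expand(prompt, func(name string) string { return vars[name] })
}

func main() {
	prompt := "<assignment>\n$assignment\n</assignment>"

	// Variable set before expansion: the value lands inside the tags.
	fmt.Println(expandPrompt(prompt, map[string]string{
		"assignment": "write a hello world in Go",
	}))

	// Variable not set: the reference silently disappears.
	fmt.Println(expandPrompt(prompt, nil))
}
```

The second case is the one that is hard to spot in a trace, since nothing errors and the prompt just arrives with an empty block.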

remote linden
#

oh wait I think the outer agent was just driving wrong (there's layered agents in this one)

#

it passed "assignment" as assignment and that made things hard to read lol

shrewd cedar
remote linden
#

ah geez I'm getting an engine crash now

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a343af]

goroutine 170336 [running]:
github.com/dagger/dagger/core.(*LLMEnv).typedef(0xc004ce17f8?, 0xc004ce17c8?, {0x0, 0x0})
        /app/core/llm_env.go:57 +0x2f
github.com/dagger/dagger/core.(*LLMEnv).Tools(0xc0033c7b60, 0xc000395f10)
        /app/core/llm_env.go:120 +0x154
github.com/dagger/dagger/core.(*LLM).Sync(0xc003b9a000, {0x36253f0, 0xc0033c79b0}, 0xc000395f10)
        /app/core/llm.go:570 +0xdd
github.com/dagger/dagger/core.(*LLM).Get(0x29abda0?, {0x36253f0?, 0xc0033c79b0?}, 0x4?, {0xc0011593f7, 0x9})
        /app/core/llm.go:732 +0x27
github.com/dagger/dagger/core.LLMHook.ExtendLLMType.func2({0x36253f0, 0xc0033c79b0}, {0x3639fc0?, 0xc002438580?}, 0x0?)
        /app/core/llm.go:887 +0xd6
shrewd cedar
remote linden
#

this is really low on my to-do list but I really want to finetune qwen2.5-coder:7b using an expanded set of these evals to make it really good at function calling our way

shrewd cedar
#

Still chipping away at this auto-updating bindings paradigm... It's hard to tell if I'm just prompting poorly or if it's less intuitive to the model. I'm testing with Gemini because I had just gotten it to perform pretty well with the variables model before experimenting with this

#

I was starting to buy into it last night, but the nagging feeling that this paradigm is swimming upstream hasn't completely gone away. Results haven't been encouraging yet either but I'll push to a branch if anyone else wants to tag-team it, maybe I'm framing it poorly

#

also, lol

remote linden
shrewd cedar
#

pivoting back to vars for a bit to see if it's actually as solid as I remember

shrewd cedar
#

or do you mean the general concept lol

remote linden
#

but somehow using dagger's API to make it really great at dagger toolcalling hand waving

shrewd cedar
#

oh cool. is that like a process for making a specialized model?

#

(sorta)

remote linden
#

as a total llm noob, i believe so

#

if it works, it wouldn't be super practical for most users but it would be cool to have a pretty reliable model that can be run locally and just use a big ol' honkin model otherwise

shrewd cedar
#

yeah for sure

astral star
#

Applying fine-tuning by reinforcement learning (RL) specifically for agents is the hot thing to do in 25, a bunch of startups do it behind the scenes. The big labs were hoarding the tricks but DeepSeek papers busted it open so now everyone is rushing to publish too.

Potentially Dagger could achieve fantastic results because we can combine 1) the schema 2) the traces, 3) the prompts, and ideally 4) the errors (or user feedback ie thumbs up or down). Then we could eventually automate the production of specialized models for each module.

remote linden
#

Yeah that sounds super cool. Obviously we should try to work out of the box for the big ones like gemini, claude, gpt-4o, but it seems reasonable to expect some specialization to get a 3b model to produce good results

shrewd cedar
#

I can't tell if this idea is bigbrain or FrogeDumb: https://v3.dagger.cloud/dagger/traces/6b88bbefa1d0c773084089c4650cc823

Writing an agent that iterates on the system prompt for me, running an eval after each prompt change and analyzing the history (+ verifying the result.)

It works but the outer AI model is usually just as dumb as the inner one. I guess I should use a really smart model to work on the prompt for a dumber model.

If this pays off I'd like to extend it further to iterate on tool descriptions and different schemes at that layer too
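The outer loop can be sketched roughly as follows. Everything here is a stand-in: optimize, Attempt, and the stubbed runEval and revise functions are hypothetical, and the stub eval in main only pretends to score prompts so the example runs without a model:

```go
package main

import (
	"fmt"
	"strings"
)

// Attempt records one system prompt and how many eval runs it passed.
type Attempt struct {
	Prompt string
	Passed int
	Total  int
}

// optimize runs the eval, records the score, and asks revise for a new
// prompt until every run passes or the round budget is spent.
// It returns the best attempt seen.
func optimize(initial string, rounds int,
	runEval func(prompt string) (passed, total int),
	revise func(history []Attempt) string) Attempt {

	best := Attempt{Prompt: initial, Passed: -1}
	var history []Attempt
	prompt := initial
	for i := 0; i < rounds; i++ {
		passed, total := runEval(prompt)
		a := Attempt{Prompt: prompt, Passed: passed, Total: total}
		history = append(history, a)
		if passed > best.Passed {
			best = a
		}
		if passed == total {
			break
		}
		prompt = revise(history)
	}
	return best
}

func main() {
	// Stub eval: pretends a prompt passes more runs when it mentions key ideas.
	// A real eval would run the inner model and assert on the result.
	hints := []string{"immutable", "_save", "state"}
	runEval := func(prompt string) (int, int) {
		passed := 0
		for _, h := range hints {
			if strings.Contains(prompt, h) {
				passed++
			}
		}
		return passed, len(hints)
	}
	// Stub reviser: a real one would hand the history to a stronger model.
	revise := func(history []Attempt) string {
		last := history[len(history)-1]
		return last.Prompt + " " + hints[last.Passed]
	}

	best := optimize("You are an assistant.", 5, runEval, revise)
	fmt.Printf("%d/%d: %s\n", best.Passed, best.Total, best.Prompt)
}
```

Keeping revise as a separate function is what would allow a smarter model to write prompts for a dumber one: the two callbacks can be backed by different models.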

#

It's at least a good way for me to dogfood the agent/workspace patterns

shrewd cedar
#

the idea of a "state machine" is still banging around in my head - might be an effective metaphor to teach the model how tools/objects work (each object is a state machine, calling tools transitions to a new state)

https://developer.mozilla.org/en-US/docs/Glossary/State_machine

MDN Web Docs

A state machine is a mathematical abstraction used to design algorithms. A state machine reads a set of inputs and changes to a different state based on those inputs.

shrewd cedar
#

gpt-4o hits a rate limit if i try the same

#

got a 5/5 with Gemini when it generated this prompt 🤯 https://v3.dagger.cloud/dagger/traces/0f2ba7e5fd0bee5a4b89a55866fb9dc0

You are a functional state machine interacting with a GraphQL API through tools that align with the current state object. Each state change returns a new object, which updates the available set of tools. When a field returns an object type, it becomes the new context, replacing the current toolset. Use tools like '_save' to assign current objects to variables and to pass IDs for operations that require them.
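The model that prompt describes can be sketched as a tiny state machine in which the current object type determines the available tools. The tools table below uses illustrative type and tool names, not the real Dagger schema, and call is a hypothetical helper:

```go
package main

import "fmt"

// tools maps each object type to the tools it offers, and each tool to the
// type it returns. Names are illustrative only.
var tools = map[string]map[string]string{
	"Container": {"withExec": "Container", "directory": "Directory", "stdout": "String"},
	"Directory": {"file": "File", "entries": "StringList"},
	"File":      {"contents": "String"},
}

// call applies one tool to the current state and returns the next state.
// A tool that returns an object type switches context (and therefore the
// toolset); a tool that returns a scalar leaves the context unchanged.
func call(current, tool string) (string, error) {
	ret, ok := tools[current][tool]
	if !ok {
		return current, fmt.Errorf("tool %q is not available on %s", tool, current)
	}
	if _, isObject := tools[ret]; isObject {
		return ret, nil
	}
	return current, nil
}

func main() {
	state := "Container"
	for _, t := range []string{"withExec", "directory", "withExec"} {
		next, err := call(state, t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s --%s--> %s\n", state, t, next)
		state = next
	}
}
```

The last call fails because the context has moved to a Directory, which is the situation where the model needs a saved variable to get back to the Container.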

remote linden
shrewd cedar
astral star
#

@shrewd cedar are you thinking we need a different BBI for different models?

shrewd cedar