multi-object prompt/metaphor engineering | Dagger | Page 1

shrewd cedar Mar 17, 2025, 4:51 PM

#

(took a liberty with thread title here)

Yeah sure!

I'm a bit worried that our metaphor isn't coming across perfectly to the model. On Friday I tried a lot of subtle prompt engineering to suss out whether it truly understands that it can hop between tools and treat tool call results as restorable checkpoints, and that there's a "current object/state/context" that its tool calls are always relative to. I wasn't having much luck at all.

For example, despite already being in the context of a Container and being clearly told so (via yelling at it and teaching it on-the-fly), it would always call Container.from by chaining from the current Container instead of going back to a previous checkpoint or starting from scratch.

I even tried things like naming tools after their object: Container_xxh3-abcdef_from.

Having a few evals/scenarios (and agreeing on the ones we care about) would go a long way.

#

But this was all just with that toy example:

1. give me a container for PHP dev
2. install an editor (it chooses vim or nano)
3. instead of installing (vim or nano), go back to before you installed it and install (whatever) instead

If it really understood what was going on, I figure it should call _selectTools or whatever to swap back to the result of step 1 and install the new editor. And ideally step 3 would require minimal hints to figure that out, but that's just a bonus

I tried playing around with the tool call results to be more explicit, like Switched context to Container@xxh3:abcdef instead of just Container@xxh3:abcdef, but it seemed really resistant to using the result. It only did so if I explicitly told it "look at the tool call result".
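
The "restorable checkpoint" behavior hoped for in step 3 can be sketched in plain Go. This is a minimal illustration, assuming a made-up `State` type that only records the commands run so far; it is not Dagger's API and says nothing about how the model is actually prompted.

```go
package main

import "fmt"

// State is a hypothetical immutable snapshot: just the commands that
// produced it. It stands in for a Container result in the thread.
type State struct {
	Steps []string
}

// WithExec returns a NEW state and leaves the receiver untouched, so every
// earlier result stays usable as a checkpoint.
func (s State) WithExec(cmd string) State {
	steps := append(append([]string{}, s.Steps...), cmd)
	return State{Steps: steps}
}

func main() {
	base := State{}
	php := base.WithExec("install php") // step 1
	vim := php.WithExec("install vim")  // step 2
	// Step 3: branch from the step-1 checkpoint instead of chaining from step 2.
	nano := php.WithExec("install nano")

	fmt.Println(vim.Steps)
	fmt.Println(nano.Steps)
}
```

The point of the sketch is that `nano` never contains the vim step, which is what going back to the result of step 1 would look like.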

#

Alternatively, we could do something akin to what you were suggesting and update variables in-place as the result changes. My only hesitation is that something like go=$(container) could easily turn into a Directory or some other type depending on what APIs the model calls, which would make variables in the shell change in confusing ways.

But, there may still be something to that idea. Maybe variables are the wrong metaphor, and really they're just stateful slots or something abstract that an AI might grok more easily than a human.

I'll stop wording now.

astral star Mar 17, 2025, 5:46 PM

#

@shrewd cedar maybe/kinda related, but are you able to repro this? I also had this problem: #agents message

shrewd cedar Mar 17, 2025, 5:47 PM

#

astral star @shrewd cedar maybe/kinda related, but are you able to repro this? I als...

fixed on llm but need to ship

astral star Mar 17, 2025, 6:07 PM

#

still a few more meetings...

#

@shrewd cedar the tool descriptions themselves go straight into the context so they are high leverage

astral star Mar 17, 2025, 8:51 PM

#

Back

astral star Mar 17, 2025, 9:07 PM

#

@shrewd cedar if not too late, I'm around to talk live whenever

#

(can also wait if you're in the zone)

shrewd cedar Mar 17, 2025, 9:07 PM

#

am around 👍

#

dev audio?

shrewd cedar Mar 17, 2025, 11:06 PM

#

@astral star btw - I'm thinking that Loop() actually serves a purpose and shouldn't be deprecated in favor of Sync

    return dag.Llm().
        SetContainer("ctr", dag.Container()).
        WithPrompt("give me a container for PHP development").
        Loop().
        WithPrompt("now install an editor").
        Container()

without Loop we just send both prompts immediately, instead of doing A and then B, which has very different behavior

if I have to use Sync(ctx) that means I have to "break the chain"

low conviction, mostly sparking convo, it's been in the back of my head

#

or should WithPrompt imply a sync before each one?
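
The ordering difference described here (both prompts sent at once versus A and then B) can be modeled with a toy builder. This is only a sketch: `Chain`, `Loop`, and `Rounds` below are hypothetical stand-ins that track batching, not the real `dag.Llm()` type or its behavior.

```go
package main

import "fmt"

// Chain is a hypothetical lazy prompt builder. It only models ordering.
type Chain struct {
	pending []string   // prompts queued but not yet sent
	rounds  [][]string // each entry is one batch sent to the model
}

// WithPrompt queues a prompt without sending it.
func (c Chain) WithPrompt(p string) Chain {
	c.pending = append(append([]string{}, c.pending...), p)
	return c
}

// Loop flushes the queued prompts as one round and keeps the chain going.
func (c Chain) Loop() Chain {
	if len(c.pending) == 0 {
		return c
	}
	c.rounds = append(append([][]string{}, c.rounds...), c.pending)
	c.pending = nil
	return c
}

// Rounds flushes whatever is left and reports the batches in order.
func (c Chain) Rounds() [][]string { return c.Loop().rounds }

func main() {
	a, b := "give me a container for PHP development", "now install an editor"
	fmt.Println(len(Chain{}.WithPrompt(a).WithPrompt(b).Rounds()))        // both prompts in one round
	fmt.Println(len(Chain{}.WithPrompt(a).Loop().WithPrompt(b).Rounds())) // A, then B
}
```

Having `WithPrompt` imply a sync would amount to calling `Loop` inside `WithPrompt` in this model.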

remote linden Mar 18, 2025, 12:39 AM

#

let me know if you need any data points for model behavior here.

This is llama3.2 (small and relatively dumb, 3b)
single object without chaining works fine: llm | with-directory $source | with-prompt "you have a directory. what are its contents?" | last-reply (proceeds to do directory.entries(/) and show the result)

chaining works fine:

llm | with-directory $source | with-prompt "you have a directory. what are the contents of /README.md?" | last-reply 12.2s
│:adult: you have a directory. what are the contents of /README.md?

(proceeds to show contents of the file)

multi object fails hard:

> does my app test and build?

Based on the tool call response, it appears that there is an error in your app testing and building process.
│ ┃
│ ┃ The error message "object not found: HelloDagger_id" suggests that there was a problem trying to identify or locate a file or resource named "HelloDagger-id".
│ ┃
│ ┃ Additionally, the hash  xxh3:e17a70fccb0ec71a  appears to be an unknown object in your environment.
│ ┃
│ ┃ Therefore, I would recommend checking the following:
...

shrewd cedar Mar 18, 2025, 12:44 AM

#

found a pretty sweet pattern for writing + testing evals, will push to a branch for now

#

https://github.com/dagger/dagger/pull/9882

usage: dagger run go test

#

on this branch I re-enabled the multi-object tools at all times, instead of having them opt-in when you set a var, since the UndoSingle example consistently uses _undo, which seems nice

#

feel free to push more

remote linden Mar 18, 2025, 12:57 AM

#

Super cool. Are you going to script that to run with a bunch of different models?

shrewd cedar Mar 18, 2025, 12:58 AM

#

yeah that would be super easy. but i only have functioning creds for one atm 😛

#

actually two with gemini

#

what's extra slick is you can just throw Terminal() in anywhere to troubleshoot, and it works (thanks to the outer dagger run)

remote linden Mar 18, 2025, 1:00 AM

#

awesome

#

if you need access to my tailscale machine lmk!

shrewd cedar Mar 18, 2025, 1:01 AM

#

oh yea that'd be fun

remote linden Mar 18, 2025, 1:01 AM

#

and i can hook you up with my azure gpt-4o when you're ready but I don't have infinite credits there

shrewd cedar Mar 18, 2025, 1:02 AM

#

yeah probably running it too frequently to be worth it atm

#

maybe same for your tailscale lol

remote linden Mar 18, 2025, 1:02 AM

#

that one should be fine

#

it's pretty underutilized right now

shrewd cedar Mar 18, 2025, 1:03 AM

#

:thisisfine: <- kyle after my 100th test run

shrewd cedar Mar 18, 2025, 3:19 AM

#

man... Gemini seems really bad at this

#

      "FunctionDeclarations": [
        {
          "Name": "_objects",
          "Description": "List saved objects with their types. IMPORTANT: call this any time you seem to be missing objects for the request. Learn what objects are available, and then learn what tools they provide.",
          "Parameters": null
        },
        {
          "Name": "_selectTools",
          "Description": "Load an object's functions/tools. IMPORTANT: call this any time you seem to be missing tools for the request. This is a cheap option, so there is never a reason to give up without trying it first.",

│ │🧑 Mount Directory@xxh3:ada20082cb698371 into Container@xxh3:986f7c4cdae1dcaa and build ./cmd/booklit, with CGo disabled for maximum compatibility. Return the
│ │ ┃ binary as a File.
│ │ ┃ 0.0s
│ │
│ │🤖 Sorry, I cannot fulfill this request. The available tools do not support mounting directories or building binaries.

:tableflip:

astral star Mar 18, 2025, 3:27 AM

#

I still think we should try ditching "objects" and hardcode a selectMyFoo: select MyFoo, which is a Bar for each object "MyFoo" and its type "Bar"

#

to lighten the indirection load a little

shrewd cedar Mar 18, 2025, 3:35 AM

#

wdym? don't think we talked about that live

astral star Mar 18, 2025, 3:42 AM

#

No I forgot about that today 🙂 But just came back to me

#

Severance moment

remote linden Mar 18, 2025, 3:57 AM

#

so like selectMyFoo instead of _selectObject(MyFoo)?

#

What do you put in the object descriptions today? Have you tried adding a list of its functions to the description? Or even a whole list of function: function description pairs?

shrewd cedar Mar 18, 2025, 4:20 AM

#

astral star I still think we should try ditching "objects" and hardcode a `selectMyFoo: sele...

ah i see, so instead of this:

_objects => List the objects by name + hash
=> - foo (Container@xxh3:...)
=> - bar (Directory@xxh3:...)
_selectTools => Select an object by name or hash

We do this?:

_selectFoo => Select foo, which is a Container
_selectBar => Select bar, which is a Directory
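
A rough sketch of how the per-object select tools could be generated from the current bindings. The `Tool` struct and `selectTools` function are made up for illustration (loosely shaped like the FunctionDeclarations dump earlier); the real implementation may look nothing like this.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tool is a hypothetical name + description pair.
type Tool struct {
	Name        string
	Description string
}

// toolName turns a variable name like "foo" into "_selectFoo".
func toolName(v string) string {
	if v == "" {
		return "_select"
	}
	return "_select" + strings.ToUpper(v[:1]) + v[1:]
}

// selectTools builds one tool per named object, so the model needs no
// separate "list objects" step. bindings maps a variable name to its type.
func selectTools(bindings map[string]string) []Tool {
	names := make([]string, 0, len(bindings))
	for name := range bindings {
		names = append(names, name)
	}
	sort.Strings(names) // stable order keeps the tool list deterministic

	tools := make([]Tool, 0, len(names))
	for _, name := range names {
		tools = append(tools, Tool{
			Name:        toolName(name),
			Description: fmt.Sprintf("Select %s, which is a %s", name, bindings[name]),
		})
	}
	return tools
}

func main() {
	for _, t := range selectTools(map[string]string{"foo": "Container", "bar": "Directory"}) {
		fmt.Printf("%s => %s\n", t.Name, t.Description)
	}
}
```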

astral star Mar 18, 2025, 4:21 AM

#

shrewd cedar ah i see, so instead of this: ``` _objects => List the objects by name + hash =>...

yes 👍 to try and reduce the number of cognitive steps

shrewd cedar Mar 18, 2025, 4:21 AM

#

worth a shot

shrewd cedar Mar 18, 2025, 4:46 AM

#

astral star yes 👍 to try and reduce the number of cognitive steps

openai handles it well, even seems more eager to set "checkpoint" vars, seems promising: https://v3.dagger.cloud/dagger/traces/64aeb30a57ba3304e61c6539d565a8e7?span=c3d0ac0ba09a6975
gemini is still struggling: https://v3.dagger.cloud/dagger/traces/11a1c3389b5398e536b021c4d9394f9a?span=ab47483be3e4cc5a

Dagger Cloud

Browse and visualize Dagger traces.

shrewd cedar Mar 18, 2025, 3:08 PM

#

maybe we can add a list of all the functions to the _select_foo description to help gemini along :thinkspin: trying that

#

also, technically we have the ability to represent the tools differently for each model, though it would be obviously great to not have to

#

i wonder if gemini would even handle just having a big mess of tools with its big ol context window, like foo_withExec, bar_withDirectory etc etc etc

remote linden Mar 18, 2025, 3:11 PM

#

shrewd cedar maybe we can add a list of all the functions to the `_select_foo` description to...

yeah that's what I was thinking 🙂

#

I just read this: https://langchain-ai.github.io/langgraph/how-tos/many-tools/
It's basically the same flow we have now except the _selectObject(s) is done automatically before sending the request based on the prompt message. Feels kind of error prone and complicated for our case though

How to handle large numbers of tools

Build language agents as graphs

shrewd cedar Mar 18, 2025, 3:43 PM

#

ok now that tool calling is fixed, can confirm claude passes these evals, but it's much slower:

claude 1m30s: https://v3.dagger.cloud/dagger/traces/d78ed1d6163a2115a4afa7486256448c
openai 21.9s: https://v3.dagger.cloud/dagger/traces/68152b1b4505f5b4c6c41d9b410a2dfb

#

cc @night crystal @velvet mauve if y'all are curious about model testing, this is the current thread

night crystal Mar 18, 2025, 3:46 PM

#

just to confirm - this is for testing how the models perform right? as opposed to the core/integration/llm_test.go, which is more focused on the engine functionality side, and ensuring we don't break everything every time we make engine changes.

or do you see them as the same thing, with this replacing what we were working on in https://github.com/dagger/dagger/pull/9863 ?

astral star Mar 18, 2025, 3:48 PM

#

@night crystal it's to test how well Dagger's BBI system (eg. mapping of Dagger API to LLM tools) works with different models.

#

So it's designed to guide improvements on the Dagger side, with a repeatable loop

night crystal Mar 18, 2025, 3:49 PM

#

right okay, so it is different - both important, testing different bits of the llm tooling

astral star Mar 18, 2025, 3:49 PM

#

right

night crystal Mar 18, 2025, 3:49 PM

#

got it 🎉

astral star Mar 18, 2025, 3:49 PM

#

This is more like UI testing 🙂

#

Except the user is the llm

#

So it can be automated

night crystal Mar 18, 2025, 3:50 PM

#

:chefkiss: big brain moment for sure

#

it's neat, i like it 😄

velvet mauve Mar 18, 2025, 3:56 PM

#

night crystal *right* okay, so it is different - both important, testing different bits of the...

I do think this sorta thing would completely obviate any testing of client impls that we were talking about here https://github.com/dagger/dagger/pull/9863#discussion_r1998490862

GitHub

tests: add llm integration tests by jedevc · Pull Request #9863 · d...

night crystal Mar 18, 2025, 3:56 PM

#

oh yeah! that would be cool 😄

shrewd cedar Mar 18, 2025, 4:21 PM

#

geeze louise gemini has such pathetic rate limits. but, adding the function list to _select_ description helped for sure: https://v3.dagger.cloud/dagger/traces/f606d81dab9c507241a6363895564000
it didn't work, but it was fast 😎

#

seems to enjoy hopping around from var to var for no real reason. maybe it should have chosen better var names lol (ctr vs container)

#

maybe we just need to retry on 429 requests with gemini

remote linden Mar 18, 2025, 4:31 PM

#

it didn't work, but it was fast

🤣

#

if gemini is being a pain with rate limits, llama3.1 might be a good benchmark for a generally capable model that works with moderate complexity

shrewd cedar Mar 18, 2025, 4:35 PM

#

to be fair to gemini i guess i'm effectively on the free tier

#

but yeah, i want to try llama too

remote linden Mar 18, 2025, 4:35 PM

#

yeah same. I don't run into the 429s as much but it's my go-to when qwen fails because it produces great results with the right amount of constraints

velvet mauve Mar 18, 2025, 5:16 PM

#

shrewd cedar to be fair to gemini i guess i'm effectively on the free tier

we're really stepping across the "we need dagger team API accounts and automated billpay" threshold here lol...

#

every person having a different subset of accounts and submitting different expense reports works up to a point lol

astral star Mar 18, 2025, 6:35 PM

#

It's in motion I believe

remote linden Mar 18, 2025, 9:34 PM

#

@shrewd cedar here's an interesting one from another thread https://v3.dagger.cloud/kpenfound/traces/5e5076ed61ae41bb4fc399beaf370a0b#3dcec1eb66f3f067

Based on the information provided in the tool_response, it seems there was an error with executing a command, specifically the “which firefox” command. However, since you asked for each function call to be represented as JSON within XML tags based on the functions described above, I’ll construct a response that reflects how one might wrap such an error scenario (in terms of function calls) given the provided tool capabilities.

Is that guidance coming from our end or anthropic?

shrewd cedar Mar 18, 2025, 9:36 PM

#

remote linden @shrewd cedar here's an interesting one from another thread https://v3.d...

it's likely making assumptions based on our custom error response formatting, where we include <stdout>...</stdout> and <stderr>...</stderr>

remote linden Mar 18, 2025, 9:37 PM

#

ah ok, interesting

shrewd cedar Mar 18, 2025, 11:13 PM

#

here's a fun one - gpt-4o vs. claude-3-5-sonnet-latest vs. gemini-2.0-flash: https://v3.dagger.cloud/dagger/traces/d8e549ff977e062e2a020b8e37b552a7

#

this is with a beefed up system prompt

#

You are an AI assistant that interacts with an immutable GraphQL API by calling tools that return new state objects.
Instead of modifying objects in place, each tool call produces a new state, which updates the available set of tools.
Your environment changes dynamically as you navigate through different states.

State is preserved, and previous states can be accessed by saving them as variables using the _save tool.
To explore effectively, prioritize discovering new states over efficiency.
You may need to make exploratory tool calls to understand the available actions.

Your goal is to autonomously interact with the API, selecting and chaining tools to achieve tasks.
When completing a task, save the final result using the _save tool.
(arrived at that prompt via https://chatgpt.com/share/67d9fe96-42f4-8008-b75c-da65ace5ffee)

remote linden Mar 18, 2025, 11:16 PM

#

they all passed? that's sweet. Timelines are interesting

shrewd cedar Mar 18, 2025, 11:17 PM

#

yeah - these all have assertions against the end result, too. for example UndoSingle actually ensures there isn't an apt-get remove nano layer in there, to verify it used _undo instead
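
The kind of end-result assertion described for UndoSingle can be sketched as a check over the commands that produced the final container. The `history` slice and both helper functions are assumptions made for illustration; the actual eval code lives in the PR linked earlier and is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// hasLayer reports whether any command in history contains every given word.
func hasLayer(history []string, words ...string) bool {
	for _, cmd := range history {
		all := true
		for _, w := range words {
			if !strings.Contains(cmd, w) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

// checkUndo fails if the result was reached by removing nano on top of the
// old state rather than by rewinding to before it was installed.
func checkUndo(history []string) error {
	if hasLayer(history, "apt-get", "remove", "nano") {
		return errors.New("found a removal layer: patched forward instead of rewinding")
	}
	return nil
}

func main() {
	rewound := []string{"apt-get install -y php", "apt-get install -y vim"}
	patched := []string{"apt-get install -y php", "apt-get install -y nano", "apt-get remove -y nano", "apt-get install -y vim"}
	fmt.Println(checkUndo(rewound))
	fmt.Println(checkUndo(patched))
}
```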

astral star Mar 18, 2025, 11:23 PM

#

are these runs of a fancy new eval system? 🙂

shrewd cedar Mar 18, 2025, 11:29 PM

#

astral star are these runs of a fancy new eval system? 🙂

kinda yea (https://discordapp.com/channels/707636530424053791/1351233555623051306/1351358530489286706) - just running manually for now but good for a feedback loop.

now I'm missing BBI-as-module though 😛 - have to run with a new dev engine for every change. not too bad at least, ~48s each time now that i'm on linux 🚗

#

would be a great place to keep piling on more examples, if you have any top of mind

remote linden Mar 18, 2025, 11:36 PM

#

is the goal for all models to pass all tests or to determine where the line is?

#

maybe we'd have a table of how each model does with each test? ✅ ❌
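
A pass/fail table like that is easy to render once results are collected. A minimal sketch, assuming results arrive as a nested map; the eval and model names in `main` are placeholders, not real results from this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// renderMatrix prints one row per eval and one column per model.
// results[eval][model] is true when that run passed; missing entries
// count as failures.
func renderMatrix(evals, models []string, results map[string]map[string]bool) string {
	var b strings.Builder
	b.WriteString("eval")
	for _, m := range models {
		b.WriteString("\t" + m)
	}
	b.WriteString("\n")
	for _, e := range evals {
		b.WriteString(e)
		for _, m := range models {
			mark := "❌"
			if results[e][m] {
				mark = "✅"
			}
			b.WriteString("\t" + mark)
		}
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	// Placeholder data purely to show the layout.
	results := map[string]map[string]bool{
		"evalA": {"modelX": true, "modelY": false},
		"evalB": {"modelX": true, "modelY": true},
	}
	fmt.Print(renderMatrix([]string{"evalA", "evalB"}, []string{"modelX", "modelY"}, results))
}
```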

shrewd cedar Mar 18, 2025, 11:43 PM

#

yeah i think it's 100% a goal for us to get them passing across all the models, but there'll be "a line" that hopefully encompasses more and more over time

#

we also have the option to specialize for each model, i suppose

remote linden Mar 18, 2025, 11:44 PM

#

Gotcha. If we had some vague prompt like "I gave you a container. Do you see it?" it would probably only pass on half of the models today but it might be worth measuring that anyway

#

basically - failures are still good data points

shrewd cedar Mar 18, 2025, 11:45 PM

#

yeah exactly

#

will also be fun to measure perf variance

#

since that'll cover both a) AI model raw performance (OpenAI, Gemini good, Claude bad), and b) AI models that mess up along the way or have high variance

remote linden Mar 18, 2025, 11:47 PM

#

Yeah with all that in mind it's probably worth having a few tests that are different levels of easier than the current single object test. Maybe one with directory and one with a super small module with no complex types

astral star Mar 18, 2025, 11:48 PM

#

I worry that the tests are super mixed with the design details we're supposed to be evaluating. but maybe we can fix the test suite along the way

#

(ie. variables expanding to hashes in "Mount $repo into $ctr and build ./cmd/booklit")

#

how hard is it to add auto-save to variables? I'm out here hardcoding "...save back to the same variable when you're done" in every prompt I can find

shrewd cedar Mar 18, 2025, 11:55 PM

#

astral star I worry that the tests are super mixed with the design details we're supposed to...

add any proposed alternatives as another test 🙂 they're not authoritative, it's a feedback loop for everything essentially

astral star Mar 18, 2025, 11:55 PM

#

shrewd cedar add any proposed alternatives as another test 🙂 they're not authoritative, it's...

will do! yeah makes sense

shrewd cedar Mar 18, 2025, 11:58 PM

#

astral star how hard is it to add auto-save to variables? I'm out here hardcoding "...save b...

it seems like it would be hard to determine when to auto-save in-place and when to set a new var, for example i always think of them as the initial variables for the prompt, i don't expect it to "pollute" the values I set

#

i did change the system prompt to say "save the result to a variable at the end" which seems to work well. it just makes up a more descriptive var name

astral star Mar 19, 2025, 12:10 AM

#

shrewd cedar i did change the system prompt to say "save the result to a variable at the end"...

yeah but that makes it hard to wrap in code

#

(unpredictable var name)

shrewd cedar Mar 19, 2025, 12:11 AM

#

yup

astral star Mar 19, 2025, 12:11 AM

#

shrewd cedar it seems like it would be hard to determine when to auto-save in-place and when ...

I think select vs. duplicate would solve that

shrewd cedar Mar 19, 2025, 12:12 AM

#

well, you can always tell it a var to set, too, it's just the magic reassigning that seems hard to specify

astral star Mar 19, 2025, 12:12 AM

#

I would just always reassign, no specifying needed

#

as part of bbi

#

if you want to modify & save to new var, you duplicate, select new var, change in place

shrewd cedar Mar 19, 2025, 12:15 AM

#

https://tenor.com/NUte.gif

Tenor

#

it's pretty hard to resist my normal functional programming sensibilities here

#

the way the examples are currently written is what felt most natural to me: i give it either one input, or many named inputs, and it gives me a result

shrewd cedar Mar 19, 2025, 12:18 AM

#

astral star I think `select` vs. `duplicate` would solve that

trying this out now, curious how it goes

astral star Mar 19, 2025, 12:21 AM

#

shrewd cedar the way the examples are currently written is what felt most natural to me: i gi...

Same.. i can relate

remote linden Mar 19, 2025, 12:28 AM

#

My preference between these scenarios is to always overwrite the variable and handle undo some other way

shrewd cedar Mar 19, 2025, 12:44 AM

#

to confirm, this is 100% motivated by dagger shell interactive UX, right?

shrewd cedar Mar 19, 2025, 12:47 AM

#

astral star I think `select` vs. `duplicate` would solve that

bearing in mind this will cost an additional tool slot for each var you have, which will grow quickly as the model sets vars of its own

#

we're going from 2 tools, to N, to 2*N, with an N that increases as the request goes on

remote linden Mar 19, 2025, 12:48 AM

#

yeah I have my doubts for that reason

remote linden Mar 19, 2025, 12:48 AM

#

shrewd cedar to confirm, this is 100% motivated by `dagger shell` interactive UX, right?

I'm thinking about code

shrewd cedar Mar 19, 2025, 12:49 AM

#

do you have any code around that uses multi-object? curious what patterns you landed on

#

openai tools limit is 128, not sure about other vendors
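
To put numbers on the tool-slot concern: with select and duplicate each costing one tool per variable, the budget under a 128-tool limit shrinks fast. The arithmetic below is a sketch; the figure of 40 tools for the selected object is an arbitrary illustration, not a measured value.

```go
package main

import "fmt"

// toolCount estimates how many tool slots are in use. fixed is the number
// of always-present tools, perVar is how many tools each variable adds
// (1 for select only, 2 for select + duplicate), and objTools is the size
// of the currently selected object's own toolset.
func toolCount(fixed, perVar, vars, objTools int) int {
	return fixed + perVar*vars + objTools
}

// maxVars returns how many variables fit under limit.
func maxVars(limit, fixed, perVar, objTools int) int {
	if perVar <= 0 {
		return 0
	}
	free := limit - fixed - objTools
	if free < 0 {
		return 0
	}
	return free / perVar
}

func main() {
	const limit = 128 // OpenAI's tool limit, per the thread
	fmt.Println(maxVars(limit, 0, 1, 40)) // select only
	fmt.Println(maxVars(limit, 0, 2, 40)) // select + duplicate
}
```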

remote linden Mar 19, 2025, 1:06 AM

#

just one i've barely started, i can fix it up and push it

shrewd cedar Mar 19, 2025, 1:07 AM

#

astral star how hard is it to add auto-save to variables? I'm out here hardcoding "...save b...

do you have examples for this too?

remote linden Mar 19, 2025, 1:45 AM

#

shrewd cedar do you have any code around that uses multi-object? curious what patterns you la...

😩 Input Tokens: 162,673 ◆ Output Tokens: 84

shrewd cedar Mar 19, 2025, 1:46 AM

#

got a trace?

remote linden Mar 19, 2025, 1:46 AM

#

https://github.com/kpenfound/agents/tree/multiobj/coder-and-reviewer

GitHub

agents/coder-and-reviewer at multiobj · kpenfound/agents

Contribute to kpenfound/agents development by creating an account on GitHub.

#

yeah

#

https://v3.dagger.cloud/kpenfound/traces/a7fbeb98650f01f9aa388642bc8591cb

#

the wheels fell off

shrewd cedar Mar 19, 2025, 1:49 AM

#

where does it have the token spike? (need to implement metrics in Cloud...)

remote linden Mar 19, 2025, 1:50 AM

#

looks like it's around 600-1k until we get here

Okay, I have the details for both the  Coder  and the  Reviewer  objects. Now I can proceed with simulating the review loop. Since I cannot determine the reviewer's satisfaction, I
│ ┃ will run the loop three times.
│ ┃
│ ┃ Iteration 1:
│ ┃
│ ┃ 1. Coder develops the assignment in the source directory.
│ ┃ 2. Reviewer reviews the developed code.
│ ┃ 1.7s ◆ Input Tokens: 162,673 ◆ Output Tokens: 84

and then it's 160k from then on

shrewd cedar Mar 19, 2025, 1:52 AM

#

ah I think it's the _undo call - it was dumping the object JSON in the tool response

#

it's fixed on llm-evals

#

will backport

remote linden Mar 19, 2025, 1:53 AM

#

sweet, this is on llm.11 btw so I don't know if I'm missing any other goodies

#

also I have
SetString("assignment", assignment).
and

<assignment>
$assignment
</assignment>

in my prompt file and it doesn't seem to be picking up the assignment. It's not getting substituted in the prompt, which I guess is expected? so I have to prompt to look up "assignment"?
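
For reference, the substitution being asked about can be sketched with Go's standard `os.Expand`. This shows the general `$name` expansion idea only; how Dagger performs it internally is not shown in this thread, and `expandPrompt` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
)

// expandPrompt replaces $name and ${name} with values from vars. Unknown
// names are left in place so a missing binding is easy to spot.
func expandPrompt(prompt string, vars map[string]string) string {
	return os.Expand(prompt, func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$" + name
	})
}

func main() {
	prompt := "<assignment>\n$assignment\n</assignment>"
	vars := map[string]string{"assignment": "write a CLI that prints hello"}
	fmt.Println(expandPrompt(prompt, vars))
}
```

Leaving unknown names untouched means a prompt that still contains a literal `$assignment` points at an unset variable rather than an empty value.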

astral star Mar 19, 2025, 2:02 AM

#

remote linden also I have `SetString("assignment", assignment).` and ``` <assignment> $assignm...

It should be substituted - at least it used to

#

you have to set the variable first

remote linden Mar 19, 2025, 2:05 AM

#

oh wait I think the outer agent was just driving wrong (there's layered agents in this one)

#

it passed "assignment" as assignment and that made things hard to read lol

shrewd cedar Mar 19, 2025, 2:05 AM

#

remote linden also I have `SetString("assignment", assignment).` and ``` <assignment> $assignm...

yeah that should work. I'm trying out not having $ctr expand to Container@xxh3:... - that always worked by accident

remote linden Mar 19, 2025, 2:10 AM

#

ah geez I'm getting an engine crash now

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a343af]

goroutine 170336 [running]:
github.com/dagger/dagger/core.(*LLMEnv).typedef(0xc004ce17f8?, 0xc004ce17c8?, {0x0, 0x0})
        /app/core/llm_env.go:57 +0x2f
github.com/dagger/dagger/core.(*LLMEnv).Tools(0xc0033c7b60, 0xc000395f10)
        /app/core/llm_env.go:120 +0x154
github.com/dagger/dagger/core.(*LLM).Sync(0xc003b9a000, {0x36253f0, 0xc0033c79b0}, 0xc000395f10)
        /app/core/llm.go:570 +0xdd
github.com/dagger/dagger/core.(*LLM).Get(0x29abda0?, {0x36253f0?, 0xc0033c79b0?}, 0x4?, {0xc0011593f7, 0x9})
        /app/core/llm.go:732 +0x27
github.com/dagger/dagger/core.LLMHook.ExtendLLMType.func2({0x36253f0, 0xc0033c79b0}, {0x3639fc0?, 0xc002438580?}, 0x0?)
        /app/core/llm.go:887 +0xd6

#

https://v3.dagger.cloud/kpenfound/traces/406ffaf0664a634ae59b241e4a9ac199

shrewd cedar Mar 19, 2025, 1:42 PM

#

:portalfrom: https://discordapp.com/channels/707636530424053791/1351637552263598160/1351759568136310836 - just connecting threads since there was some cross-talk

remote linden Mar 19, 2025, 7:39 PM

#

this is really low on my to-do list but I really want to finetune qwen2.5-coder:7b using an expanded set of these evals to make it really good at function calling our way

shrewd cedar Mar 20, 2025, 3:01 AM

#

Still chipping away at this auto-updating bindings paradigm... It's hard to tell if I'm just prompting poorly or if it's less intuitive to the model. I'm testing with Gemini because I had just gotten it to perform pretty well with the variables model before experimenting with this

#

I was starting to buy into it last night, but the nagging feeling that this paradigm is swimming upstream hasn't completely gone away. Results haven't been encouraging yet either but I'll push to a branch if anyone else wants to tag-team it, maybe I'm framing it poorly

#

also, lol

remote linden Mar 20, 2025, 3:07 AM

#

https://tenor.com/view/new-high-score-smiling-friends-new-record-new-personal-best-9-trillion-points-gif-11924645167628210139

Tenor

remote linden Mar 20, 2025, 3:15 AM

#

shrewd cedar I was starting to buy into it last night, but the nagging feeling that this para...

I think you're right. I've been trying to come up with alternatives that might fit better but haven't had any great ideas yet. That's where the idea to finetune is coming from

shrewd cedar Mar 20, 2025, 3:15 AM

#

here's the branch if anyone wants to tinker and try to get better results: https://github.com/vito/dagger/tree/llm-bindings

#

pivoting back to vars for a bit to see if it's actually as solid as I remember

shrewd cedar Mar 20, 2025, 3:17 AM

#

remote linden I think you're right. I've been trying to come up with alternatives that might f...

what's finetune?

#

or do you mean the general concept lol

remote linden Mar 20, 2025, 3:23 AM

#

basically this https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama

Tutorial: How to Finetune Llama-3 and Use In Ollama | Unsloth Docum...

Beginner's Guide for creating a customized personal assistant (like ChatGPT) to run locally on Ollama

#

but somehow using dagger's API to make it really great at dagger toolcalling hand waving

shrewd cedar Mar 20, 2025, 3:24 AM

#

oh cool. is that like a process for making a specialized model?

#

(sorta)

remote linden Mar 20, 2025, 3:25 AM

#

as a total llm noob, i believe so

#

if it works, it wouldn't be super practical for most users but it would be cool to have a pretty reliable model that can be run locally and just use a big ol' honkin model otherwise

shrewd cedar Mar 20, 2025, 3:30 AM

#

yeah for sure

astral star Mar 20, 2025, 4:10 AM

#

Applying fine-tuning by reinforcement learning (RL) specifically for agents is the hot thing to do in 25, a bunch of startups do it behind the scenes. The big labs were hoarding the tricks but DeepSeek papers busted it open so now everyone is rushing to publish too.

Potentially Dagger could achieve fantastic results because we can combine 1) the schema 2) the traces, 3) the prompts, and ideally 4) the errors (or user feedback ie thumbs up or down). Then we could eventually automate the production of specialized models for each module.

remote linden Mar 20, 2025, 4:21 AM

#

Yeah that sounds super cool. Obviously we should try to work out of the box for the big ones like gemini, claude, gpt-4o, but it seems reasonable to expect some specialization to get a 3b model to produce good results

shrewd cedar Mar 20, 2025, 4:01 PM

#

I can't tell if this idea is :bigbrain: or :FrogeDumb: : https://v3.dagger.cloud/dagger/traces/6b88bbefa1d0c773084089c4650cc823

Writing an agent that iterates on the system prompt for me, running an eval after each prompt change and analyzing the history (+ verifying the result.)

It works but the outer AI model is usually just as dumb as the inner one. I guess I should use a really smart model to work on the prompt for a dumber model.

If this pays off I'd like to extend it further to iterate on tool descriptions and different schemes at that layer too
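
The outer loop described here (change the prompt, run the eval, look at the result, repeat) reduces to something like the sketch below. `improve`, `propose`, and `eval` are hypothetical hooks; in the real setup they would be an outer model rewriting the prompt and an eval run scoring it, and the toy stand-ins in `main` exist only to make the example runnable.

```go
package main

import (
	"fmt"
	"strings"
)

// improve runs a propose -> evaluate loop and keeps the best-scoring prompt.
func improve(seed string, rounds int, propose func(prompt string, score int) string, eval func(prompt string) int) (string, int) {
	best, bestScore := seed, eval(seed)
	candidate, score := seed, bestScore
	for i := 0; i < rounds; i++ {
		candidate = propose(candidate, score)
		score = eval(candidate)
		if score > bestScore {
			best, bestScore = candidate, score
		}
	}
	return best, bestScore
}

func main() {
	// Toy proposer: appends one canned hint per round.
	hints := []string{" Each tool call returns a new state.", " Use _save to keep results."}
	next := 0
	propose := func(p string, _ int) string {
		if next < len(hints) {
			p += hints[next]
			next++
		}
		return p
	}
	// Toy eval: one point per keyword present in the prompt.
	eval := func(p string) int {
		score := 0
		for _, kw := range []string{"new state", "_save"} {
			if strings.Contains(p, kw) {
				score++
			}
		}
		return score
	}
	best, score := improve("You are an AI assistant.", 3, propose, eval)
	fmt.Println(score, best)
}
```

Using a stronger model for `propose` than the one under test matches the "really smart model to work on the prompt for a dumber model" idea.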

#

It's at least a good way for me to dogfood the agent/workspace patterns

shrewd cedar Mar 20, 2025, 10:33 PM

#

the idea of a "state machine" is still banging around in my head - might be an effective metaphor to teach the model how tools/objects work (each object is a state machine, calling tools transitions to a new state)

https://developer.mozilla.org/en-US/docs/Glossary/State_machine

MDN Web Docs

State machine - MDN Web Docs Glossary: Definitions of Web-related t...

A state machine is a mathematical abstraction used to design algorithms. A state machine reads a set of inputs and changes to a different state based on those inputs.
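
The state machine framing can be made concrete with a small sketch: each object type has its own set of tools, and calling one yields a new state with a possibly different toolset. The transition table below is illustrative only and is not the real Dagger schema; `Object`, `Tools`, and `Call` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a hypothetical state: a type name plus an opaque ID.
type Object struct{ Type, ID string }

// transitions lists, per object type, which tools exist and what type each
// one returns.
var transitions = map[string]map[string]string{
	"Container": {"withExec": "Container", "directory": "Directory"},
	"Directory": {"file": "File", "withNewFile": "Directory"},
	"File":      {},
}

// Tools returns the tool names available in the current state.
func (o Object) Tools() []string {
	names := make([]string, 0, len(transitions[o.Type]))
	for name := range transitions[o.Type] {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// Call transitions to a new state, or errors if the tool does not exist
// in the current state. The receiver is never modified.
func (o Object) Call(tool string) (Object, error) {
	next, ok := transitions[o.Type][tool]
	if !ok {
		return o, fmt.Errorf("%s has no tool %q", o.Type, tool)
	}
	return Object{Type: next, ID: o.ID + "." + tool}, nil
}

func main() {
	cur := Object{Type: "Container", ID: "ctr"}
	fmt.Println(cur.Type, cur.Tools())
	next, err := cur.Call("directory")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(next.Type, next.Tools())
	fmt.Println(cur.Type, cur.Tools()) // the earlier state is still available
}
```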

shrewd cedar Mar 21, 2025, 3:41 AM

#

https://v3.dagger.cloud/dagger/traces/fbcaba5971508bc2cb7b6f210761e622?listen=df507ce865de0b2c&listen=eebb952c922ae237&showHidden=f82a12d568073a87&showHidden=fb8bfeb9bda35c73#114ead400966c156

ok this is pretty fun - Gemini is great for an agentic prompt development feedback loop, since it's so fast

shrewd cedar Mar 21, 2025, 3:57 AM

#

gpt-4o hits a rate limit if i try the same

#

got a 5/5 with Gemini when it generated this prompt 🤯 https://v3.dagger.cloud/dagger/traces/0f2ba7e5fd0bee5a4b89a55866fb9dc0

You are a functional state machine interacting with a GraphQL API through tools that align with the current state object. Each state change returns a new object, which updates the available set of tools. When a field returns an object type, it becomes the new context, replacing the current toolset. Use tools like '_save' to assign current objects to variables and to pass IDs for operations that require them.