function masks | Dagger | Page 1

timid barn Mar 27, 2025, 3:27 AM

#

ah dang Claude 3.5 has a bit of trouble - let's see if I can make some tweaks:

│ │ │🤖 Now we have successfully:
│ │ │ ┃
│ │ │ ┃ 1. Mounted Directory#1 at /src
│ │ │ ┃ 2. Set /src as the working directory
│ │ │ ┃ 3. Set CGO_ENABLED=0
│ │ │ ┃ 4. Built the binary
│ │ │ ┃
│ │ │ ┃ However, I notice that the available tools don't include a way to directly return a File from the container. You might need to use additional tools or specify how you'd like to access the resulting binary. The binary should be built as "booklit" in the working directory, but we'll need additional instructions on how you'd like to access or export it.
│ │ │ ┃
│ │ │ ┃ Would you like to specify how you'd like to access or export the resulting binary file?

oak zodiac Mar 27, 2025, 3:27 AM

#

Go go go!

#

What does the API look like with function masks ?

timid barn Mar 27, 2025, 3:29 AM

#

There's just a new required functions: []string arg for the existing select<Type> tool, and the model passes in the ones it wants, from the list of functions in the tool description.

#

So it kind of becomes selectFooTools (hey maybe I'll try that)
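
A minimal sketch of the masking idea described above, assuming the select tool simply validates the requested names against the advertised list and enables that subset. The name `selectTools` and its signature are hypothetical, not Dagger's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// selectTools is a hypothetical stand-in for the select<Type> tool: it takes
// the function names the model asked for and returns the subset to expose,
// rejecting any name that is not in the advertised list.
func selectTools(available, requested []string) ([]string, error) {
	known := make(map[string]bool, len(available))
	for _, name := range available {
		known[name] = true
	}

	seen := make(map[string]bool, len(requested))
	var enabled []string
	for _, name := range requested {
		if !known[name] {
			return nil, fmt.Errorf("unknown function %q", name)
		}
		if !seen[name] { // ignore duplicate requests
			seen[name] = true
			enabled = append(enabled, name)
		}
	}
	sort.Strings(enabled)
	return enabled, nil
}

func main() {
	// Assumed list of functions advertised in the tool description.
	available := []string{"directory", "file", "withExec", "withMountedDirectory", "withWorkdir"}

	enabled, err := selectTools(available, []string{"withExec", "file"})
	fmt.Println(enabled, err)
}
```

Rejecting unknown names outright, rather than silently dropping them, gives the model an error it can react to on the next turn.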

#

damn, no cigar

🤖 The binary has been built. Unfortunately, I don't see a direct tool available to return the binary as a File object. I'd need additional tools like file or similar to retrieve the built binary. Could you please provide the necessary tool to access the built binary file from the container?

timid barn Mar 27, 2025, 4:37 AM

#

man, it's sooo close but i'm seeing some other subtle misbehavior resulting from it, i fear the problem may run deep:

│ │ │🤖 selectContainerTools(
│ │ │ │ │ tools: ["withMountedDirectory", "withWorkdir", "withExec", "directory"]
│ │ │ │ │ object: ✔ Container.withEnvVariable(name: "BUSTER", value: "8-2025-03-27 04:24:48.203165361 +0000 UTC m=+0.178601237"): Container! 0.0s
│ │ │ │ ): String! 0.0s
│ │ │
│ │ │🤖 0.5s ◆ Input Tokens: 1,377 ◆ Output Tokens: 12
│ │ │
│ │ │ $ Container.withMountedDirectory(
│ │ │ │ │ path: "/app"
│ │ │ │ │ source: $ GitRef.tree: Directory! 0.1s CACHED
│ │ │ │ ): Container! 0.1s CACHED
│ │ │
│ │ │🤖 0.5s ◆ Input Tokens: 1,410 ◆ Output Tokens: 8
│ │ │
│ │ │ ✔ Container.withWorkdir(path: "/app"): Container! 0.0s
│ │ │
│ │ │🤖 0.6s ◆ Input Tokens: 1,439 ◆ Output Tokens: 23
│ │ │
│ │ │ ✔ Container.withExec(args: ["go", "build", "-o", "booklit", "./cmd/booklit"]): Container! 2.9s
│ │ │
│ │ │🤖 0.6s ◆ Input Tokens: 1,482 ◆ Output Tokens: 9
│ │ │
│ │ │ ✘ Container.directory(path: "/app/booklit"): Directory! 0.0s
│ │ │ ! path /app/booklit is a file, not a directory
│ │ │
│ │ │🤖 My apologies. It seems I made a mistake and tried to retrieve the compiled binary as a directory. I should have retrieved it as a file instead. However, there is no tool to directly retrieve a file. I will have to retrieve the parent directory.
│ │ │ ┃ 1.0s ◆ Input Tokens: 1,516 ◆ Output Tokens: 58
│ │ │
│ │ │ ✔ Container.directory(path: "/app"): Directory! 0.0s
│ │ │
│ │ │🤖 Now that I have the directory containing the binary, I need to select the right tools to work with it.
│ │ │ ┃ 0.7s ◆ Input Tokens: 951 ◆ Output Tokens: 32
│ │ │
│ │ │🤖 selectDirectoryTools(
│ │ │ │ │ tools: ["file"]
│ │ │ │ │ object: ✔ Container.directory(path: "/app"): Directory! 0.0s
│ │ │ │ ): String! 0.0s
│ │ │
│ │ │🤖 0.4s ◆ Input Tokens: 1,031 ◆ Output Tokens: 6
│ │ │
│ │ │ ✔ Directory.file(path: "booklit"): File! 0.0s

basically, it:

1. grabbed Container.directory (not sure why actually)
2. later realized it needed to grab a file
3. instead of enabling Container.file, it just YOLO'd and ran Container.directory
4. eventually got there in the end (via Directory.file), but that seems a bit lucky

(though I'm making a lot of inferences here, in reality the train might have gone off the rails at step 1)

Tried a few tricks with injecting prompts into tool descriptions, but not quite there yet. Ah well. Will keep tinkering tomorrow but this probably can't safely make the release.

Pushed here if anyone wants to try some things: https://github.com/dagger/dagger/pull/9986

Here's how I'm testing it: dagger-dev -m github.com/vito/daggerverse/botsbuildingbots --model gemini-2.0-flash call --attempts 10 evaluate --model gemini-2.0-flash --eval BuildMulti

timid barn Mar 27, 2025, 4:20 PM

#

Another bizarre mistake:

│🧑 Evaluate BuildMulti against gpt-4o, tell me the success rate, and analyze the results for every attempt.
│ ┃ 0.0s
│
│🤖 I need to access the evaluation results to determine the success rate and analyze the results for each attempt. I will start by selecting the 'evaluate' tool for the current workspace. Then I will need to know the operations available for the evaluate tool.
│ ┃ 0.9s ◆ Input Tokens: 423 ◆ Output Tokens: 59
│
│🤖 selectWorkspaceTools(
│ │ │ tools: ["evaluate"]
│ │ │ object: ✔ LLM.workspace: Workspace! 0.6s
│ │ ): String! 0.0s
│
│🤖 I have selected the 'evaluate' tool. Now I need to run the evaluation using the provided 'eval' and 'model' names. I will use the  Workspace_evaluate  function.
│ ┃ 0.8s ◆ Input Tokens: 530 ◆ Output Tokens: 49
│
│ ✘ Workspace.evaluate(eval: "$eval", model: "$model"): String! 0.5s
│ ! unknown evaluation: $eval
│
│🤖 It seems like the evaluation name  $eval  is not recognized. I need to find the correct evaluation name. I will use the  evalNames  tool to list the available evaluations.
│ ┃ 0.7s ◆ Input Tokens: 595 ◆ Output Tokens: 49

the code:

    return m.llm().
        WithPromptVar("eval", eval).
        WithPromptVar("model", model).
        WithPrompt(`You are a QA engineer running an LLM eval against a model`).
        WithPrompt(`Evaluate $eval against $model, tell me the success rate, and analyze the results for every attempt.`).
        Workspace().
        Findings(ctx)

It tried to use variables directly as arguments to functions even though I expanded them in my prompt. I guess the readVariable tool is making those vars a little too discoverable now?

cc @pliant elk this might be a general risk, it's the sort of thing that makes me worried about a toolset that has extra dimensions to it. We could expand vars in arguments like this, but it feels a little wishy-washy, what if it's legitimately trying to pass strings around that contain unexpanded variables? For example, calling withEnvVariable with expand: true - it would want those expanded in the container context, not the LLM context

also I guess we can keep the separate threads for the separate approaches, so we don't cross streams

#

actually, maybe withPromptVar variables just shouldn't be exposed via readVariable, they should literally just be for prompt expansion
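
A rough sketch of "prompt expansion only", assuming variables are substituted into the prompt text and never into tool-call arguments. `expandPrompt` is a hypothetical helper built on the standard library's `os.Expand`; it is not how Dagger implements `WithPromptVar`:

```go
package main

import (
	"fmt"
	"os"
)

// expandPrompt substitutes $name references in prompt text only.
// Unknown names are left in place instead of being replaced with "".
func expandPrompt(prompt string, vars map[string]string) string {
	return os.Expand(prompt, func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$" + name
	})
}

func main() {
	vars := map[string]string{"eval": "BuildMulti", "model": "gpt-4o"}
	fmt.Println(expandPrompt("Evaluate $eval against $model.", vars))

	// Tool-call arguments are deliberately NOT passed through expandPrompt,
	// so a string meant for expansion inside the container stays untouched.
	arg := "$PATH:/app/bin"
	fmt.Println(arg)
}
```

Keeping expansion out of the argument path avoids the ambiguity raised above: a `$name` in an argument always means "expand this in the container", never "expand this in the LLM context".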

pliant elk Mar 27, 2025, 4:25 PM

#

I'm not caught up on the current set of tools. What is readVariable?

#

You're like 10 rev iterations ahead of all of us

timid barn Mar 27, 2025, 4:26 PM

#

that was to support passing string vars to the LLM without having to directly interpolate them into the prompt

#

so it's a tool that lets you get the value for one, and its description lists the available vars, so it can discover them

pliant elk Mar 27, 2025, 4:28 PM

#

If we take the "shadowing" approach, where variables are exposed like regular functions, wouldn't each string variable just get its own tool?

timid barn Mar 27, 2025, 4:30 PM

#

I can try that instead - was trying to avoid having too many ways for the # of tool slots to be consumed, but I think that would mitigate this yeah

pliant elk Mar 27, 2025, 4:30 PM

#

maybe i'm missing context, but in the snippet above, it just looks like the LLM decides out of nowhere to use $foo and $bar for no apparent reason.

#

is it because of hints we're giving it behind the scenes?

timid barn Mar 27, 2025, 4:30 PM

#

yep, the readVariable description lists them, so it saw them there

#

i think if they were tools, it wouldn't, but we still have to be extra careful to not have it think of them like variables

#

i suspect it would make the same mistake if so

#

(especially if we use $foo syntax)

pliant elk Mar 27, 2025, 4:33 PM

#

yeah basically - erase the concept of variables from the BBI entirely - right?

timid barn Mar 27, 2025, 4:33 PM

#

yeah

timid barn Mar 27, 2025, 5:07 PM

#

😬 I'm seeing gpt-4o still treat these like variables with this framing:

    {
      "function": {
        "name": "read_myContent",
        "description": "Read the `myContent` value provided to you.",
        "parameters": {
          "additionalProperties": false,
          "properties": {},
          "required": [],
          "strict": true,
          "type": "object"
        }
      },
      "type": "function"
    },

=> later it tries to pass it in by name

    {
      "content": "",
      "tool_calls": [
        {
          "id": "call_RCaGGpj31JeDNtdGwPeIL2JM",
          "function": {
            "arguments": "{\"path\":\"/weird.txt\",\"source\":\"myContent\"}",
            "name": "Directory_withFile"
          },
          "type": "function"
        }
      ],
      "role": "assistant"
    },

and that was after it even called read_myContent to get the value

even when it does the right thing, it sometimes butchers the content:

│ │ ┃         Error:          Not equal:
│ │ ┃                         expected: "-$@!&* BEGIN WEIRD FILE -$@!&*\nim some fun content\n---- END WEIRD FILE----"
│ │ ┃                         actual  : "-$@!\x7f!&* BEGIN WEIRD FILE -$@!\x7f!&*\nim some fun content\n---- END WEIRD FILE----"
│ │ ┃
│ │ ┃                         Diff:
│ │ ┃                         --- Expected
│ │ ┃                         +++ Actual
│ │ ┃                         @@ -1,2 +1,2 @@
│ │ ┃                         --$@!&* BEGIN WEIRD FILE -$@!&*
│ │ ┃                         +-$@!!&* BEGIN WEIRD FILE -$@!!&*
│ │ ┃                          im some fun content
│ │ ┃         Test:           eval

trace: https://v3.dagger.cloud/dagger/traces/f18637aed198046b9359fca890b9e12d (note that there's an error response which still mentions vars, but I don't think that influenced it, the wheels fell off the wagon by then)

pliant elk Mar 27, 2025, 5:07 PM

#

well there's still the read_

timid barn Mar 27, 2025, 5:08 PM

#

how would you frame it?

pliant elk Mar 27, 2025, 5:08 PM

#

doesn't that tool talk about variables in the description?

timid barn Mar 27, 2025, 5:08 PM

#

it's kind of hard to escape the framing entirely - the tool's sole purpose is to read a named value

#

i'll try anyway

pliant elk Mar 27, 2025, 5:09 PM

#

I would expect the tool to be called myContent and the description to be something you can pass with the binding. Plus maybe a prefix like "returns a string"

timid barn Mar 27, 2025, 5:09 PM

#

ah was thinking about that but didn't know how you'd do that in .denv

pliant elk Mar 27, 2025, 5:09 PM

#

So from the LLM's point of view:

// Returns a string
myContent()
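
For illustration, here is one way that framing could be turned into a tool definition shaped like the JSON posted earlier in the thread. The types, the `bindingTool` helper, and the example description are all hypothetical; this is a sketch of the idea, not the schema Dagger actually emits:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// function and tool mirror the shape of the tool JSON shown earlier.
type function struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	Parameters  map[string]any `json:"parameters"`
}

type tool struct {
	Type     string   `json:"type"`
	Function function `json:"function"`
}

// bindingTool (hypothetical) exposes a string binding as a zero-argument
// tool named after the binding, with no "read_" prefix and no mention of
// variables in the description.
func bindingTool(name, description string) tool {
	return tool{
		Type: "function",
		Function: function{
			Name:        name,
			Description: "Returns a string. " + description,
			Parameters: map[string]any{
				"type":                 "object",
				"properties":           map[string]any{},
				"required":             []string{},
				"additionalProperties": false,
			},
		},
	}
}

func main() {
	// The description text is made up for this example.
	def := bindingTool("myContent", "The content to write to the file.")
	out, err := json.MarshalIndent(def, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```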

timid barn Mar 27, 2025, 5:10 PM

#

i'm also getting a little worried about the LLM's ability to even do this, considering the diff above, it seems to frequently fail at accurately reproducing inputs. I've seen it make more subtle mistakes too, like turning even just gpt-4o into gpt-4

#

but, soldiering on

timid barn Mar 27, 2025, 5:36 PM

#

saw it still try to do wild & crazy things with a simple iteration without descriptions, working on adding required descriptions to all objects + variables now to see if it helps
https://v3.dagger.cloud/dagger/traces/ed479b8e9b7ccdae42c50d3bb8756cb9?listen=f1a4375b37e56626&listen=edb2c8da3c537b43
(this was with just "myContent: Returns a string")

#

🤖 The evaluation ReadImplicitVars against model gpt-4o was successful in all 3 attempts, resulting in a 100% success rate. The model correctly wrote the content to the specified file in the directory in each attempt.
there is hope, yet!
https://dagger.cloud/dagger/traces/aa5fd10f6cb83ed5c230f6c1f7c5146c

timid barn Mar 27, 2025, 9:22 PM

#

have not had much luck with other models yet. GPT-4o figures it out pretty consistently (8/9), but Claude 3.7 (1/9) and Gemini (3/10) struggle.

I really like the idea of making sure each binding has a description, regardless, the UX feels better just from having them. But 1) I've still seen it try passing those names around as arguments, and 2) I'm a little worried that depending on what people name things, it'll confuse the model - the descriptions are extremely load-bearing, and in the past I've learned models don't consistently apply very strong weighting to them. For example, I saw the outer model (the one that runs + analyzes my evals) try calling eval, which is just the var that has the evaluation name, instead of actually running the evaluations. That was cleared up by adding a description, but it's a sign of some of the risk that can come with putting so many arbitrarily named tools in the tool namespace

#

I'll undo function masking and see how it does without it, since that was already causing some shaky behavior

#

@pliant elk if you're working in the same area (tool calling scheme / environments / etc) maybe I should just context switch to other things for a bit? And maybe I'll just remove readVariable for now to keep the tool scheme focused?

not sure how much we're overlapping atm

pliant elk Mar 27, 2025, 9:46 PM

#

timid barn @pliant elk if you're working in the same area (tool calling scheme / ...

I can adapt to your priorities honestly. I'm trying to make my Environment API branch the most superficial possible diff from your branch. But haven't made tons of progress today, got distracted by other things.

Narrowing the tool scheme focus seems like a good idea

#

Going to try to make my environment API branch mergeable by tonight - but I don't want to constrain what you work on

timid barn Mar 27, 2025, 9:51 PM

#

I think it's a good time to context switch for a bit anyway, and once the environments API is taking shape I'll have a better idea of what to do when I get back to the tool calling scheme

#

I'll remove readVariable and any other things that seem like halfway measures

#

things I'm thinking to context switch to:

picking up lifeAlert
being able to press a key to splice into a message loop (interject) - like lifeAlert but in the other direction
using -i to auto-interject into any message loop that ends without returning the desired value
retry logic (on rate limits / overloaded)

pliant elk Mar 27, 2025, 9:55 PM

#

Can we add "replace .llm with variables" to the list? 😛

timid barn Mar 27, 2025, 9:55 PM

#

oh right

#

any preferred prio? ("now" is fine lol)

pliant elk Mar 27, 2025, 9:56 PM

#

All of those seem great. I would prioritize UX changes over implementation changes (even painful ones like rate limits) because there's less penalty to changing them later

timid barn Mar 27, 2025, 9:56 PM

#

yeah i left that one at the bottom since it's more easily delegateable

timid barn Mar 27, 2025, 10:51 PM

#

another item for the list: optimizing v3 to send logs over one connection, instead of one per span, since right now larger LLM traces can absolutely kill the UI

timid barn Mar 27, 2025, 11:59 PM

#

@pliant elk here's the variables cleanup pr: https://github.com/dagger/dagger/pull/9993

pliant elk Mar 28, 2025, 12:10 AM

#

Ha ha I was just working on that part 🙂 Easy manual rebase, thanks

timid barn Mar 28, 2025, 12:10 AM

#

testing it was pretty fun: https://v3.dagger.cloud/dagger/traces/7b3cec315c93c12e3efb07f3a8cd2de1

#

the dream would be running that in CI and having it post a github review/comment if the PR touches files with llm in the path

#

need to set up secrets for that

#

...and maybe think about the cost implications
