#function masks
ah dang Claude 3.5 has a bit of trouble - let's see if I can make some tweaks:
│ │ │🤖 Now we have successfully:
│ │ │ │
│ │ │ │ 1. Mounted Directory#1 at /src
│ │ │ │ 2. Set /src as the working directory
│ │ │ │ 3. Set CGO_ENABLED=0
│ │ │ │ 4. Built the binary
│ │ │ │
│ │ │ │ However, I notice that the available tools don't include a way to directly return a File from the container. You might need to use additional tools or specify how you'd like to access the resulting binary. The binary should be built as "booklit" in the working directory, but we'll need additional instructions on how you'd like to access or export it.
│ │ │ │
│ │ │ │ Would you like to specify how you'd like to access or export the resulting binary file?
There's just a new required functions: []string arg for the existing select<Type> tool, and the model passes in the ones it wants, from the list of functions in the tool description.
So it kind of becomes selectFooTools (hey maybe I'll try that)
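A rough sketch of the masking idea described above, for anyone following along. The names here (`toolMask`, `selectTools`, `allowed`) are made up for illustration and are not the actual Dagger implementation; the only thing taken from the thread is that the select tool gets a required list of function names, chosen from the functions listed in its description, and only those become callable.

```go
package main

import (
	"fmt"
	"sort"
)

// toolMask is a hypothetical stand-in for a per-type function mask:
// available is what the select<Type> tool description lists,
// enabled is what the model has asked for so far.
type toolMask struct {
	available map[string]bool
	enabled   map[string]bool
}

func newToolMask(available ...string) *toolMask {
	m := &toolMask{available: map[string]bool{}, enabled: map[string]bool{}}
	for _, name := range available {
		m.available[name] = true
	}
	return m
}

// selectTools mirrors the required `functions: []string` argument. It rejects
// an empty list and unknown names; otherwise it enables the requested
// functions on top of anything enabled earlier.
func (m *toolMask) selectTools(functions []string) error {
	if len(functions) == 0 {
		return fmt.Errorf("functions is required")
	}
	for _, name := range functions {
		if !m.available[name] {
			return fmt.Errorf("unknown function %q", name)
		}
	}
	for _, name := range functions {
		m.enabled[name] = true
	}
	return nil
}

// allowed reports whether a function has been selected and may be called.
func (m *toolMask) allowed(name string) bool { return m.enabled[name] }

// enabledNames returns the selected functions in sorted order.
func (m *toolMask) enabledNames() []string {
	names := make([]string, 0, len(m.enabled))
	for name := range m.enabled {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	mask := newToolMask("withMountedDirectory", "withWorkdir", "withExec", "directory", "file")
	if err := mask.selectTools([]string{"withExec", "directory"}); err != nil {
		fmt.Println("select failed:", err)
		return
	}
	fmt.Println(mask.enabledNames())  // [directory withExec]
	fmt.Println(mask.allowed("file")) // false: never selected
}
```

The failure mode in this thread is the last line: the model needs `file`, but nothing forces it to go back and select it.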
damn, no cigar
🤖 The binary has been built. Unfortunately, I don't see a direct tool available to return the binary as a File object. I'd need additional tools like file or similar to retrieve the built binary. Could you please provide the necessary tool to access the built binary file from the container?
man, it's sooo close but i'm seeing some other subtle misbehavior resulting from it, i fear the problem may run deep:
│ │ │🤖 selectContainerTools(
│ │ │ │ │ tools: ["withMountedDirectory", "withWorkdir", "withExec", "directory"]
│ │ │ │ │ object: ✔ Container.withEnvVariable(name: "BUSTER", value: "8-2025-03-27 04:24:48.203165361 +0000 UTC m=+0.178601237"): Container! 0.0s
│ │ │ │ ): String! 0.0s
│ │ │
│ │ │🤖 0.5s │ Input Tokens: 1,377 │ Output Tokens: 12
│ │ │
│ │ │ $ Container.withMountedDirectory(
│ │ │ │ │ path: "/app"
│ │ │ │ │ source: $ GitRef.tree: Directory! 0.1s CACHED
│ │ │ │ ): Container! 0.1s CACHED
│ │ │
│ │ │🤖 0.5s │ Input Tokens: 1,410 │ Output Tokens: 8
│ │ │
│ │ │ ✔ Container.withWorkdir(path: "/app"): Container! 0.0s
│ │ │
│ │ │🤖 0.6s │ Input Tokens: 1,439 │ Output Tokens: 23
│ │ │
│ │ │ ✔ Container.withExec(args: ["go", "build", "-o", "booklit", "./cmd/booklit"]): Container! 2.9s
│ │ │
│ │ │🤖 0.6s │ Input Tokens: 1,482 │ Output Tokens: 9
│ │ │
│ │ │ ✘ Container.directory(path: "/app/booklit"): Directory! 0.0s
│ │ │ ! path /app/booklit is a file, not a directory
│ │ │
│ │ │🤖 My apologies. It seems I made a mistake and tried to retrieve the compiled binary as a directory. I should have retrieved it as a file instead. However, there is no tool to directly retrieve a file. I will have to retrieve the parent directory.
│ │ │ │ 1.0s │ Input Tokens: 1,516 │ Output Tokens: 58
│ │ │
│ │ │ ✔ Container.directory(path: "/app"): Directory! 0.0s
│ │ │
│ │ │🤖 Now that I have the directory containing the binary, I need to select the right tools to work with it.
│ │ │ │ 0.7s │ Input Tokens: 951 │ Output Tokens: 32
│ │ │
│ │ │🤖 selectDirectoryTools(
│ │ │ │ │ tools: ["file"]
│ │ │ │ │ object: ✔ Container.directory(path: "/app"): Directory! 0.0s
│ │ │ │ ): String! 0.0s
│ │ │
│ │ │🤖 0.4s │ Input Tokens: 1,031 │ Output Tokens: 6
│ │ │
│ │ │ ✔ Directory.file(path: "booklit"): File! 0.0s
basically, it:
- grabbed Container.directory (not sure why actually)
- later realized it needed to grab a file
- instead of enabling Container.file, it just YOLO'd and ran Container.directory
- eventually got there in the end (via Directory.file), but that seems a bit lucky
(though I'm making a lot of inferences here, in reality the train might have gone off the rails at step 1)
Tried a few tricks with injecting prompts into tool descriptions, but not quite there yet. Ah well. Will keep tinkering tomorrow but this probably can't safely make the release.
Pushed here if anyone wants to try some things: https://github.com/dagger/dagger/pull/9986
Here's how I'm testing it: dagger-dev -m github.com/vito/daggerverse/botsbuildingbots --model gemini-2.0-flash call --attempts 10 evaluate --model gemini-2.0-flash --eval BuildMulti
Another bizarre mistake:
│🧑 Evaluate BuildMulti against gpt-4o, tell me the success rate, and analyze the results for every attempt.
│ │ 0.0s
│
│🤖 I need to access the evaluation results to determine the success rate and analyze the results for each attempt. I will start by selecting the 'evaluate' tool for the current workspace. Then I will need to know the operations available for the evaluate tool.
│ │ 0.9s │ Input Tokens: 423 │ Output Tokens: 59
│
│🤖 selectWorkspaceTools(
│ │ │ tools: ["evaluate"]
│ │ │ object: ✔ LLM.workspace: Workspace! 0.6s
│ │ ): String! 0.0s
│
│🤖 I have selected the 'evaluate' tool. Now I need to run the evaluation using the provided 'eval' and 'model' names. I will use the Workspace_evaluate function.
│ │ 0.8s │ Input Tokens: 530 │ Output Tokens: 49
│
│ ✘ Workspace.evaluate(eval: "$eval", model: "$model"): String! 0.5s
│ ! unknown evaluation: $eval
│
│🤖 It seems like the evaluation name $eval is not recognized. I need to find the correct evaluation name. I will use the evalNames tool to list the available evaluations.
│ │ 0.7s │ Input Tokens: 595 │ Output Tokens: 49
the code:
return m.llm().
	WithPromptVar("eval", eval).
	WithPromptVar("model", model).
	WithPrompt(`You are a QA engineer running an LLM eval against a model`).
	WithPrompt(`Evaluate $eval against $model, tell me the success rate, and analyze the results for every attempt.`).
	Workspace().
	Findings(ctx)
It tried to use variables directly as arguments to functions even though I expanded them in my prompt. I guess the readVariable tool is making those vars a little too discoverable now?
cc @pliant elk this might be a general risk, it's the sort of thing that makes me worried about a toolset that has extra dimensions to it. We could expand vars in arguments like this, but it feels a little wishy-washy, what if it's legitimately trying to pass strings around that contain unexpanded variables? For example, calling withEnvVariable with expand: true - it would want those expanded in the container context, not the LLM context
also I guess we can keep the separate threads for the separate approaches, so we don't cross streams
actually, maybe withPromptVar variables just shouldn't be exposed via readVariable, they should literally just be for prompt expansion
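To make the "prompt expansion only" idea concrete, here is a minimal sketch using Go's standard `os.Expand`. The `expandPrompt` helper and the variable map are hypothetical, not the real withPromptVar implementation; the point is only that expansion happens on the prompt text before the model sees it, and that applying the same expansion to tool-call arguments would rewrite strings that were meant for another context.

```go
package main

import (
	"fmt"
	"os"
)

// expandPrompt substitutes $name references using vars only. Names that are
// not prompt variables are put back as "$name" so they pass through.
// (os.Expand also accepts ${name}; this sketch re-emits those as $name.)
func expandPrompt(prompt string, vars map[string]string) string {
	return os.Expand(prompt, func(name string) string {
		if val, ok := vars[name]; ok {
			return val
		}
		return "$" + name
	})
}

func main() {
	vars := map[string]string{"eval": "BuildMulti", "model": "gpt-4o"}

	// Prompt text: expanded up front, so the model never sees $eval or $model.
	fmt.Println(expandPrompt("Evaluate $eval against $model.", vars))
	// Evaluate BuildMulti against gpt-4o.

	// Unrelated names survive, e.g. something meant for the container.
	fmt.Println(expandPrompt("PATH=/opt/bin:$PATH", vars))
	// PATH=/opt/bin:$PATH

	// The risk with also expanding tool-call arguments: a colliding name
	// gets rewritten even if the container was supposed to expand it.
	fmt.Println(expandPrompt("echo $model", vars))
	// echo gpt-4o
}
```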
I'm not caught up on the current set of tools. What is readVariable?
You're like 10 rev iterations ahead of all of us
that was to support passing string vars to the LLM without having to directly interpolate them into the prompt
so it's a tool that lets you get the value for one, and its description lists the available vars, so it can discover them
If we take the "shadowing" approach, where variables are exposed like regular functions, wouldn't each string variable just get its own tool?
I can try that instead - was trying to avoid having too many ways for the # of tool slots to be consumed, but I think that would mitigate this yeah
maybe i'm missing context, but in the snippet above, it just looks like the LLM decides out of nowhere to use $foo and $bar for no apparent reason.
is it because of hints we're giving it behind the scenes?
yep, the readVariable description lists them, so it saw them there
i think if they were tools, it wouldn't, but we still have to be extra careful to not have it think of them like variables
i suspect it would make the same mistake if so
(especially if we use $foo syntax)
yeah basically - erase the concept of variables from the BBI entirely - right?
yeah
😬 I'm seeing gpt-4o still treat these like variables with this framing:
{
  "function": {
    "name": "read_myContent",
    "description": "Read the `myContent` value provided to you.",
    "parameters": {
      "additionalProperties": false,
      "properties": {},
      "required": [],
      "strict": true,
      "type": "object"
    }
  },
  "type": "function"
},
=> later it tries to pass it in by name
{
  "content": "",
  "tool_calls": [
    {
      "id": "call_RCaGGpj31JeDNtdGwPeIL2JM",
      "function": {
        "arguments": "{\"path\":\"/weird.txt\",\"source\":\"myContent\"}",
        "name": "Directory_withFile"
      },
      "type": "function"
    }
  ],
  "role": "assistant"
},
and that was after it even called read_myContent to get the value
even when it does the right thing, it sometimes butchers the content:
│ │ │ Error: Not equal:
│ │ │ expected: "-$@!&* BEGIN WEIRD FILE -$@!&*\nim some fun content\n---- END WEIRD FILE----"
│ │ │ actual : "-$@!\x7f!&* BEGIN WEIRD FILE -$@!\x7f!&*\nim some fun content\n---- END WEIRD FILE----"
│ │ │
│ │ │ Diff:
│ │ │ --- Expected
│ │ │ +++ Actual
│ │ │ @@ -1,2 +1,2 @@
│ │ │ --$@!&* BEGIN WEIRD FILE -$@!&*
│ │ │ +-$@!!&* BEGIN WEIRD FILE -$@!!&*
│ │ │ im some fun content
│ │ │ Test: eval
trace: https://v3.dagger.cloud/dagger/traces/f18637aed198046b9359fca890b9e12d (note that there's an error response which still mentions vars, but I don't think that influenced it, the wheels fell off the wagon by then)
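Side note on reading that failure: the diff lines look like the model only doubled a "!", but the quoted expected/actual strings show it also inserted a DEL byte (\x7f), which is invisible in the raw diff. A small sketch of how to spot that, using the two strings copied from the test output above; `firstDiff` is just a helper written for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// firstDiff returns the index of the first byte where a and b differ,
// or -1 if they are identical.
func firstDiff(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	expected := "-$@!&* BEGIN WEIRD FILE -$@!&*\nim some fun content\n---- END WEIRD FILE----"
	actual := "-$@!\x7f!&* BEGIN WEIRD FILE -$@!\x7f!&*\nim some fun content\n---- END WEIRD FILE----"

	i := firstDiff(expected, actual)
	fmt.Println("first difference at byte", i) // first difference at byte 4

	// Quoting makes the otherwise invisible DEL byte show up as \x7f.
	fmt.Println("expected:", strconv.Quote(expected[i:i+1]))
	fmt.Println("actual:  ", strconv.Quote(actual[i:i+1]))
}
```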
well there's still the read_
how would you frame it?
doesn't that tool talk about variables in the description?
it's kind of hard to escape the framing entirely - the tool's sole purpose is to read a named value
i'll try anyway
I would expect the tool to be called myContent and the description to be something you can pass with the binding. Plus maybe a prefix like "returns a string"
ah was thinking about that but didn't know how you'd do that in .denv
So from the LLM's point of view:
// Returns a string
myContent()
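Roughly what that could look like on the wire, as a sketch only. The `binding` type and `toolFor` helper are hypothetical; the JSON shape follows the function-tool format pasted earlier in the thread (minus the read_ prefix), and refusing bindings without a description is an assumption based on the suggestion that the description comes with the binding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// binding is a hypothetical named value handed to the model.
type binding struct {
	Name        string
	Description string
	Value       string
}

type toolFunction struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	Parameters  map[string]any `json:"parameters"`
}

type toolDef struct {
	Type     string       `json:"type"`
	Function toolFunction `json:"function"`
}

// toolFor exposes a binding as a zero-argument tool named after the binding
// itself, and refuses bindings that have no description.
func toolFor(b binding) (toolDef, error) {
	if b.Description == "" {
		return toolDef{}, fmt.Errorf("binding %q needs a description", b.Name)
	}
	return toolDef{
		Type: "function",
		Function: toolFunction{
			Name:        b.Name,
			Description: b.Description,
			Parameters: map[string]any{
				"type":                 "object",
				"properties":           map[string]any{},
				"required":             []string{},
				"additionalProperties": false,
			},
		},
	}, nil
}

func main() {
	def, err := toolFor(binding{
		Name:        "myContent",
		Description: "Returns a string",
		Value:       "im some fun content",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	out, _ := json.MarshalIndent(def, "", "  ")
	fmt.Println(string(out))
}
```

Nothing in the definition says "variable" or "read"; whether that is enough to stop the model passing the name around as an argument is exactly what the evals need to show.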
i'm also getting a little worried about the LLM's ability to even do this, considering the diff above, it seems to frequently fail at accurately reproducing inputs. I've seen it make more subtle mistakes too, like turning even just gpt-4o into gpt-4
but, soldiering on
saw it still try to do wild & crazy things with a simple iteration without descriptions, working on adding required descriptions to all objects + variables now to see if it helps
https://v3.dagger.cloud/dagger/traces/ed479b8e9b7ccdae42c50d3bb8756cb9?listen=f1a4375b37e56626&listen=edb2c8da3c537b43
(this was with just "myContent: Returns a string")
🤖 The evaluation ReadImplicitVars against model gpt-4o was successful in all 3 attempts, resulting in a 100% success rate. The model correctly wrote the content to the specified file in the directory in each attempt.
there is hope, yet!
https://dagger.cloud/dagger/traces/aa5fd10f6cb83ed5c230f6c1f7c5146c
have not had much luck with other models yet. GPT-4o figures it out pretty consistently (8/9), but Claude 3.7 (1/9) and Gemini (3/10) struggle.
I really like the idea of making sure each binding has a description, regardless, the UX feels better just from having them. But 1) I've still seen it try passing those names around as arguments, and 2) I'm a little worried that depending on what people name things, it'll confuse the model - the descriptions are extremely load-bearing, and in the past I've learned models don't consistently apply very strong weighting to them. For example, I saw the outer model (the one that runs + analyzes my evals) try calling eval, which is just the var that has the evaluation name, instead of actually running the evaluations. That was cleared up by adding a description, but it's a sign of some of the risk that can come with putting so many arbitrarily named tools in the tool namespace
I'll undo function masking and see how it does without it, since that was already causing some shaky behavior
@pliant elk if you're working in the same area (tool calling scheme / environments / etc) maybe I should just context switch to other things for a bit? And maybe I'll just remove readVariable for now to keep the tool scheme focused?
not sure how much we're overlapping atm
I can adapt to your priorities honestly. I'm trying to make my Environment API branch the most superficial possible diff from your branch. But haven't made tons of progress today, got distracted by other things.
Narrowing the tool scheme focus seems like a good idea
Going to try to make my environment API branch mergeable by tonight - but I don't want to constrain what you work on
I think it's a good time to context switch for a bit anyway, and once the environments API is taking shape I'll have a better idea of what to do when I get back to the tool calling scheme
I'll remove readVariable and any other things that seem like halfway measures
things I'm thinking to context switch to:
- picking up lifeAlert
- being able to press a key to splice into a message loop (interject) - like lifeAlert but in the other direction
- using -i to auto-interject into any message loop that ends without returning the desired value
- retry logic (on rate limits / overloaded)
Can we add "replace .llm with variables" to the list?
All of those seem great. I would prioritize UX changes over implementation changes (even painful ones like rate limits) because there's less penalty to changing them later
yeah i left that one at the bottom since it's more easily delegateable
another item for the list: optimizing v3 to send logs over one connection, instead of one per span, since right now larger LLM traces can absolutely kill the UI
@pliant elk here's the variables cleanup pr: https://github.com/dagger/dagger/pull/9993
Ha ha I was just working on that part. Easy manual rebase, thanks
testing it was pretty fun: https://v3.dagger.cloud/dagger/traces/7b3cec315c93c12e3efb07f3a8cd2de1
the dream would be running that in CI and having it post a github review/comment if the PR touches files with llm in the path
need to set up secrets for that 
...and maybe think about the cost implications