Eval: go-programm-qa | Dagger | Page 1

rapid gull Feb 27, 2025, 10:34 PM

#

🧵

#

(cc @daring snow in case you're interested)

#

Issues in the first run:

#

1. Regression in the withExec error propagation: the agent is not getting stderr anymore

#

2. Model is way more verbose than before. Might be a model upgrade, maybe I'm getting gpt 4.5? Note to self: pin model version 🙂

#

3. There's definitely something wrong on the "qa" part: prompts and model replies don't appear. I see "naked" container.withExec calls but I don't know where they're coming from (I gave a high-level Workspace object). This one could be a mix of TUI + regression in my module

hardy relic Feb 27, 2025, 10:37 PM

#

rapid gull 2. Model is way more verbose than before. Might be a model upgrade, maybe I'm ge...

works locally but there may be other paths that don't collect stdout/stderr

#

also it'd be really helpful to link to the traces for these

rapid gull Feb 27, 2025, 10:38 PM

#

Run 1:

--> The QA agent actually got it wrong, keeps making dumb Container.withExec calls and not recovering

(asciinema.org recording: "untitled", recorded by shykes)

#

Since BBI itself is still in active development, and it's not an immutable truth that tool calls & dagger function calls will map so perfectly 1-1 (even now it's not 100% perfect mapping, there is the flattening trick to support chaining etc) -> I think keeping the 🤖💻 span would be useful

hardy relic Feb 27, 2025, 10:41 PM

#

i think that'll hinder readability by breaking the illusion of chaining, which is kind of Dagger's bread and butter

#

since it's passthrough it'll always reveal whatever spans it ran, even if it's not 1:1

#

but, keeping it on the radar anyway. unfortunately bringing back the "thinking" phase also breaks chaining, since in reality there's an API roundtrip in between all those

#

maybe with a different BBI it'd be submitting fully chained calls instead of doing one at a time

rapid gull Feb 27, 2025, 10:48 PM

#

Oh I see, the issue is the auto-chaining?

#

Ha ha that kind of brings back the topic of function / artifact views 🙂 That's a problem specific to the trace view (not to sidetrack - obviously we need all this to work great with the trace view)

hardy relic Feb 27, 2025, 10:49 PM

#

yeah, the fact that they chain in the UI right now is dependent on those spans appearing directly adjacent to one another, which currently works because of the passthrough trick - if we wrap them in another non-passthrough span, that'll go away

rapid gull Feb 27, 2025, 10:49 PM

#

Is there a quick fix for the stderr pass through?

#

Or a repro?

#

I have a demo in 10 min

#

(just realized)

rapid gull Feb 27, 2025, 10:51 PM

#

rapid gull 3. There's definitely something wrong on the "qa" part: prompts and model replie...

correction: I do give a raw Container to the QA bot, so it makes sense that it calls it

#

Observation: all the spans are there in web UI. seems like a TUI issue that some spans are not visible

hardy relic Feb 27, 2025, 10:53 PM

#

rapid gull Observation: all the spans are there in web UI. seems like a TUI issue that some...

ah right, still need to figure that one out - still a bit of a mystery. it's reproducible at least. will get on that soon

hardy relic Feb 27, 2025, 10:55 PM

#

rapid gull Is there a quick fix for the stderr pass through?

nothing quick, sorry - it might be the case that it doesn't work when a module does the container sync, and only works if the model directly calls Container.withExec.[something]. nothing should have regressed there, I can see it working with this repro:

llm | with-container $(container | from alpine | with-exec sh --stdin 'echo sdfsdfsdfsdf; exit 1') | with-prompt "you are in the context of a container that printed a message and exited 1. what was printed to stdout?" | loop | with-prompt "show me the raw tool call result that you read that from" | loop

rapid gull Feb 27, 2025, 11:00 PM

#

@hardy relic repro:

.cd github.com/dagger/agents/toy-programmer/toy-workspace
llm | with-toy-workspace $(write main.go 'wrong') | with-prompt "call the build tool and report back with exactly the result you received" | last-reply

#

https://dagger.cloud/dagger/traces/b3eea197d89c2199ab427a9a83fa9090

#

@daring snow I have to join my call (swyx...) but in case you're still around: could you push a workaround to toy-workspace/main.go? basically add expect: any to withExec in build() so that stderr is in the error? 🙏

#

otherwise my demo will bomb

#

(I will work around it)
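
A minimal sketch of the idea behind the workaround requested above, written in plain Go with `os/exec` rather than the Dagger SDK, so it is not the code in toy-workspace/main.go. It treats a non-zero exit as data instead of an opaque failure, captures stderr, and folds it into the returned error so the caller (here, the agent) can read the compiler output. The names `Result`, `runAny`, and `build` are hypothetical, and the sketch assumes a POSIX `sh` is on the PATH.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// Result holds everything a caller needs to diagnose a run.
type Result struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runAny runs a command and accepts any exit code: a non-zero exit is
// reported in Result, not as an error. Only failures to start or wait
// for the process are returned as errors.
func runAny(name string, args ...string) (Result, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	res := Result{Stdout: stdout.String(), Stderr: stderr.String()}

	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		res.ExitCode = exitErr.ExitCode()
		return res, nil
	}
	return res, err
}

// build fails on a non-zero exit, but puts stderr into the error message
// so the reason for the failure is not lost.
func build(name string, args ...string) (string, error) {
	res, err := runAny(name, args...)
	if err != nil {
		return "", fmt.Errorf("could not run build: %w", err)
	}
	if res.ExitCode != 0 {
		return "", fmt.Errorf("build failed (exit %d):\n%s", res.ExitCode, res.Stderr)
	}
	return res.Stdout, nil
}

func main() {
	// Stand-in for a failing compile step.
	_, err := build("sh", "-c", "echo 'main.go:1:1: syntax error' >&2; exit 1")
	fmt.Println(err)
}
```

The design choice is the same as in the request: the step still reports failure, but the error text carries stderr instead of dropping it.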

daring snow Feb 27, 2025, 11:03 PM

#

rapid gull @daring snow I have to join my call (swyx...) but in case you're still ...

on it

daring snow Feb 27, 2025, 11:08 PM

#

rapid gull @daring snow I have to join my call (swyx...) but in case you're still ...

https://github.com/dagger/agents/pull/15