#Eval: go-programm-qa
(cc @daring snow in case you're interested)
Issues in the first run:
- Regression in the withExec error propagation: the agent is not getting stderr anymore
- Model is way more verbose than before. Might be a model upgrade, maybe I'm getting gpt 4.5? Note to self: pin model version
- There's definitely something wrong on the "qa" part: prompts and model replies don't appear. I see "naked" container.withExec but I don't know where they're coming from (I gave a high-level Workspace object). This one could be a mix of TUI + regression in my module
llm | with-container $(container | from alpine | with-exec sh --stdin 'echo sdfsdfsdfsdf; exit 1') | with-prompt "you are in the context of a container that printed a message and exited 1. what was printed to stdout?" | loop | with-prompt "show me the raw tool call result that you read that from" | loop
works locally but there may be other paths that don't collect stdout/stderr 
also it'd be really helpful to link to the traces for these
Run 1:
- https://dagger.cloud/dagger/traces/31a33fae2e632657f5d267f12ee01af5
- https://asciinema.org/a/63bFwbhxruDwJ6vWdlowtGZK6
--> The QA agent actually got it wrong, keeps making dumb Container.withExec calls and not recovering
Since BBI itself is still in active development, and it's not an immutable truth that tool calls & Dagger function calls will map so perfectly 1-1 (even now it's not 100% perfect mapping, there is the flattening trick to support chaining etc) -> I think keeping the [emoji] span would be useful
i think that'll hinder readability by breaking the illusion of chaining, which is kind of Dagger's bread and butter
since it's passthrough it'll always reveal whatever spans it ran, even if it's not 1:1
but, keeping it on the radar anyway. unfortunately bringing back the "thinking" phase also breaks chaining, since in reality there's an API roundtrip in between all those
maybe with a different BBI it'd be submitting fully chained calls instead of doing one at a time
Oh I see, the issue is the auto-chaining?
Ha ha that kind of brings back the topic of function / artifact views. That's a problem specific to the trace view (not to sidetrack - obviously we need all this to work great with the trace view)
yeah, the fact that they chain in the UI right now is dependent on those spans appearing directly adjacent to one another, which currently works because of the passthrough trick - if we wrap them in another non-passthrough span, that'll go away
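(rough sketch of the passthrough point above, in case it helps. This is a toy model, not Dagger's real span data model or UI code: the `span` type, the `Passthrough` field, and the `visible`, `names` and `toolCalls` helpers are all made up for illustration. It only assumes that a passthrough span is hidden and its children are shown in its place.)

```go
package main

import "fmt"

// span is a hypothetical, simplified trace span (not Dagger's actual type).
type span struct {
	Name        string
	Passthrough bool
	Children    []*span
}

// visible returns the spans a viewer would list directly under a parent.
// Passthrough spans are hidden and replaced by their own visible children.
func visible(children []*span) []*span {
	var out []*span
	for _, c := range children {
		if c.Passthrough {
			out = append(out, visible(c.Children)...)
		} else {
			out = append(out, c)
		}
	}
	return out
}

// names extracts span names so the result is easy to print and compare.
func names(spans []*span) []string {
	out := make([]string, 0, len(spans))
	for _, s := range spans {
		out = append(out, s.Name)
	}
	return out
}

// toolCalls builds two tool-call wrapper spans, each holding one API call.
// The passthrough flag decides whether the wrappers are hidden.
func toolCalls(passthrough bool) []*span {
	return []*span{
		{Name: "tool-call-1", Passthrough: passthrough, Children: []*span{{Name: "Container.from"}}},
		{Name: "tool-call-2", Passthrough: passthrough, Children: []*span{{Name: "Container.withExec"}}},
	}
}

func main() {
	// Wrappers hidden: the two API calls end up as adjacent siblings.
	fmt.Println(names(visible(toolCalls(true))))
	// Wrappers shown: each API call is nested under its own wrapper.
	fmt.Println(names(visible(toolCalls(false))))
}
```

With passthrough wrappers, the two calls sit next to each other at the same level, which is what lets a viewer draw them as a chain. With non-passthrough wrappers, the top level only shows the wrappers and the calls are no longer adjacent.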
Is there a quick fix for the stderr pass through?
Or a repro?
I have a demo in 10 min
(just realized)
Correction: I do give a raw Container to the QA bot, so it makes sense that it calls it
Observation: all the spans are there in web UI. seems like a TUI issue that some spans are not visible
ah right, still need to figure that one out - still a bit of a mystery. it's reproducible at least. will get on that soon
nothing quick, sorry - it might be the case that it doesn't work when a module does the container sync, and only works if the model directly calls Container.withExec.[something]. nothing should have regressed there, I can see it working with the repro above:
llm | with-container $(container | from alpine | with-exec sh --stdin 'echo sdfsdfsdfsdf; exit 1') | with-prompt "you are in the context of a container that printed a message and exited 1. what was printed to stdout?" | loop | with-prompt "show me the raw tool call result that you read that from" | loop
@hardy relic repro:
.cd github.com/dagger/agents/toy-programmer/toy-workspace
llm | with-toy-workspace $(write main.go 'wrong') | with-prompt "call the build tool and report back with exactly the result you received" | last-reply
@daring snow I have to join my call (swyx...) but in case you're still around: could you push a workaround to toy-workspace/main.go? basically add expect: any to withExec in build() so that stderr is in the error?
otherwise my demo will bomb
(I will work around it)
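(for reference, a minimal sketch of the idea behind that workaround. It is plain Go with os/exec, not the Dagger SDK and not the actual toy-workspace code: `runResult`, `runAny` and this `build` are hypothetical names, and it assumes `sh` is on the PATH. The point is just: treat a non-zero exit as data instead of failing early, then put stdout/stderr into the error the caller, or the model, gets back.)

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// runResult holds everything the command produced, even when it failed.
type runResult struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runAny runs a command and reports a non-zero exit as a result rather than
// an error, loosely mirroring the intent of "expect: any" (an analogy only).
func runAny(name string, args ...string) (runResult, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	res := runResult{Stdout: stdout.String(), Stderr: stderr.String()}

	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command ran but exited non-zero: keep the output, drop the error.
		res.ExitCode = exitErr.ExitCode()
		return res, nil
	}
	// nil on success, or a genuine failure to start the command.
	return res, err
}

// build is a hypothetical stand-in for the module's build(): on failure it
// returns an error that carries the exit code, stdout and stderr.
func build(script string) (string, error) {
	res, err := runAny("sh", "-c", script)
	if err != nil {
		return "", err
	}
	if res.ExitCode != 0 {
		return "", fmt.Errorf("build failed (exit %d)\nstdout: %s\nstderr: %s",
			res.ExitCode, res.Stdout, res.Stderr)
	}
	return res.Stdout, nil
}

func main() {
	// Simulated failing build: the error message now includes stderr.
	_, err := build("echo compiling; echo 'main.go:1: syntax error' >&2; exit 1")
	fmt.Println(err)
}
```

In the real module, the equivalent would presumably be passing the expect option to withExec and then reading stdout/stderr off the resulting container, but the exact SDK calls are not shown here.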