#debugging workspace
1 messages ยท Page 1 of 1 (latest)
in this run the agent perfectly identified the error and rewrote the code and tests including the fixtures (line 1210 of step 4): https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13655010596/job/38171940189#step:4:1210
but then when it called the test tool to confirm, the tests failed. looking at the log, it seems like the changes made by the agent didn't stick, because the tests are still running with the original fixtures instead of the rewritten ones (line 1608 of step 4): https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13655010596/job/38171940189#step:4:1620
any suggestions to figure out what's going on?
Looks like it could have gotten confused because it called last_exec_output after the failed test. I can't tell if the files are changed because the file is so long, but based on the workspace code it looks ok
Trying to understand the goal - the tests are broken on main and the agent is supposed to identify the test failure and fix it? What is broken with the tests?
This was also my guess
I'm afk but I was going to try to remove the last_exec tool to see if that helped
cool yeah it probably doesn't need exec at all since you just need test to pass and that only requires modifying a file
The goal is that the agent runs when a PR is opened and the test pipeline fails. It then identifies the error and updates the relevant files, tests again, keeps going until the tests pass. Eventually the idea is to have the agent write the passing code back to the same pr (not got there yet).
got it. what is the correct solution to the broken tests? so I know what to look for in the diffs
The PR adds a new field to the api (i think called genre)
The correct solution is to update two files models.py and test_main.py to account for the new field
models.py has the api request and response data model, as well as the sql model. The PR only updates the sql model, so the agent updates the others
test_main.py has unit tests for the repository and client tests, so the agent updates both those to assert the new field
Then runs tests again and if passing writes the code back to the PR
Alternatively it could post a comment on the PR explaining the issue and the solution
Got it. Well in that case the diff actually looks correct, but the test failures do not look related to that change. Should the tests be passing with just that change? Specifically check against the tool versions in your workspace.test
I believe they should pass but let me check once again
it looks like the write_file method is causing the issue. see this trace: https://v3.dagger.cloud/dagger/traces/04a9b50c7508e1059574180fdbb70439?listen=e3918b1d0811f68d. the agent is updating the test_main.py file and modifying the fixtures to reflect the new field, at the same time blowing up everything else in the file. then in the next step, since all the tests are deleted, the test runner errors.
I was looking at this diff: https://v3.dagger.cloud/dagger/traces/f4645e41d7b11c18b8f68cf248d3c00a?span=081c5b5588dbddd3
which looks right. But yes that can also be a thing models do sometimes. In my experience when this happens the test will fail and it will try to figure it out by trying something different (like just modifying the existing file). What you really want, especially in a context this big (yes it's a small project but you're feeding the llm several large files) is something like a partial write or diff write. I haven't actually tried this yet but it's necessary for a more advanced workspace. See this thread for more conversation on that #1345613543256490064 message
here's an example of the agent getting it right when I tried it again - https://v3.dagger.cloud/dagger/traces/2d5ca2522e94c6dd2631238a0e438ee9
now if only it would do this ^^ every time
Yeah ๐ That's the "narrowing scope" part of improving consistency. I do think a "diff write" would be a big improvment given the size of the files in this demo. I can try to implement it in my workspace today and link it to you for tomorrow
I don't know if the scope is narrowed though. I just tried it again, just changing the name of the field. Trace at https://v3.dagger.cloud/dagger/traces/7f457560345f4ed964de53674efaf832. Once again it identified the changes correctly but the tests failed. There's no output explaining why so the agent doesn't know what to do next. Is this related to the tool, the model or the prompt? It's odd that each run is a snowflake when the challenge and the prompt is the same.
It's odd that each run is a snowflake when the challenge and the prompt is the same.
That's LLMs working as intended ๐ If they were fully deterministic, they wouldn't be useful
There's no output explaining why so the agent doesn't know what to do next. Is this related
This is why in theexecit uses returntype.ANY and checks the exit code, then raises a detailed error: https://github.com/vikram-dagger/fastapi-sample-app/blob/main/.dagger/workspace/src/workspace/main.py#L105-L108
In test you could do that, and if exit code is 0 then return stdout()
I don't know if the scope is narrowed though
I was referring to creating adiff_writetype of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success
In test you could do that, and if exit code is 0 then return stdout()
I made this change but it doesn't seem to make any difference? Not sure if I did it wrong but it looks like the tests aren't running or running and the output is being dropped. https://github.com/vikram-dagger/fastapi-sample-app/blob/1f2f4c3684fc6460f3eb1351e267025cd13d77f6/.dagger/workspace/src/workspace/main.py#L69 and trace at https://dagger.cloud/dagger/traces/c75062bbf04dbf6cca04004626793097
I was referring to creating a diff_write type of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success
Yes, makes sense
oh yeah on test don't return self, return the stdout / stderr. You can discard ctr
@function
async def test(
self
) -> str:
postgresdb = (
dag.container()
.from_("postgres:alpine")
.with_env_variable("POSTGRES_DB", "app_test")
.with_env_variable("POSTGRES_PASSWORD", "secret")
.with_exposed_port(5432)
.as_service(args=[], use_entrypoint=True)
)
cmd = (
self.ctr
.with_service_binding("db", postgresdb)
.with_env_variable("DATABASE_URL", "postgresql://postgres:secret@db/app_test")
.with_exec(["sh", "-c", "pytest"], expect=ReturnType.ANY)
)
if await cmd.exit_code() != 0:
raise Exception(f"Tests failed. \nError: {await cmd.stderr()}")
return await cmd.stdout()
sure, in #911305510882513037?
I've also updated the agent to post a comment on the PR with a summary of changes it made, but although the comment tool works standalone, the agent doesn't seem to use it
yes
Hey, I tried the approach you showed me but I got part of it wrong because I see 75 : ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.5s] 1 : ! exit code 1 1 : [10.4s] | Error: input: agent.diagnose 'coroutine' object has no attribute 'llm' 1 : [10.4s] |
Can you please have a quick peek at these lines? https://github.com/vikram-dagger/fastapi-sample-app/blob/c768e95a16ceef7c1bfea0a756bb90883dc6123a/.dagger/src/agent/main.py#L92
oh wait, it might be a simple syntax error, llm instead of llm()
nope, same error: diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN 75 : โ Agent.diagnose ERROR [1.5s] 75 : โ ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.8s] 1 : ! exit code 1
sounds like somethng that was awaited didn't need to be? or something needs to be awaited that isn't?
oh wait
await after.llm.with_prompt("summarize your changes").last_reply()
should be
await after.with_prompt("summarize your changes").last_reply()
I'm getting closer
I'm having trouble getting a secret from the workflow into the workspace
can you please have a quick look Kyle?
line 4453 of https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13673892583/job/38229984604 has the error
ok I have that fixed now
and I now have the agent making changes and posting a comment on the PR
i have some weirdness in my agent loop
in this run https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13680959484/job/38252948712#step:4:1553 - line 1567 - the agent seems to have correctly identified and fixed the two issues in the code - tests are passing and the progress report makes sense. But instead of stopping there, it continues working and messes up the fixed code, eventually posting a message that makes it seem unsuccessful
any ideas why this is happening?
this is actually clearer in the trace https://v3.dagger.cloud/dagger/traces/1fa83b5a1c7df60bdf467b41334d6200
sometimes llms do that. I'd say just more prompt tweaking is usually the answer. Maybe add something like "when the tests pass, you are done."
ok let me try that
Oh also maybe add in summary = await after.with_prompt something like "Do not make any more changes to the workspace"
nice, trying
yeah i think that should do it. It seems like that's where it's going off the rails
So in this case https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13683853883/job/38262449607#step:4:2521 - line 2521 - it identifies the error as a wrong import, fixes it and runs the tests. But then the test runner fails without any output https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13683853883/job/38262449607#step:4:2602 - line 2599.
But we already updated the code to make sure that the test tool always returns stderr or stdout, in this case it isn't returning anything
The 0.0 is sus, maybe it was a cached failure?
this is where we catch the exit code https://github.com/vikram-dagger/fastapi-sample-app/blob/c0dd0b1a516b00ab46b3054e813e4faf1e31ebe2/.dagger/workspace/src/workspace/main.py#L74
hmm could be, should I put a cachebuster into the test tool?
or if exit code is not 0, should we return both stdout + stderr, assuming that one of them has something useful to report
I don't know if that's necessary. If the test failure was cached it means the test code should fail tests. I think the function return should be still getting the stderr/stdout from that even if it was cached, it's just deduped from the logs
ok so maybe return both stdout + stderr instead of just stderr?
it's weird that the test is failing with no output, this doesn't happen if you run that fn manually
That's what I mean, I think the caller should still be getting the cached response, it's just not logged AFAIK cc @tacit bane
the "do not make more changes to the workspace" seems to fail hard ๐
it's solved the problem completely here but then says no changes were made: https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13684650154/job/38265111635
maybe that needs to be reworded or sth
so i made 3 changes:
- reworded the final prompt
- added cachebuster to test tool
- added both stdout and stderr to the output when test tool fails
so far so good - i tried it again this time introducing two errors into the code and it was able to resolve: https://github.com/vikram-dagger/fastapi-sample-app/pull/46