debugging workspace | Dagger | Page 1

upbeat vigil Mar 4, 2025, 2:04 PM

#

in this run the agent perfectly identified the error and rewrote the code and tests including the fixtures (line 1210 of step 4): https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13655010596/job/38171940189#step:4:1210

#

but then when it called the test tool to confirm, the tests failed. looking at the log, it seems like the changes made by the agent didn't stick, because the tests are still running with the original fixtures instead of the rewritten ones (line 1608 of step 4): https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13655010596/job/38171940189#step:4:1620

#

any suggestions to figure out what's going on?

fallow frigate Mar 4, 2025, 3:28 PM

#

Looks like it could have gotten confused because it called last_exec_output after the failed test. I can't tell if the files are changed because the file is so long, but based on the workspace code it looks ok

#

Trying to understand the goal - the tests are broken on main and the agent is supposed to identify the test failure and fix it? What is broken with the tests?

upbeat vigil Mar 4, 2025, 3:31 PM

#

This was also my guess

#

I'm afk but I was going to try to remove the last_exec tool to see if that helped

fallow frigate Mar 4, 2025, 3:32 PM

#

cool yeah it probably doesn't need exec at all since you just need test to pass and that only requires modifying a file

upbeat vigil Mar 4, 2025, 3:33 PM

#

The goal is that the agent runs when a PR is opened and the test pipeline fails. It then identifies the error and updates the relevant files, tests again, keeps going until the tests pass. Eventually the idea is to have the agent write the passing code back to the same pr (not got there yet).

fallow frigate Mar 4, 2025, 3:35 PM

#

got it. what is the correct solution to the broken tests? so I know what to look for in the diffs

upbeat vigil Mar 4, 2025, 3:38 PM

#

The PR adds a new field to the api (i think called genre)

#

The correct solution is to update two files models.py and test_main.py to account for the new field

#

models.py has the api request and response data model, as well as the sql model. The PR only updates the sql model, so the agent updates the others

#

test_main.py has unit tests for the repository and client tests, so the agent updates both those to assert the new field

#

Then runs tests again and if passing writes the code back to the PR

#

Alternatively it could post a comment on the PR explaining the issue and the solution

fallow frigate Mar 4, 2025, 3:50 PM

#

Got it. Well in that case the diff actually looks correct, but the test failures do not look related to that change. Should the tests be passing with just that change? Specifically check against the tool versions in your workspace.test

upbeat vigil Mar 4, 2025, 3:51 PM

#

I believe they should pass but let me check once again

upbeat vigil Mar 4, 2025, 4:45 PM

#

it looks like the write_file method is causing the issue. see this trace: https://v3.dagger.cloud/dagger/traces/04a9b50c7508e1059574180fdbb70439?listen=e3918b1d0811f68d. the agent is updating the test_main.py file and modifying the fixtures to reflect the new field, at the same time blowing up everything else in the file. then in the next step, since all the tests are deleted, the test runner errors.

fallow frigate Mar 4, 2025, 4:51 PM

#

I was looking at this diff: https://v3.dagger.cloud/dagger/traces/f4645e41d7b11c18b8f68cf248d3c00a?span=081c5b5588dbddd3
which looks right. But yes that can also be a thing models do sometimes. In my experience when this happens the test will fail and it will try to figure it out by trying something different (like just modifying the existing file). What you really want, especially in a context this big (yes it's a small project but you're feeding the llm several large files) is something like a partial write or diff write. I haven't actually tried this yet but it's necessary for a more advanced workspace. See this thread for more conversation on that #1345613543256490064 message

Dagger Cloud

Browse and visualize Dagger traces.

upbeat vigil Mar 4, 2025, 4:52 PM

#

here's an example of the agent getting it right when I tried it again - https://v3.dagger.cloud/dagger/traces/2d5ca2522e94c6dd2631238a0e438ee9

#

now if only it would do this ^^ every time

fallow frigate Mar 4, 2025, 4:55 PM

#

Yeah 🙂 That's the "narrowing scope" part of improving consistency. I do think a "diff write" would be a big improvment given the size of the files in this demo. I can try to implement it in my workspace today and link it to you for tomorrow

upbeat vigil Mar 4, 2025, 6:40 PM

#

I don't know if the scope is narrowed though. I just tried it again, just changing the name of the field. Trace at https://v3.dagger.cloud/dagger/traces/7f457560345f4ed964de53674efaf832. Once again it identified the changes correctly but the tests failed. There's no output explaining why so the agent doesn't know what to do next. Is this related to the tool, the model or the prompt? It's odd that each run is a snowflake when the challenge and the prompt is the same.

fallow frigate Mar 4, 2025, 6:53 PM

#

It's odd that each run is a snowflake when the challenge and the prompt is the same.
That's LLMs working as intended 🙂 If they were fully deterministic, they wouldn't be useful

There's no output explaining why so the agent doesn't know what to do next. Is this related
This is why in the exec it uses returntype.ANY and checks the exit code, then raises a detailed error: https://github.com/vikram-dagger/fastapi-sample-app/blob/main/.dagger/workspace/src/workspace/main.py#L105-L108

In test you could do that, and if exit code is 0 then return stdout()

I don't know if the scope is narrowed though
I was referring to creating a diff_write type of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success

upbeat vigil Mar 4, 2025, 7:51 PM

#

In test you could do that, and if exit code is 0 then return stdout()
I made this change but it doesn't seem to make any difference? Not sure if I did it wrong but it looks like the tests aren't running or running and the output is being dropped. https://github.com/vikram-dagger/fastapi-sample-app/blob/1f2f4c3684fc6460f3eb1351e267025cd13d77f6/.dagger/workspace/src/workspace/main.py#L69 and trace at https://dagger.cloud/dagger/traces/c75062bbf04dbf6cca04004626793097

#

I was referring to creating a diff_write type of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success
Yes, makes sense

fallow frigate Mar 4, 2025, 7:52 PM

#

oh yeah on test don't return self, return the stdout / stderr. You can discard ctr

#

    @function
    async def test(
        self
    ) -> str:
        postgresdb = (
            dag.container()
            .from_("postgres:alpine")
            .with_env_variable("POSTGRES_DB", "app_test")
            .with_env_variable("POSTGRES_PASSWORD", "secret")
            .with_exposed_port(5432)
            .as_service(args=[], use_entrypoint=True)
        )

        cmd = (
            self.ctr
            .with_service_binding("db", postgresdb)
            .with_env_variable("DATABASE_URL", "postgresql://postgres:secret@db/app_test")
            .with_exec(["sh", "-c", "pytest"], expect=ReturnType.ANY)
        )
        if await cmd.exit_code() != 0:
            raise Exception(f"Tests failed. \nError: {await cmd.stderr()}")
        return await cmd.stdout()

upbeat vigil Mar 4, 2025, 7:54 PM

#

ah ok

#

trying

upbeat vigil Mar 4, 2025, 9:31 PM

#

seems to be the same, works on and off

#

do you have some time now to pair on this?

fallow frigate Mar 4, 2025, 9:32 PM

#

sure, in #911305510882513037?

upbeat vigil Mar 4, 2025, 9:33 PM

#

I've also updated the agent to post a comment on the PR with a summary of changes it made, but although the comment tool works standalone, the agent doesn't seem to use it

#

yes

upbeat vigil Mar 4, 2025, 10:10 PM

#

Hey, I tried the approach you showed me but I got part of it wrong because I see 75 : ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.5s] 1 : ! exit code 1 1 : [10.4s] | Error: input: agent.diagnose 'coroutine' object has no attribute 'llm' 1 : [10.4s] |
Can you please have a quick peek at these lines? https://github.com/vikram-dagger/fastapi-sample-app/blob/c768e95a16ceef7c1bfea0a756bb90883dc6123a/.dagger/src/agent/main.py#L92

#

oh wait, it might be a simple syntax error, llm instead of llm()

#

nope, same error: diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN 75 : │ Agent.diagnose ERROR [1.5s] 75 : │ ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.8s] 1 : ! exit code 1

fallow frigate Mar 4, 2025, 10:19 PM

#

sounds like somethng that was awaited didn't need to be? or something needs to be awaited that isn't?

#

oh wait

#

await after.llm.with_prompt("summarize your changes").last_reply()
should be
await after.with_prompt("summarize your changes").last_reply()

upbeat vigil Mar 5, 2025, 10:39 AM

#

I'm getting closer

#

I'm having trouble getting a secret from the workflow into the workspace

#

can you please have a quick look Kyle?

#

line 4453 of https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13673892583/job/38229984604 has the error

upbeat vigil Mar 5, 2025, 12:11 PM

#

ok I have that fixed now

upbeat vigil Mar 5, 2025, 1:29 PM

#

and I now have the agent making changes and posting a comment on the PR

upbeat vigil Mar 5, 2025, 7:12 PM

#

i have some weirdness in my agent loop

#

in this run https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13680959484/job/38252948712#step:4:1553 - line 1567 - the agent seems to have correctly identified and fixed the two issues in the code - tests are passing and the progress report makes sense. But instead of stopping there, it continues working and messes up the fixed code, eventually posting a message that makes it seem unsuccessful

#

any ideas why this is happening?

#

this is actually clearer in the trace https://v3.dagger.cloud/dagger/traces/1fa83b5a1c7df60bdf467b41334d6200

fallow frigate Mar 5, 2025, 7:17 PM

#

sometimes llms do that. I'd say just more prompt tweaking is usually the answer. Maybe add something like "when the tests pass, you are done."

upbeat vigil Mar 5, 2025, 7:17 PM

#

ok let me try that

fallow frigate Mar 5, 2025, 7:20 PM

#

Oh also maybe add in summary = await after.with_prompt something like "Do not make any more changes to the workspace"

upbeat vigil Mar 5, 2025, 7:20 PM

#

nice, trying

fallow frigate Mar 5, 2025, 7:21 PM

#

yeah i think that should do it. It seems like that's where it's going off the rails

upbeat vigil Mar 5, 2025, 7:58 PM

#

So in this case https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13683853883/job/38262449607#step:4:2521 - line 2521 - it identifies the error as a wrong import, fixes it and runs the tests. But then the test runner fails without any output https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13683853883/job/38262449607#step:4:2602 - line 2599.

#

But we already updated the code to make sure that the test tool always returns stderr or stdout, in this case it isn't returning anything

fallow frigate Mar 5, 2025, 7:59 PM

#

The 0.0 is sus, maybe it was a cached failure?

upbeat vigil Mar 5, 2025, 8:00 PM

#

this is where we catch the exit code https://github.com/vikram-dagger/fastapi-sample-app/blob/c0dd0b1a516b00ab46b3054e813e4faf1e31ebe2/.dagger/workspace/src/workspace/main.py#L74

#

hmm could be, should I put a cachebuster into the test tool?

#

or if exit code is not 0, should we return both stdout + stderr, assuming that one of them has something useful to report

fallow frigate Mar 5, 2025, 8:02 PM

#

I don't know if that's necessary. If the test failure was cached it means the test code should fail tests. I think the function return should be still getting the stderr/stdout from that even if it was cached, it's just deduped from the logs

upbeat vigil Mar 5, 2025, 8:04 PM

#

ok so maybe return both stdout + stderr instead of just stderr?

#

it's weird that the test is failing with no output, this doesn't happen if you run that fn manually

fallow frigate Mar 5, 2025, 8:06 PM

#

That's what I mean, I think the caller should still be getting the cached response, it's just not logged AFAIK cc @tacit bane

upbeat vigil Mar 5, 2025, 8:15 PM

#

the "do not make more changes to the workspace" seems to fail hard 🙂

#

it's solved the problem completely here but then says no changes were made: https://github.com/vikram-dagger/fastapi-sample-app/actions/runs/13684650154/job/38265111635

#

maybe that needs to be reworded or sth

upbeat vigil Mar 5, 2025, 10:04 PM

#

so i made 3 changes:

reworded the final prompt
added cachebuster to test tool
added both stdout and stderr to the output when test tool fails

#

so far so good - i tried it again this time introducing two errors into the code and it was able to resolve: https://github.com/vikram-dagger/fastapi-sample-app/pull/46

#debugging workspace