#debugging workspace

1 messages ยท Page 1 of 1 (latest)

upbeat vigil
#

any suggestions to figure out what's going on?

fallow frigate
#

Looks like it could have gotten confused because it called last_exec_output after the failed test. I can't tell if the files are changed because the file is so long, but based on the workspace code it looks ok

#

Trying to understand the goal - the tests are broken on main and the agent is supposed to identify the test failure and fix it? What is broken with the tests?

upbeat vigil
#

This was also my guess

#

I'm afk but I was going to try to remove the last_exec tool to see if that helped

fallow frigate
#

cool yeah it probably doesn't need exec at all since you just need test to pass and that only requires modifying a file

upbeat vigil
#

The goal is that the agent runs when a PR is opened and the test pipeline fails. It then identifies the error and updates the relevant files, tests again, keeps going until the tests pass. Eventually the idea is to have the agent write the passing code back to the same pr (not got there yet).

fallow frigate
#

got it. what is the correct solution to the broken tests? so I know what to look for in the diffs

upbeat vigil
#

The PR adds a new field to the api (i think called genre)

#

The correct solution is to update two files models.py and test_main.py to account for the new field

#

models.py has the api request and response data model, as well as the sql model. The PR only updates the sql model, so the agent updates the others

#

test_main.py has unit tests for the repository and client tests, so the agent updates both those to assert the new field

#

Then runs tests again and if passing writes the code back to the PR

#

Alternatively it could post a comment on the PR explaining the issue and the solution

fallow frigate
#

Got it. Well in that case the diff actually looks correct, but the test failures do not look related to that change. Should the tests be passing with just that change? Specifically check against the tool versions in your workspace.test

upbeat vigil
#

I believe they should pass but let me check once again

upbeat vigil
fallow frigate
#

I was looking at this diff: https://v3.dagger.cloud/dagger/traces/f4645e41d7b11c18b8f68cf248d3c00a?span=081c5b5588dbddd3
which looks right. But yes that can also be a thing models do sometimes. In my experience when this happens the test will fail and it will try to figure it out by trying something different (like just modifying the existing file). What you really want, especially in a context this big (yes it's a small project but you're feeding the llm several large files) is something like a partial write or diff write. I haven't actually tried this yet but it's necessary for a more advanced workspace. See this thread for more conversation on that #1345613543256490064 message

upbeat vigil
#

now if only it would do this ^^ every time

fallow frigate
#

Yeah ๐Ÿ™‚ That's the "narrowing scope" part of improving consistency. I do think a "diff write" would be a big improvment given the size of the files in this demo. I can try to implement it in my workspace today and link it to you for tomorrow

upbeat vigil
#

I don't know if the scope is narrowed though. I just tried it again, just changing the name of the field. Trace at https://v3.dagger.cloud/dagger/traces/7f457560345f4ed964de53674efaf832. Once again it identified the changes correctly but the tests failed. There's no output explaining why so the agent doesn't know what to do next. Is this related to the tool, the model or the prompt? It's odd that each run is a snowflake when the challenge and the prompt is the same.

fallow frigate
#

It's odd that each run is a snowflake when the challenge and the prompt is the same.
That's LLMs working as intended ๐Ÿ™‚ If they were fully deterministic, they wouldn't be useful

There's no output explaining why so the agent doesn't know what to do next. Is this related
This is why in the exec it uses returntype.ANY and checks the exit code, then raises a detailed error: https://github.com/vikram-dagger/fastapi-sample-app/blob/main/.dagger/workspace/src/workspace/main.py#L105-L108

In test you could do that, and if exit code is 0 then return stdout()

I don't know if the scope is narrowed though
I was referring to creating a diff_write type of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success

upbeat vigil
#

I was referring to creating a diff_write type of function like mentioned in the thread i linked. This narrows the scope of work for the LLM because it doesn't have to generate the whole file, just the changes, which should greatly increase it's rate of success
Yes, makes sense

fallow frigate
#

oh yeah on test don't return self, return the stdout / stderr. You can discard ctr

#
    @function
    async def test(
        self
    ) -> str:
        postgresdb = (
            dag.container()
            .from_("postgres:alpine")
            .with_env_variable("POSTGRES_DB", "app_test")
            .with_env_variable("POSTGRES_PASSWORD", "secret")
            .with_exposed_port(5432)
            .as_service(args=[], use_entrypoint=True)
        )

        cmd = (
            self.ctr
            .with_service_binding("db", postgresdb)
            .with_env_variable("DATABASE_URL", "postgresql://postgres:secret@db/app_test")
            .with_exec(["sh", "-c", "pytest"], expect=ReturnType.ANY)
        )
        if await cmd.exit_code() != 0:
            raise Exception(f"Tests failed. \nError: {await cmd.stderr()}")
        return await cmd.stdout()
upbeat vigil
#

ah ok

#

trying

upbeat vigil
#

seems to be the same, works on and off

#

do you have some time now to pair on this?

fallow frigate
upbeat vigil
#

I've also updated the agent to post a comment on the PR with a summary of changes it made, but although the comment tool works standalone, the agent doesn't seem to use it

#

yes

upbeat vigil
#

Hey, I tried the approach you showed me but I got part of it wrong because I see 75 : ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.5s] 1 : ! exit code 1 1 : [10.4s] | Error: input: agent.diagnose 'coroutine' object has no attribute 'llm' 1 : [10.4s] |
Can you please have a quick peek at these lines? https://github.com/vikram-dagger/fastapi-sample-app/blob/c768e95a16ceef7c1bfea0a756bb90883dc6123a/.dagger/src/agent/main.py#L92

#

oh wait, it might be a simple syntax error, llm instead of llm()

#

nope, same error: diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN 75 : โ”‚ Agent.diagnose ERROR [1.5s] 75 : โ”‚ ! 'coroutine' object has no attribute 'llm' 1 : diagnose --repository=env://GITHUB_REPOSITORY --ref=env://GITHUB_REF_NAME --token=env://GITHUB_API_TOKEN ERROR [10.8s] 1 : ! exit code 1

fallow frigate
#

sounds like somethng that was awaited didn't need to be? or something needs to be awaited that isn't?

#

oh wait

#

await after.llm.with_prompt("summarize your changes").last_reply()
should be
await after.with_prompt("summarize your changes").last_reply()

upbeat vigil
#

I'm getting closer

#

I'm having trouble getting a secret from the workflow into the workspace

#

can you please have a quick look Kyle?

upbeat vigil
#

ok I have that fixed now

upbeat vigil
#

and I now have the agent making changes and posting a comment on the PR

upbeat vigil
#

i have some weirdness in my agent loop

#

any ideas why this is happening?

fallow frigate
#

sometimes llms do that. I'd say just more prompt tweaking is usually the answer. Maybe add something like "when the tests pass, you are done."

upbeat vigil
#

ok let me try that

fallow frigate
#

Oh also maybe add in summary = await after.with_prompt something like "Do not make any more changes to the workspace"

upbeat vigil
#

nice, trying

fallow frigate
#

yeah i think that should do it. It seems like that's where it's going off the rails

upbeat vigil
#

But we already updated the code to make sure that the test tool always returns stderr or stdout, in this case it isn't returning anything

fallow frigate
#

The 0.0 is sus, maybe it was a cached failure?

upbeat vigil
#

hmm could be, should I put a cachebuster into the test tool?

#

or if exit code is not 0, should we return both stdout + stderr, assuming that one of them has something useful to report

fallow frigate
#

I don't know if that's necessary. If the test failure was cached it means the test code should fail tests. I think the function return should be still getting the stderr/stdout from that even if it was cached, it's just deduped from the logs

upbeat vigil
#

ok so maybe return both stdout + stderr instead of just stderr?

#

it's weird that the test is failing with no output, this doesn't happen if you run that fn manually

fallow frigate
#

That's what I mean, I think the caller should still be getting the cached response, it's just not logged AFAIK cc @tacit bane

upbeat vigil
#

the "do not make more changes to the workspace" seems to fail hard ๐Ÿ™‚

#

maybe that needs to be reworded or sth

upbeat vigil
#

so i made 3 changes:

  • reworded the final prompt
  • added cachebuster to test tool
  • added both stdout and stderr to the output when test tool fails