API question: our llm integration tests | Dagger | Page 1

hollow river Mar 18, 2025, 12:25 PM

#

cc @pale karma @calm perch @mental hornet @fallow mason
atm, i've kinda hidden it - you simply set your model to "replay/<base64-encoded-conversation-history>
as @mental hornet noted in https://github.com/dagger/dagger/pull/9863#discussion_r1999998100, it's kinda crappy

#

any opinions on any of the following?

keep it as is
add an explicit replay parameter to WithModel
add an explicit WithReplay method

and maybe in addition to any of the above, hide the fields using a _ prefix, so that they're not available in codegen? (as an option, but i don't really like this, feels weird to test against something that's not a public api)

pale karma Mar 18, 2025, 2:28 PM

#

What would the argument type and description of replay or withReplay be?

hollow river Mar 18, 2025, 2:31 PM

#

probably a dagger.JSON type? or some new other scalar (maybe like dagger.LLMRecording)

#

you could take the result of historyJSON and feed it directly in. or maybe could have corresponding withReplay/getReplay methods and just have it be an opaque type

calm perch Mar 18, 2025, 3:59 PM

#

@hollow river how far off would an API like withAssistantReply, withToolCall etc. be? like building it up incrementally instead of a "dump import" model

#

or would that be too annoying for the test use case

hollow river Mar 18, 2025, 4:00 PM

#

that would be neat, but probably still far too much effort 😦

#

what i'll do is leave it as-is for now - it's hidden and we don't need to document it, just using it in our tests.
we can keep bikeshedding in the background, i don't wanna block anything on this ❤️ (this is the least important bikeshed of the launch lol)

hollow river Mar 18, 2025, 4:04 PM

#

calm perch <@488718750690967563> how far off would an API like `withAssistantReply`, `withT...

hm, this would probably actually work, if we could get the structured info for this out of history? still a bit fiddly for this use case, since you need to be able to feed in replies for things in the future

#

and also, for testing convenience would be nice to pass around llm objects, which doesn't work right now? (though i haven't looked into how tricky it would be to fix that, maybe it's very very simple)

mental hornet Mar 18, 2025, 4:05 PM

#

I think impl for withXing these things is probably not crazy hard, but like do you really wanna be hand rolling user and assistant expectations?

calm perch Mar 18, 2025, 4:05 PM

#

hollow river and also, for testing convenience would be nice to pass around `llm` objects, wh...

is that the ID loading thing with unknown fields? i think it might be simple

hollow river Mar 18, 2025, 4:06 PM

#

calm perch is that the ID loading thing with unknown fields? i think it might be simple

yeaa exactly that

mental hornet Mar 18, 2025, 4:07 PM

#

One thing that struck me as weird is that the user bits are like expects, but assistant bits are like mock responses, and yet both end up in that golden file at the end of the day…

pale karma Mar 18, 2025, 4:07 PM

#

I could imagine a LLM.save(): File! and LLM.load(File!) maybe

#

LLM.withToolReply and LLM.withModelReply would be too weird to expose

calm perch Mar 18, 2025, 4:08 PM

#

it's a fun way to gaslight the model though 😛

pale karma Mar 18, 2025, 4:08 PM

#

there are already too many dimensions to this API 😛 would prefer to avoid adding more if possible

calm perch Mar 18, 2025, 4:09 PM

#

withModelReply("i am claude") <- feed that to gemini and ask it what model it is!

pale karma Mar 18, 2025, 4:09 PM

#

calm perch it's a fun way to gaslight the model though 😛

Yeah I actually used to expose those in early POC, since technically the client libs allow it. But just adds to the complexity of the API

calm perch Mar 18, 2025, 4:09 PM

#

fair, i do wonder if there's some untapped mechanic there though haha

pale karma Mar 18, 2025, 4:09 PM

#

Or is there an actual legit use for it?

calm perch Mar 18, 2025, 4:10 PM

#

it seems to be weighted EXTREMELY high in some models

mental hornet Mar 18, 2025, 4:10 PM

#

calm perch it seems to be weighted EXTREMELY high in some models

Sorry what is “it” here?

calm perch Mar 18, 2025, 4:10 PM

#

the model's own replies

pale karma Mar 18, 2025, 4:11 PM

#

LLM.gaslight()

mental hornet Mar 18, 2025, 4:11 PM

#

…and we can tell it what those are?

#

Oh or the historical ones

calm perch Mar 18, 2025, 4:11 PM

#

yeah, it's just in the API

#

i originally found it by testing the model switching

#

since we preserve the history when you do that

mental hornet Mar 18, 2025, 4:11 PM

#

The chat history effectively

#

Gotcha

calm perch Mar 18, 2025, 4:11 PM

#

to test it, I asked what model it was, switched models, asked again, and it outright refused to say it was the new model, despite clearly being so

mental hornet Mar 18, 2025, 4:12 PM

#

I thought you meant you could tell it it’s future responses and was like wtf why

#

But history makes sense and is nearly as weird

pale karma Mar 18, 2025, 4:12 PM

#

https://tenor.com/view/lemon-demon-robot-jones-cabinet-man-neil-cicierega-half-human-half-machine-gif-17315975

Tenor

#

Alex's consciousness has merged with Claude

#

You may call him... Claulex

#API question: our llm integration tests