#Custom LLM overhead?
Moving here to keep state
do you mind running a test to measure the latency of the last sentence on your side?
if it's 1.6s then it could be your server doesn't really stream it?
it's suspicious the first byte and sentence are so close on our side but not yours
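A minimal sketch of the check being asked for here, assuming the reply can be read as an iterator of byte chunks from whatever HTTP client is in use. The names `time_stream` and `looks_buffered` and the 50 ms threshold are illustrative assumptions, not anything from either side's code. It records when the first and last chunks arrive; if both land at almost the same moment, the reply was probably buffered rather than streamed.

```python
import time
from typing import Callable, Iterable, Optional


def time_stream(chunks: Iterable[bytes],
                clock: Callable[[], float] = time.monotonic) -> dict:
    """Consume a chunk iterator and record first/last chunk arrival times."""
    start = clock()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for chunk in chunks:
        now = clock()
        if not chunk:
            continue  # skip keep-alive / empty chunks
        if first is None:
            first = now - start
        last = now - start
        count += 1
    return {"first_byte_s": first, "last_byte_s": last, "chunks": count}


def looks_buffered(timing: dict, min_gap_s: float = 0.05) -> bool:
    """Heuristic with an assumed threshold: a multi-chunk reply whose first
    and last chunks arrive almost together was likely not streamed."""
    if timing["first_byte_s"] is None or timing["chunks"] < 2:
        return False
    return (timing["last_byte_s"] - timing["first_byte_s"]) < min_gap_s
```

Running this from a machine outside the server's own network, against the same endpoint the voice platform calls, would make the numbers comparable with the platform-side traces.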
bumping this, because I noticed this as well, but didn't test as thoroughly as Lukas
when did you notice? there was a recent fix about 1-2 weeks back
This week. I'm noticing it today as well.
Sure, for the example above the last byte was received/forwarded at 19:55:10.781 CET, i.e. in this call the custom LLM server completed after 0.99 seconds. In other words, even if the streaming doesn't work properly (which I don't think is the case, since we can see it stream properly in an OpenAI-style chat interface), there still seems to be about 0.6 sec of overhead.
Let me know if you need anything else to help find the root cause 🙂 Happy to share the server-side code etc.
Will add some logging and then ask you to share a conversation id with me tomorrow
adding sentry trace from @forest dirge because @uncut panther is tracking here
Hi @forest dirge we added some logging, can you make a new call and send a conversation id?
@uncut panther 9BxX8j5zcB6YJbdQfKfX thanks again. appreciate you looking into this
what times are you seeing?
if the transcript gets sent via websocket as soon as you get it, then it's interesting that there's apparently a delay between that and when the custom_llm is called. Seems like that plus waiting for the whole completion to be done
oh shoot that wasn't running on my logging branch, that was in production, give me a sec
7kirV0IYxfqsGbISx41A anecdotally it felt a bit faster.
I whipped this tracing up really quick and I don't have unique ids to correlate transcript to custom llm call, so don't take these timings as gospel truth, but I'd be curious how they compare
so for the second answer in 9BxX8j5zcB6YJbdQfKfX I see an unexplained ~120ms at the beginning, but other than that, when the request is sent:
2025-03-16 03:49:28.110
2025-03-16 03:49:29.702 (Received chunk of size 1531)
2025-03-16T03:49:29.792 (Received chunk of size 301)
....
First one is 5x larger than others
7kirV0IYxfqsGbISx41A -> the first message is 1.6s but the others are 700ms
7kirV0IYxfqsGbISx41A -> first message:
extra 154ms to make a request due to some security stuff which we need to optimize, then:
2025-03-16T03:54:31.868 (request)
2025-03-16 03:54:33.386 (received chunk of size 330)
now all are similar size
second message:
110ms start overhead, then:
2025-03-16 03:54:41.328 (request)
2025-03-16 03:54:41.899 (received chunk of size 330)
Fast - I guess what you see in yours?
then each new chunk (either 300 or 600 bytes):
e.g. for the chunk at .899 we finish processing it at .925
and at 42.006 we already have a sentence accepted and are starting TTS generation
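For reference, the request-to-first-chunk delays implied by the timestamps quoted above can be worked out with a short script. The timestamps are copied from the messages; the labels and the helper name `first_chunk_delay_ms` are only for illustration.

```python
from datetime import datetime

# (request sent, first chunk received) pairs copied from the log lines above
PAIRS = {
    "9BxX8j5zcB6YJbdQfKfX, second answer":
        ("2025-03-16 03:49:28.110", "2025-03-16 03:49:29.702"),
    "7kirV0IYxfqsGbISx41A, first message":
        ("2025-03-16T03:54:31.868", "2025-03-16 03:54:33.386"),
    "7kirV0IYxfqsGbISx41A, second message":
        ("2025-03-16 03:54:41.328", "2025-03-16 03:54:41.899"),
}


def first_chunk_delay_ms(sent: str, received: str) -> int:
    """Milliseconds between sending the request and receiving the first chunk."""
    delta = datetime.fromisoformat(received) - datetime.fromisoformat(sent)
    return round(delta.total_seconds() * 1000)


DELAYS = {label: first_chunk_delay_ms(*pair) for label, pair in PAIRS.items()}
for label, ms in DELAYS.items():
    print(f"{label}: {ms} ms")
```

This gives 1592 ms, 1518 ms and 571 ms. Adding the reported request overhead (154 ms and 110 ms) to the last two gives roughly 1.67 s and 0.68 s, in line with the "1.6s" and "700ms" figures mentioned for that call.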
so the biggest difference between the two is just the delay to get the reply. hm, by any chance do you have some middleware or DDoS layer that takes a long time? could you try creating a VM in a US region, querying your server from it, and measuring the latency?
e.g. in this convo it was fast except for the first message, but your prod one wasn't
anecdotally the first chunk received is smaller in your dev call than prod
Oh interesting. We're on railway. I'm not sure what DDoS protection or middleware they have in front of us, but I can set up another test
It’s consistently fast on our other provider local or in production, but all the communication with them is via websocket
@uncut panther are you tracing the time from transcription to request? From my perspective I am sending audio all the time from twilio, so I can't detect turns without custom logic. My anchor point for tracing is when I first get the transcription, then I can track it through its lifecycle of transcript -> request -> first token -> first audio -> final token
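The lifecycle tracing described here could be sketched roughly as below. This is an illustration under assumptions, not the actual tracing code: the class `TurnTrace`, the stage names and the method names are all made up for the example. It stores the first timestamp seen for each stage and reports the gap between consecutive stages.

```python
import time
from typing import Callable, Dict, List, Tuple

# Stage order as described in the message above (names are illustrative)
STAGES = ["transcript", "request", "first_token", "first_audio", "final_token"]


class TurnTrace:
    """Records the first time each lifecycle stage is seen for one turn."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._marks: Dict[str, float] = {}

    def mark(self, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        # setdefault keeps the first occurrence, e.g. the first token only
        self._marks.setdefault(stage, self._clock())

    def gaps_ms(self) -> List[Tuple[str, float]]:
        """Gaps between consecutive recorded stages, in the assumed order."""
        seen = [s for s in STAGES if s in self._marks]
        return [
            (f"{a} -> {b}", round((self._marks[b] - self._marks[a]) * 1000, 1))
            for a, b in zip(seen, seen[1:])
        ]
```

A negative gap would mean two stages arrived in a different order than the assumed one, which is itself worth knowing when comparing traces from both sides.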
It appears the delay is between transcription and when the custom LLM is called, based on watching the logs while running it.
I can DM you a test prod number to call on monday if you'd like to call a bunch and keep watching on your end. I have a test number for our current provider as well if you want to compare.
appreciate you looking into this!
I am tracing when I make the http request to your server, in the dev call first time it took 1.5s to get an answer from your server, subsequent ones were fast
You can try putting in the openai api address and your openai dev key and you will see it's just as fast (+100ms) as selecting gpt in the llm list
yeah we can also implement the ws option
if easier
based on my traces it looked like everything was extremely fast except for the delay between transcript received via ws and when I received your http custom llm call. That's the piece I'm suspecting is adding slowness.
@uncut panther this is how millis does it. https://docs.millis.ai/tutorials/setup-custom-llm-websocket the nice part is that as soon as you get the transcript you can start streaming audio back and you already have a secure connection, but either way works as long as it's fast
In case you need another example to trace (with custom LLM) - have a look at this call id gtlC0dJm91V9auF760IB from today (again approx 1.8 seconds, where the LLM only takes approx 1.1 seconds)
@uncut panther I saw you mention in the main chat that things got faster. Would you consider this “fixed”?
Yes, but that was due to some older issue we had
our analysis above is already taking that into account
currently there is up to 100ms overhead for custom llm but shouldn't be more
we will implement the ws custom llm option maybe within 3 weeks
Anecdotally it has seemed quite a bit faster reliably
nvm it was really slow this morning. not really usable for us right now when it's so up and down. Wish we could pay you guys for your voice activity detection
I'll check back in about the custom llm option
yes, for us it also seems super volatile in terms of speed (sometimes super fast, sometimes super slow) - do you know what this could be? and can you tell me what format you need for us to trigger a language change? i tried a format analogous to the 'end_call' but can't get it to work
@distant canopy could you advise on the language change tool?
@static crater are you also using custom llm?
@forest dirge can you send a recent one that was slow?
@static crater also if you can share one would be great
yes im using a custom llm which is trying to change the language using your standard system tool call 'language_detection' (i.e. exactly the same way as we are using 'end_call' - which works really nicely). We have the tool 'activated' in the convo-ai ui. @distant canopy would be fantastic if you can advise on what I'm doing wrong - i suspect some kind of syntax issue - have tried quite a lot of different versions (type = 'system', type = 'function', with another empty delta: {} trigger at the end & without).
this is our current structure (following open-ai schema - finish reason is none since we have normal text snippets that follow a language change):
(more context for the request above) >> just to check if it works i passed the arguments to the end_call tool and see that its 'correctly' passed to elevenlabs >> see below (i.e. the method/approach of passing data seems to work) >> perhaps your language_detection tool follows a slightly different nomenclature/style?
so for sure the type should be system
ok i think i already tried that but ill try again. (FYI when we do end_call the type is function and works)
and it doesnt work for system?
nope
i just did another test with 'system' (unsuccessful) >> if you have logging on your side please have a look at call k3IOk3s1cCLUczy0mbki
on our side i can confirm we sent the tool-call (before the text Hallo! Natürlich, ich kann ....)
alright - appreciate the help 🙂
🤣
servus
looking at this again
i think is because you need to stream back a final finish_reason = "tool_calls" at the end
ok, pretty sure i tried that as well yesterday but will try again. >> will i still be able to continue streaming text after i do that or will it end the connection?
(or do i send the final finish reason after the last piece of text)?
you can still return text
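A rough sketch of the chunk sequence being suggested here, assuming the custom LLM server replies with OpenAI-style `chat.completion.chunk` events over SSE. The helper names, the placeholder ids and the `{"language": "de"}` arguments are hypothetical, and `tool_type` is left as a parameter because the thread disagrees on whether it should be `system` or `function`. Nothing in this thread confirms that this sequence resolves the issue.

```python
import json
from typing import Iterator, List, Optional


def sse_chunk(delta: dict, finish_reason: Optional[str] = None) -> str:
    """One OpenAI-style streaming chunk, framed as a server-sent event."""
    payload = {
        "id": "chatcmpl-example",  # placeholder id
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }
    return f"data: {json.dumps(payload)}\n\n"


def stream_with_tool_call(tool_name: str, arguments: dict,
                          text_parts: List[str], tool_type: str) -> Iterator[str]:
    # 1) tool call delta; arguments are sent as a JSON-encoded string
    yield sse_chunk({"tool_calls": [{
        "index": 0,
        "id": "call_example",  # placeholder id
        "type": tool_type,
        "function": {"name": tool_name, "arguments": json.dumps(arguments)},
    }]})
    # 2) normal text deltas can still follow the tool call
    for part in text_parts:
        yield sse_chunk({"content": part})
    # 3) final chunk carries finish_reason = "tool_calls"
    yield sse_chunk({}, finish_reason="tool_calls")
    yield "data: [DONE]\n\n"
```

For example, `stream_with_tool_call("language_detection", {"language": "de"}, ["Hallo!"], tool_type="system")` yields four events: the tool call, one text chunk, the final `tool_calls` chunk, and `[DONE]`.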
ok, ill give it a try and report back in 10 min
i see no errors in the logs but can see that your agent just gets blocked and then times out
very weird
i just tried again with the old openai finish_reason = 'function_call' (instead of tool_calls) and that also didnt work.. interestingly when finish_reason = 'tool_calls' it times out / hangs up, when finish_reason = 'function_call' it just ignores the tool call.
yeah i think it's because in our codebase we have something like this:
if output.choices[0].delta.content:
    yield the text
elif output.choices[0].delta.tool_calls:
    set tool info
elif output.choices[0].finish_reason == "tool_calls":
    yield tool request
elif output.choices[0].finish_reason == "stop":
    yield tool request
    break
so function_call is just ignored
but something wrong is happening with "tool_calls" but i can see that it is setting the tool info correctly
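To make the branch order described above concrete, here is a runnable sketch of such a consumer loop. It is a reconstruction from the message, working on plain dict chunks, and not the actual platform code; `consume` and the event tuples are illustrative names.

```python
from typing import Iterable, Iterator, Tuple


def consume(chunks: Iterable[dict]) -> Iterator[Tuple[str, object]]:
    """Mirror of the quoted branch order, using plain dict chunks."""
    tool_info = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta") or {}
        finish = choice.get("finish_reason")
        if delta.get("content"):
            yield ("text", delta["content"])
        elif delta.get("tool_calls"):
            tool_info = delta["tool_calls"]  # remembered, not yet emitted
        elif finish == "tool_calls":
            yield ("tool_request", tool_info)
        elif finish == "stop":
            yield ("tool_request", tool_info)  # as quoted in the message
            break
        # any other finish_reason (e.g. "function_call") matches no branch
```

With this ordering, a stream that ends with `finish_reason = "function_call"` never emits a tool request, which matches the observation that the tool call is simply ignored in that case.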
hm ok - could it be something with the format of the arguments? (currently assuming the arguments within tool_calls need to be in json format)
.. or is a minimum delay needed between setting the tool info and sending finish_reason = 'tool_calls'? (see screenshot above - both are essentially sent 'immediately' after each other)
tried adding a 0.05 second delay between setting the tool and finish_reason = 'tool_calls' - also didn't work
Seems faster, but I’m realizing my previous traces were incorrect. You seem to be hitting custom llm before the transcript. I thought you were doing the opposite.
yes, would mean a lot if you check!
yeah we send transcript after because we only commit to transcript after we send the audio
user can resume speech and its transcript will change
@distant canopy appreciate your support thus far. Have you managed to look into what's happening with 'tool_calls'? Is this something on 'your' side or something that we need to fix on our side? (if so, what?)
hey sorry i didnt get round to this today, will take a look this weekend