Custom LLM overhead? | ElevenLabs | Page 1

uncut panther Mar 13, 2025, 7:10 PM

#

Moving here to keep state

#

do you mind running a test on what's the latency of the last sentence on your side?

#

if it's 1.6s then it could be your server doesn't really stream it?

#

it's suspicious the first byte and sentence are so close on our side but not yours

forest dirge Mar 13, 2025, 7:19 PM

#

bumping this, because I noticed this as well, but didn't test as thoroughly as Lukas

uncut panther Mar 13, 2025, 7:25 PM

#

forest dirge bumping this, because I noticed this as well, but didn't test as thoroughly as L...

when did you notice? there was a recent fix like a 1-2 weeks back

forest dirge Mar 13, 2025, 7:34 PM

#

This week. I'm noticing it today as well.

static crater Mar 13, 2025, 9:10 PM

#

uncut panther if it's 1.6s then it could be your server doesn't really stream it?

Sure, for the example above the last byte was received/forwarded at 19:55:10.781 CET, i.e. in this call the custom LLM server completed after 0.99 seconds. In other words, even if the streaming doesn't work properly - which i dont think since we can see it properly stream in a open-ai style chat interface - there still seems to be 0.6 sec overhead..

#

Let me know if you need anything else to help find the root-cause.. 🙂 Happy to share the server-side code etc.

uncut panther Mar 14, 2025, 12:19 AM

#

Will add some logging and then ask you to share me a conversation id tmw

distant canopy Mar 15, 2025, 11:50 AM

#

adding sentry trace from @forest dirge because @uncut panther is tracking here

uncut panther Mar 16, 2025, 2:05 AM

#

Hi @forest dirge we added some logging, can you make a new call and send a conversation id?

forest dirge Mar 16, 2025, 3:51 AM

#

@uncut panther 9BxX8j5zcB6YJbdQfKfX thanks again. appreciate you looking into this

uncut panther Mar 16, 2025, 3:53 AM

#

what times are you seeing?

forest dirge Mar 16, 2025, 3:53 AM

#

if the transcript gets sent via websocket as soon as you get it, then it's interesting that there's apparently a delay between that and when the custom_llm is called. Seems like that plus waiting for the whole completion to be done

#

oh shoot that wasn't running on my logging branch, that was in production, give me a sec

#

7kirV0IYxfqsGbISx41A anecdotally it felt a bit faster.

I whipped this tracing up really quick and I don't have unique id's to correlate transcript to custom llm call so don't take these timings as gospel truth, but I'd be curious how they compare

uncut panther Mar 16, 2025, 4:07 AM

#

so for the 9BxX8j5zcB6YJbdQfKfX second's answer I see unexplained ~120ms at the beginning but other then that, when request is sent:
2025-03-16 03:49:28.110
2025-03-16 03:49:29.702 (Received chunk of size 1531)
2025-03-16T03:49:29.792 (Received chunk of size 301)
....

First one is 5x larger than others

#

7kirV0IYxfqsGbISx41A -> the first message is 1.6s but the others are 700ms

#

7kirV0IYxfqsGbISx41A -> first message:
extra 154ms to make a request due to some security stuff which we need to optimize then:
2025-03-16T03:54:31.868 (request)
2025-03-16 03:54:33.386 (received chunk of size 330)

now all are similar size

#

second message:
110ms start overhead, then:

2025-03-16 03:54:41.328 (request)
2025-03-16 03:54:41.899 (received chunk of size 330)

Fast - I guess what you see in yours?

#

then each new chunk (either 300 or 600 bytes):

#

e.g. for the 899 we finish processing it 925

#

and 42.006 we already have a sentence accepted and starting tts generation

#

so the biggest difference between the two is just the delay to get the reply hm, by any chance don't you have some middleware or ddos layer that takes long time? could you try creating a VM in a US region and query your server and meassure latency?

#

e.g. in this convo it was fast except first message, but in your prod one wasn't

#

anecdotally the first chunk received is smaller in your dev call than prod

forest dirge Mar 16, 2025, 4:32 AM

#

Oh interesting. We’re on railway. I’m not sure what ddos or middleware they have in front of us, but can setup another test

forest dirge Mar 16, 2025, 2:24 PM

#

It’s consistently fast on our other provider local or in production, but all the communication with them is via websocket

forest dirge Mar 16, 2025, 6:56 PM

#

@uncut panther are you tracing the time from transcription to request? From my perspective I am sending audio all the time from twilio, so I can't detect turns without custom logic. My anchor point for tracing is when I first get the transcription, then I can track it through it's lifecycle of transcript -> request -> first token -> first audio -> final token

appears like the delay is from transcription to when custom llm is called based on watching the logs while running it.

I can DM you a test prod number to call on monday if you'd like to call a bunch and keep watching on your end. I have a test number for our current provider as well if you want to compare.

#

appreciate you looking into this!

uncut panther Mar 17, 2025, 5:40 AM

#

I am tracing when I make the http request to your server, in the dev call first time it took 1.5s to get an answer from your server, subsequent ones were fast

#

You can try putting openai api address and your openai dev key and you will see it's the same fast (+100ms) as if selecting gpt in the llm list

#

yeah we can also implement the ws option

#

if easier

forest dirge Mar 17, 2025, 2:41 PM

#

based on my traces it looked like everything was extremely fast except for the delay between transcript received via ws and when I received your http custom llm call. That's the piece I'm suspecting is adding slowness.

forest dirge Mar 17, 2025, 4:56 PM

#

@uncut panther this is how millis does it. https://docs.millis.ai/tutorials/setup-custom-llm-websocket the nice part is that as soon as you get the transcript you can start streaming audio back and you already have a secure connection, but either way works as long as it's fast

Millis AI

Setup a custom LLM Websocket - Millis AI

Step-by-step guide on how to set up a custom LLM for your Millis AI voice agent

static crater Mar 17, 2025, 8:23 PM

#

In case you need another example to trace (with custom LLM) - have a look at this call id gtlC0dJm91V9auF760IB from today (again approx 1.8 seconds, where the LLM only takes approx 1.1 seconds)

forest dirge Mar 20, 2025, 4:56 PM

#

@uncut panther I saw you mention in the main chat that things got faster. Would you consider this “fixed”?

uncut panther Mar 21, 2025, 9:33 AM

#

Yes, but that was due to some older issue we had

#

our analysis above is already taking that into account

#

currently there is up to 100ms overhead for custom llm but shouldn't be more

#

we will implement the ws custom llm option maybe within 3 weeks

forest dirge Mar 23, 2025, 6:50 PM

#

Anecdotally it has seemed quite a bit faster reliably

uncut panther Mar 24, 2025, 4:37 PM

#

haha : )

#

no clue why

#

happy to hear so tho!

forest dirge Mar 25, 2025, 3:37 PM

#

nvm it was really slow this morning. not really usable for us right now when it's so up and down. Wish we could pay you guys for your voice activity detection

#

I'll check back in about the custom llm option

static crater Mar 25, 2025, 5:44 PM

#

yes also for us it seems super volatile in terms of speed (sometimes uper fast, sometimes super slow) - do you know what this could be? and can you tell me what format you need for us to trigger a language change? i tried a format analogous to the 'end_call' but cant get it to work

uncut panther Mar 26, 2025, 4:48 PM

#

@distant canopy could you advise on the language change tool?

#

@static crater are you also using custom llm?

#

@forest dirge can yous end a recent one that was slow?

#

@static crater also if you can share one would be great

static crater Mar 27, 2025, 10:09 AM

#

uncut panther <@1311435912449626183> are you also using custom llm?

yes im using a custom llm which is trying to change the language using your standard system tool call 'language_detection' (i.e. exactly the same way as we are using 'end_call' - which works really nicely). We have the tool 'activated' in the convo-ai ui. @distant canopy would be fantastic if you can advise on what I'm doing wrong - i suspect some kind of syntax issue - have tried quite a lot of different versions (type = 'system', type = 'function', with another empty delta: {} trigger at the end & without).

this is our current structure (following open-ai schema - finish reason is none since we have normal text snippets that follow a language change):

#

static crater Mar 27, 2025, 11:33 AM

#

(more context for the request above) >> just to check if it works i passed the arguments to the end_call tool and see that its 'correctly' passed to elevenlabs >> see below (i.e. the method/approach of passing data seems to work) >> perhaps your language_detection tool follows a slightly different nomenclature/style?

#

distant canopy Mar 27, 2025, 12:01 PM

#

so for sure the type should be system

static crater Mar 27, 2025, 12:03 PM

#

ok i think i already tried that but ill try again. (FYI when we do end_call the type is function and works)

distant canopy Mar 27, 2025, 12:03 PM

#

and it doesnt work for system?

static crater Mar 27, 2025, 12:04 PM

#

nope

#

i just did another test with 'system' (unsuccessful) >> if you have logging on your side please have a look at call k3IOk3s1cCLUczy0mbki
on our side i can confirm we sent the tool-call (before the text Hallo! Natürlich, ich kann ....)

distant canopy Mar 27, 2025, 12:17 PM

#

thx lukas

#

gotta jump onto the call but think i have found sth

#

not 100% sure

static crater Mar 27, 2025, 12:17 PM

#

alright - appreciate the help 🙂

distant canopy Mar 27, 2025, 12:17 PM

#

will take a look at that id when im back

#

bis bald

static crater Mar 27, 2025, 12:17 PM

#

🤣

distant canopy Mar 27, 2025, 1:34 PM

#

servus

#

looking at this again

#

i think is because you need to stream back a final finish_reason = "tool_calls" at the end

static crater Mar 27, 2025, 1:46 PM

#

ok, pretty sure i tried that aswell yesterday but will try again. >> will i still be able to contineu streaming text after i do that or will it end the connectioon?

#

(or do i send the final finish reason after the last piece of text)?

distant canopy Mar 27, 2025, 1:47 PM

#

you can still return text

static crater Mar 27, 2025, 1:48 PM

#

ok, ill give it a try and report back in 10 min

distant canopy Mar 27, 2025, 1:48 PM

#

thx

#

first chunk looks good though

#

that can stay the same

static crater Mar 27, 2025, 1:58 PM

#

still no luck.. see call id WgEOeJIlErcvhzYtJhzs

#

for reference @distant canopy:

#

distant canopy Mar 27, 2025, 2:44 PM

#

i see no errors in the logs but can see that you're agent just gets blocked and then times out

#

very weird

static crater Mar 27, 2025, 2:54 PM

#

i just tried again with the old openai finish_reason = 'function_call' (instead of tool_calls) and that also didnt work.. interestingly when finish_reason = 'tool_calls' it times out / hangs up, when finish_reason = 'function_call' it just ignores the tool call.

#

distant canopy Mar 27, 2025, 3:01 PM

#

yeah i think because in our codebase when we have something like this if output.choices[0].delta.content: yield the text elif output.choices[0].delta.tool_calls: set tool info elif output.choices[0].finish_reason == "tool_calls": yield tool request elif ( output.choices[0].finish_reason == "stop" ): yield tool request break

#

so function_call is just ignored

#

but something wrong is happening with "tool_calls" but i can see that it is setting the tool info correctly

static crater Mar 27, 2025, 3:06 PM

#

hm ok - could it be something with the format of the arguments (currently assuming arguments within tool_calls needs to be in json format
.. or a minimum timeout needed after setting tool info and sending the finish_reason = tool_calls? (see screenshot above - both are essentially sent 'immediatly' after each other)

static crater Mar 27, 2025, 3:37 PM

#

tried adding a 0.05 second delay between setting the tool and 'finish_reason'='tool_calls - also didnt work

forest dirge Mar 27, 2025, 11:23 PM

#

Seems faster, but I’m realizing my previous traces were incorrect. You seem to be hitting custom llm before the transcript. I thought you were doing the opposite.

uncut panther Mar 28, 2025, 8:06 AM

#

forest dirge Seems faster, but I’m realizing my previous traces were incorrect. You seem to b...

yes, would mean a lot if you check!

#

yeah we send transcript after because we only commit to transcript after we send the audio

#

user can resume speech and its transcript will change

static crater Mar 28, 2025, 9:56 AM

#

distant canopy but something wrong is happening with "tool_calls" but i can see that it is sett...

@distant canopy appreciate your support thus far. Have you managed to look into whats happening with 'tool_calls' / is this somethign on 'your' side or something that we need to fix on our side? (if so what?)

distant canopy Mar 28, 2025, 8:09 PM

#

hey sorry i didnt get round to this today, will take a look this weekend

static crater Mar 31, 2025, 2:41 PM

#

can you give me a short heads up once youve looked into this so i can reactive the feature on our agents? would be super important for us to get multi-lingual support working

#

(sorry to be a pain - appreciate that you will most definitely have a full backlog)

#Custom LLM overhead?