#Custom LLM overhead?


uncut panther
#

Moving here to keep state

#

do you mind running a test on what's the latency of the last sentence on your side?

#

if it's 1.6s then it could be your server doesn't really stream it?

#

it's suspicious the first byte and sentence are so close on our side but not yours
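
One way to run the check suggested above is to time the first and the last chunk of a streamed reply on the client side. The sketch below is only an illustration: `measure_stream` and its `clock` parameter are hypothetical names, not part of any SDK mentioned in this thread, and `chunks` stands in for whatever iterator your HTTP client yields.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (seconds to first chunk, seconds to last chunk).

    Both values are None if the stream yields nothing.
    """
    start = clock()
    first = None
    last = None
    for _chunk in chunks:
        elapsed = clock() - start
        if first is None:
            first = elapsed  # time to first byte/chunk
        last = elapsed       # overwritten until the final chunk
    return first, last
```

If the two numbers come out nearly equal on a multi-sentence reply, the body probably arrived in one piece (buffered by the server or a proxy in front of it) rather than streamed.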

forest dirge
#

bumping this, because I noticed this as well, but didn't test as thoroughly as Lukas

uncut panther
forest dirge
#

This week. I'm noticing it today as well.

static crater
# uncut panther if it's 1.6s then it could be your server doesn't really stream it?

Sure, for the example above the last byte was received/forwarded at 19:55:10.781 CET, i.e. in this call the custom LLM server completed after 0.99 seconds. In other words, even if the streaming doesn't work properly (which I don't think is the case, since we can see it stream properly in an OpenAI-style chat interface), there still seems to be about 0.6 s of overhead.

#

Let me know if you need anything else to help find the root cause 🙂 Happy to share the server-side code etc.

uncut panther
#

Will add some logging and then ask you to share a conversation id with me tomorrow

distant canopy
#

adding sentry trace from @forest dirge because @uncut panther is tracking here

uncut panther
#

Hi @forest dirge we added some logging, can you make a new call and send a conversation id?

forest dirge
#

@uncut panther 9BxX8j5zcB6YJbdQfKfX thanks again. appreciate you looking into this

uncut panther
#

what times are you seeing?

forest dirge
#

if the transcript gets sent via websocket as soon as you get it, then it's interesting that there's apparently a delay between that and when the custom_llm is called. Seems like that plus waiting for the whole completion to be done

#

oh shoot that wasn't running on my logging branch, that was in production, give me a sec

#

7kirV0IYxfqsGbISx41A anecdotally it felt a bit faster.

I whipped this tracing up really quick and I don't have unique id's to correlate transcript to custom llm call so don't take these timings as gospel truth, but I'd be curious how they compare

uncut panther
#

so for the second answer in 9BxX8j5zcB6YJbdQfKfX I see an unexplained ~120ms at the beginning, but other than that, from when the request is sent:
2025-03-16 03:49:28.110
2025-03-16 03:49:29.702 (Received chunk of size 1531)
2025-03-16T03:49:29.792 (Received chunk of size 301)
....

First one is 5x larger than others

#

7kirV0IYxfqsGbISx41A -> the first message is 1.6s but the others are 700ms

#

7kirV0IYxfqsGbISx41A -> first message:
an extra 154ms to make the request due to some security stuff which we need to optimize, then:
2025-03-16T03:54:31.868 (request)
2025-03-16 03:54:33.386 (received chunk of size 330)

now all are similar size

#

second message:
110ms start overhead, then:

2025-03-16 03:54:41.328 (request)
2025-03-16 03:54:41.899 (received chunk of size 330)

Fast - I guess what you see in yours?

#

then each new chunk (either 300 or 600 bytes):

#

e.g. for the chunk received at .899 we finish processing it at .925

#

and at 42.006 we already have a sentence accepted and are starting tts generation

#

so the biggest difference between the two is just the delay to get the reply. hm, do you by any chance have some middleware or ddos layer that takes a long time? could you try creating a VM in a US region, querying your server from there, and measuring the latency?

#

e.g. in this convo it was fast except first message, but in your prod one wasn't

#

anecdotally the first chunk received is smaller in your dev call than prod
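
The request-to-first-chunk delays quoted in the log lines above can be recomputed directly from the timestamps. This is a small sketch for checking the arithmetic; the labels and the `first_chunk_delay` helper are made up for illustration, and the timestamps are copied as they appear in the messages.

```python
from datetime import datetime

# (label, request sent, first chunk received), copied from the log lines above.
SAMPLES = [
    ("9BxX8j5zcB6YJbdQfKfX, second answer", "2025-03-16 03:49:28.110", "2025-03-16 03:49:29.702"),
    ("7kirV0IYxfqsGbISx41A, first message", "2025-03-16T03:54:31.868", "2025-03-16 03:54:33.386"),
    ("7kirV0IYxfqsGbISx41A, second message", "2025-03-16 03:54:41.328", "2025-03-16 03:54:41.899"),
]


def first_chunk_delay(sent: str, received: str) -> float:
    """Seconds between sending the request and receiving the first chunk."""
    # fromisoformat accepts both the "T" and the space separator.
    delta = datetime.fromisoformat(received) - datetime.fromisoformat(sent)
    return delta.total_seconds()


DELAYS = {label: first_chunk_delay(sent, received) for label, sent, received in SAMPLES}

if __name__ == "__main__":
    for label, delay in DELAYS.items():
        print(f"{label}: {delay:.3f}s")
```

That works out to roughly 1.59 s, 1.52 s and 0.57 s, which matches the description: the prod call and the first dev message wait about 1.5 s for the first chunk, while the second dev message gets it in under 0.6 s.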

forest dirge
#

Oh interesting. We’re on railway. I’m not sure what ddos or middleware they have in front of us, but can set up another test

forest dirge
#

It’s consistently fast on our other provider local or in production, but all the communication with them is via websocket

forest dirge
#

@uncut panther are you tracing the time from transcription to request? From my perspective I am sending audio all the time from twilio, so I can't detect turns without custom logic. My anchor point for tracing is when I first get the transcription, then I can track it through its lifecycle of transcript -> request -> first token -> first audio -> final token

appears like the delay is from transcription to when custom llm is called based on watching the logs while running it.
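
A per-turn trace along the lifecycle described above (transcript -> request -> first token -> first audio -> final token) could look roughly like this. It is a minimal sketch, not the tracing code anyone in this thread is running: `TurnTrace`, the stage names, and the `turn_id` used to correlate events are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

STAGES = ["transcript", "request", "first_token", "first_audio", "final_token"]


class TurnTrace:
    """Timestamps for one conversational turn, keyed by stage name."""

    def __init__(self, turn_id: str, clock: Callable[[], float] = time.monotonic):
        self.turn_id = turn_id
        self._clock = clock
        self._marks: Dict[str, float] = {}

    def mark(self, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        # Keep only the first timestamp so repeated events do not move the mark.
        if stage not in self._marks:
            self._marks[stage] = self._clock()

    def gaps(self) -> List[Tuple[str, str, float]]:
        """Seconds between consecutive recorded stages, in lifecycle order."""
        seen = [s for s in STAGES if s in self._marks]
        return [(a, b, self._marks[b] - self._marks[a]) for a, b in zip(seen, seen[1:])]
```

Because `gaps()` walks the stages in the assumed lifecycle order rather than in arrival order, a negative gap is a hint that two events actually happened in the opposite order to the one the trace assumes.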

I can DM you a test prod number to call on monday if you'd like to call a bunch and keep watching on your end. I have a test number for our current provider as well if you want to compare.

#

appreciate you looking into this!

uncut panther
#

I am tracing when I make the http request to your server, in the dev call first time it took 1.5s to get an answer from your server, subsequent ones were fast

#

You can try putting in the openai api address and your openai dev key, and you will see it's just as fast (+100ms) as selecting gpt in the llm list

#

yeah we can also implement the ws option

#

if easier

forest dirge
#

based on my traces it looked like everything was extremely fast except for the delay between transcript received via ws and when I received your http custom llm call. That's the piece I'm suspecting is adding slowness.

forest dirge
static crater
#

In case you need another example to trace (with custom LLM) - have a look at this call id gtlC0dJm91V9auF760IB from today (again approx 1.8 seconds, where the LLM only takes approx 1.1 seconds)

forest dirge
#

@uncut panther I saw you mention in the main chat that things got faster. Would you consider this “fixed”?

uncut panther
#

Yes, but that was due to some older issue we had

#

our analysis above is already taking that into account

#

currently there is up to 100ms overhead for custom llm but shouldn't be more

#

we will implement the ws custom llm option maybe within 3 weeks

forest dirge
#

Anecdotally it has seemed quite a bit faster reliably

uncut panther
#

haha : )

#

no clue why

#

happy to hear so tho!

forest dirge
#

nvm it was really slow this morning. not really usable for us right now when it's so up and down. Wish we could pay you guys for your voice activity detection

#

I'll check back in about the custom llm option

static crater
#

yes also for us it seems super volatile in terms of speed (sometimes super fast, sometimes super slow) - do you know what this could be? and can you tell me what format you need for us to trigger a language change? i tried a format analogous to the 'end_call' but can't get it to work

uncut panther
#

@distant canopy could you advise on the language change tool?

#

@static crater are you also using custom llm?

#

@forest dirge can you send a recent one that was slow?

#

@static crater also if you can share one would be great

static crater
# uncut panther @static crater are you also using custom llm?

yes I'm using a custom llm which is trying to change the language using your standard system tool call 'language_detection' (i.e. exactly the same way as we are using 'end_call' - which works really nicely). We have the tool 'activated' in the convo-ai ui. @distant canopy would be fantastic if you can advise on what I'm doing wrong - i suspect some kind of syntax issue - have tried quite a lot of different versions (type = 'system', type = 'function', with and without another empty delta: {} trigger at the end).

this is our current structure (following open-ai schema - finish reason is none since we have normal text snippets that follow a language change):

static crater
#

(more context for the request above) >> just to check if it works i passed the arguments to the end_call tool and see that they are 'correctly' passed to elevenlabs >> see below (i.e. the method/approach of passing data seems to work) >> perhaps your language_detection tool follows a slightly different nomenclature/style?

distant canopy
#

so for sure the type should be system

static crater
#

ok i think i already tried that but ill try again. (FYI when we do end_call the type is function and works)

distant canopy
#

and it doesnt work for system?

static crater
#

nope

#

i just did another test with 'system' (unsuccessful) >> if you have logging on your side please have a look at call k3IOk3s1cCLUczy0mbki
on our side i can confirm we sent the tool-call (before the text Hallo! Natürlich, ich kann ....)

distant canopy
#

thx lukas

#

gotta jump onto the call but think i have found sth

#

not 100% sure

static crater
#

alright - appreciate the help 🙂

distant canopy
#

will take a look at that id when im back

#

bis bald

static crater
#

🤣

distant canopy
#

servus

#

looking at this again

#

i think it's because you need to stream back a final finish_reason = "tool_calls" at the end

static crater
#

ok, pretty sure i tried that as well yesterday but will try again. >> will i still be able to continue streaming text after i do that or will it end the connection?

#

(or do i send the final finish reason after the last piece of text)?

distant canopy
#

you can still return text
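
For reference, the sequence being discussed here (a tool-call delta, then a chunk with finish_reason "tool_calls", then more text) can be written out as OpenAI-style `chat.completion.chunk` events. This is a sketch under assumptions: the chunk layout follows the public OpenAI streaming schema, but the helper names, the `call_1` id, the model name, and the example `language` argument are placeholders, and nothing in this thread confirms that exactly this sequence is accepted for `language_detection`.

```python
import json
from typing import Iterator, Optional


def chunk(delta: dict, finish_reason: Optional[str] = None) -> dict:
    """Build one OpenAI-style streaming chunk with a single choice."""
    return {
        "id": "chatcmpl-example",  # placeholder id
        "object": "chat.completion.chunk",
        "created": 0,              # placeholder timestamp
        "model": "custom-llm",     # placeholder model name
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }


def tool_call_then_text(tool_name: str, tool_type: str, arguments: dict, text: str) -> Iterator[str]:
    """Yield SSE lines: tool-call delta, finish_reason='tool_calls', then text."""
    call = {
        "index": 0,
        "id": "call_1",  # placeholder id
        "type": tool_type,
        # In the OpenAI schema, arguments is a JSON-encoded string, not an object.
        "function": {"name": tool_name, "arguments": json.dumps(arguments)},
    }
    events = [
        chunk({"tool_calls": [call]}),
        chunk({}, finish_reason="tool_calls"),
        chunk({"content": text}),
        chunk({}, finish_reason="stop"),
    ]
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"
    yield "data: [DONE]\n\n"
```

Calling `tool_call_then_text("language_detection", "system", {"language": "de"}, "Hallo!")` produces five SSE lines that can be diffed against what the custom LLM server actually emits; the argument name `language` is a guess, not a documented parameter.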

static crater
#

ok, ill give it a try and report back in 10 min

distant canopy
#

thx

#

first chunk looks good though

#

that can stay the same

static crater
#

still no luck.. see call id WgEOeJIlErcvhzYtJhzs

#

for reference @distant canopy:

distant canopy
#

i see no errors in the logs but can see that your agent just gets blocked and then times out

#

very weird

static crater
#

i just tried again with the old openai finish_reason = 'function_call' (instead of tool_calls) and that also didnt work.. interestingly when finish_reason = 'tool_calls' it times out / hangs up, when finish_reason = 'function_call' it just ignores the tool call.

distant canopy
#

yeah i think it's because in our codebase we have something like this:

    if output.choices[0].delta.content:
        yield the text
    elif output.choices[0].delta.tool_calls:
        set tool info
    elif output.choices[0].finish_reason == "tool_calls":
        yield tool request
    elif output.choices[0].finish_reason == "stop":
        yield tool request
        break

#

so function_call is just ignored

#

but something wrong is happening with "tool_calls" but i can see that it is setting the tool info correctly
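
The branching described above can be turned into a small runnable sketch to see why "function_call" is dropped. This is not the actual platform code: `dispatch` and the event tuples are hypothetical, chunks are plain dicts rather than SDK objects, and two details are assumptions (the recorded tool info is cleared once a request has been emitted, and "stop" only emits a request if one is still pending).

```python
from typing import Iterable, Iterator, Optional, Tuple

Event = Tuple[str, object]


def dispatch(chunks: Iterable[dict]) -> Iterator[Event]:
    """Turn OpenAI-style streaming chunks into ("text", ...) / ("tool_request", ...) events."""
    tool_info: Optional[list] = None
    for output in chunks:
        choice = output["choices"][0]
        delta = choice.get("delta") or {}
        finish_reason = choice.get("finish_reason")
        if delta.get("content"):
            yield ("text", delta["content"])
        elif delta.get("tool_calls"):
            tool_info = delta["tool_calls"]  # remember the call, emit nothing yet
        elif finish_reason == "tool_calls":
            yield ("tool_request", tool_info)
            tool_info = None  # assumption: do not emit the same request twice
        elif finish_reason == "stop":
            # Assumption: only emit a request here if one is still pending.
            if tool_info is not None:
                yield ("tool_request", tool_info)
            break
        # Any other finish_reason, e.g. "function_call", matches no branch
        # and is silently skipped.
```

With this shape, a stream that sets the tool info and then finishes with "function_call" never produces a tool request, which is consistent with the tool call being ignored in that case. It does not explain the hang seen with "tool_calls".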

static crater
#

hm ok - could it be something with the format of the arguments (currently assuming arguments within tool_calls needs to be in json format)
.. or a minimum timeout needed after setting tool info and sending the finish_reason = tool_calls? (see screenshot above - both are essentially sent 'immediately' after each other)

static crater
#

tried adding a 0.05 second delay between setting the tool and 'finish_reason' = 'tool_calls' - also didn't work

forest dirge
#

Seems faster, but I’m realizing my previous traces were incorrect. You seem to be hitting custom llm before the transcript. I thought you were doing the opposite.

uncut panther
#

yeah we send transcript after because we only commit to transcript after we send the audio

#

user can resume speech and its transcript will change

static crater
distant canopy
#

hey sorry i didnt get round to this today, will take a look this weekend

static crater
#

can you give me a short heads up once you've looked into this so i can reactivate the feature on our agents? would be super important for us to get multi-lingual support working

#

(sorry to be a pain - appreciate that you will most definitely have a full backlog)