#function execution timeout


iron saffron
#

We are still struggling with the timeout issue when executing the function. I just migrated from self-hosted to Appwrite Cloud, but I still face the problem: 1-2 out of 10 requests are failing.

we need to check the logs and fix the problem.

failed function id: 65bca0e9badb30f85e9d
project id: 65bc9440a1dc894315ea

#

<@&564164014339391498> who can check the logs?

sudden fossil
pseudo plover
#

@iron saffron additionally, can you share which runtime you're using and if there are any differences in input data or any other surrounding details?

crisp hatch
iron saffron
#

I use nodejs 18

iron saffron
iron saffron
#

It would be great if you could help. We are building a startup with very good prospects. We initially chose Appwrite and wrote our functions on it. Now we face a choice: rewrite the entire project for another backend, or move to Firebase or something else.

vale peak
#

A timeout error means the code didn't finish within the time limit. In JS this could mean some await that never finishes; I often see this with Promise.all.
In the context you get req and res, but also log.
I would recommend using the log() function in a few places in your code. When the function times out, you still see the logs that occurred before. That way you can find the spot in your code that is causing the freeze.
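A minimal sketch of that checkpoint-logging idea, assuming the `({ req, res, log, error })` context. `loadBranch` and `notifyPoster` are hypothetical stand-ins for the real steps, and reading the id from `req.query` is also an assumption:

```javascript
// Hypothetical steps standing in for the real work (database call, external API call).
async function loadBranch(id) {
  return { $id: id };
}

async function notifyPoster(branch) {
  return { ok: true, branch: branch.$id };
}

// In an Appwrite function this handler would be the file's default export.
async function handler({ req, res, log, error }) {
  log('1/3 function started');
  try {
    // Assumption: the branch id arrives as a query parameter.
    const branch = await loadBranch(req.query.branchId);
    log('2/3 branch loaded');

    const result = await notifyPoster(branch);
    log('3/3 poster notified');

    return res.send(JSON.stringify(result));
  } catch (err) {
    // Log the failure so it does not surface only as a timeout.
    error(`failed: ${err.message}`);
    return res.send('failed', 500);
  }
}
```

The last checkpoint that shows up in the execution logs marks the step after which the code stalled.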

#

You could also write a test file that executes the function with some sample data and run it locally with a proper debugger, if that's what you would prefer.
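One possible shape for such a test file, as a sketch only. `runLocally` and `sampleHandler` are hypothetical names, and the fake context mimics only the `req`, `res.send`, `log` and `error` members that the handler uses:

```javascript
// Builds a fake context shaped like ({ req, res, log, error }) and records output.
async function runLocally(handler, { method = 'GET', path = '/', body = '' } = {}) {
  const logs = [];
  const errors = [];
  const context = {
    req: { method, path, body },
    res: { send: (text, status = 200) => ({ text, status }) },
    log: (message) => logs.push(String(message)),
    error: (message) => errors.push(String(message)),
  };
  const response = await handler(context);
  return { response, logs, errors };
}

// Example handler under test; replace it with the real function.
async function sampleHandler({ req, res, log }) {
  log(`handling ${req.method} ${req.path}`);
  return res.send('Hello, World!');
}

// Usage: runLocally(sampleHandler, { path: '/' }).then(console.log);
```

Running this under a debugger lets you set breakpoints inside the handler, which the hosted runtime does not allow.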

iron saffron
#

When we get the timeout, no logs remain. I already tried this.

vale peak
iron saffron
#

logs empty

#

error
Operation timed out after 30000 milliseconds with 0 bytes received with status code 0\nError Code: 0

vale peak
#

Hmm 🤔 That's either a bug, the code doing a CPU-heavy operation for over 1 second, or a timeout outside of your code.
If that's fine by you, could you please give me access to the repository? I want to check for possible problems with the use of await in combination with arrays, or possibly problematic libraries.

#

So we only have 2 more options ✨ Slowly getting there 😅

iron saffron
iron saffron
#

it seems to me that the request is stuck somewhere.

#

Have you received an invite?

vale peak
#

Yes, thanks. I accepted and checked out the code.
I can't spot any problem. The only thing that comes to mind is something getting stuck while trying to communicate with the Appwrite database or your PosterService API.

#

But that doesn't explain why the logs from before the timeout are missing.

heady gull
vale peak
#

Can you please go to the function settings and set the timeout to, let's say, 15 seconds? Looking at your execution screenshot, it usually takes under 1 second, so this shouldn't affect anything.

iron saffron
vale peak
iron saffron
vale peak
#

Let's set it back to 15 and let's try again to add some logs, one immediately after the start of the function.
If Steven's concern is right, we should see logs now that the timeout is 15 seconds.

iron saffron
#

after some recommendations, I even migrated my self-hosting to the appwrite cloud, but in some cases I get a timeout there.

vale peak
#

If we manage to isolate the problem, I'll set up a function to reproduce it and fix it quickly.

iron saffron
#

There is a change.
error logs: Execution timed out.

vale peak
#

Amazing, yeah, that is expected. Cool, I'm creating an issue for myself to ensure sync executions time out properly.

#

Now you can try to add some verbose logs to find out in which part the code freezes.

#

I understand this will be an annoying process; we are working on improving this experience in the next version of functions.

vale peak
#

I'll be stepping away for a while, I'll be back in ~1 hour.

iron saffron
#

There may be a problem when calling getBranch.

#
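    // Fetches a branch document by id; any SDK failure is reported as "not found".
    async getBranch(id: string): Promise<Branch> {
        try {
            return await this.databases.getDocument(this.databaseId, this.branchCollectionId, id);
        } catch {
            throw new Error('Branch not found'); // original message: 'Филиал не найден'
        }
    }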

but it seems to me that this function is not very heavy.
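One way to check whether this call is where execution stalls is to race it against a timer, so a hang shows up as a logged error instead of a silent timeout. A sketch only: `withTimeout` is a hypothetical helper, not part of the Appwrite SDK, and it does not cancel the underlying request.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside getBranch (names taken from the snippet above):
//   return await withTimeout(
//     this.databases.getDocument(this.databaseId, this.branchCollectionId, id),
//     5000,
//     'getBranch',
//   );
```

Because the original catch block replaces every error with "not found", logging the caught error before rethrowing would also help tell a timeout apart from a missing document.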

vale peak
#

^ Both, seems like. Currently on Cloud

vale peak
iron saffron
#

my branch model

vale peak
#

Relationships 🤔🤔🤔

vale peak
# iron saffron my branch model

I'll ping @ebon valley here, seems like we are getting a function timeout on databases.getDocument with this schema ^
@ebon valley can you see a possible reason why this could cause a freeze?

#

(Jake is in a different timezone, so he will likely only answer tomorrow. Or Monday.)

iron saffron
#

if we find a problem, it will help a lot. I've seen a lot of requests in the forum.))

#

we used to use Bun, because of the timeout we switched to nodejs 18, but even then the problem did not disappear.

#

receipt model

vale peak
#

You already provided a lot of useful information and I am very grateful 🙌
We will do our best to identify the problem and get it fixed.

With that said, if you would be willing to try to reproduce it on a new project with a simpler collections & relations setup, it would be highly appreciated. With a simple flow we could quickly reproduce it, fix it, and even introduce a test for that specific scenario to ensure it never happens again.

iron saffron
ebon valley
# iron saffron receipt model

Are you able to share the full schema of your relationships? So far I see you have:

branches <-> organization
branches <-> receipts
branches -> pos_config
branches -> address
receipts <-> transactions
receipts -> pos_system

iron saffron
#

Yesterday I created a demo project with exact copies. I can invite you there if you give me your email.

#

@ebon valley

iron saffron
#

wow, I can't, only one member is allowed

#

i will send personal message

#

@ebon valley

iron saffron
ebon valley
#

Or just share the full code over DM if it's easier

iron saffron
#

I sent the invite

#

@ebon valley I can invite you to join the appwrite which is self-hosting installed.

#

more errors are observed there

#

pls send me your email

iron saffron
#

can relationship queries turn into recursion?

#

Could a two-way relationship be a source of recursion?

ebon valley
# iron saffron can relationship queries turn into recursion?

Yeah, internally the relationships are fetched recursively, but there's a max of 3 levels of recursion, which is why on the third level you get null for any further relationships.

We also detect cyclic relationships across collections and short-circuit them so they're only fetched once to avoid an infinite loop.
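A simplified model of depth-limited fetching with cycle short-circuiting, to illustrate the idea. This is not Appwrite's actual implementation: `store`, `relations` and `resolveDocument` are hypothetical in-memory stand-ins, and keeping the raw id on a cycle is an assumption made for the sketch.

```javascript
const MAX_DEPTH = 3; // depth limit quoted above

// `store` maps "collection/id" to a document.
// `relations` maps a collection to { attribute: targetCollection }.
function resolveDocument(store, relations, collection, id, depth = 1, path = []) {
  const doc = store[`${collection}/${id}`];
  if (!doc) return null;

  const resolved = { ...doc };
  const currentPath = [...path, collection];

  for (const [attribute, target] of Object.entries(relations[collection] ?? {})) {
    const relatedId = doc[attribute];
    if (depth >= MAX_DEPTH) {
      // Depth limit reached: further relationships come back as null.
      resolved[attribute] = null;
    } else if (currentPath.includes(target)) {
      // Cycle: this collection is already on the path, so do not expand it again.
      resolved[attribute] = relatedId;
    } else {
      resolved[attribute] = resolveDocument(store, relations, target, relatedId, depth + 1, currentPath);
    }
  }
  return resolved;
}
```

In this model a two-way relationship such as branches <-> organization stops after one hop in each direction, so it cannot recurse forever.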

Fetching from any of the collections via console, client and server SDKs are all working okay, so I would rule out a relationship bug.

Tried the same code with a PHP function instead and got the same issue so it's not node specific, seems to me to be function related. Back to you @vale peak 😅

iron saffron
ebon valley
iron saffron
ebon valley
vale peak
#

Morning 👋
@iron saffron Could you please share the code and credentials with me as well? I'll reproduce it on my end and pinpoint where in the function execution flow this happens. That will let us move forward and find the bug.

heady gull
ebon valley
iron saffron
#

The problem is definitely not the relationship.

heady gull
iron saffron
#

For example, we have an OTP module. Out of 10 requests, 2 failed (timeout).

#

the logic of this function is very simple.
we could not use the ftp services that you offer, as they are very expensive for our country.

  1. we create a session for the phone on behalf of the administrator.
  2. we send the received secret code to the user through a local service.
#

we get a timeout even for regular functions, especially if the function has not worked for a while.

vale peak
#

@heady gull I was able to reproduce it (very weirdly, we need to find better way to reproduce) on production. Next best steps for us would be:

  • Find easy way to reproduce, a simple function that can cause it + benchmarking script to trigger it. Should be as simple as Appwrite Function with query to listDocuments
  • Both function and database on Stage, and on the QA server. This will help us prove whether it's an infrastructure issue or not
  • Function on Stage (or QA) and Database on Production. If it works, it might be DNS issue
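A benchmarking script along those lines could look roughly like this. It is a sketch: `benchmark` and its options are hypothetical, and `invoke` is whatever triggers one execution (an HTTP request to the function, an SDK call, and so on) and rejects when the execution fails.

```javascript
// Runs `total` executions, `concurrency` at a time, and counts failures and slow runs.
async function benchmark(invoke, { total = 100, concurrency = 10, slowMs = 5000 } = {}) {
  const stats = { ok: 0, failed: 0, slow: 0 };
  let started = 0;

  async function worker() {
    // The check and increment run without an await in between, so no run is counted twice.
    while (started < total) {
      started += 1;
      const begin = Date.now();
      try {
        await invoke();
        stats.ok += 1;
      } catch {
        stats.failed += 1;
      }
      if (Date.now() - begin > slowMs) stats.slow += 1;
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return stats;
}
```

Running the same script against each combination in the list above (Stage, QA, mixed) would give comparable failure rates per environment.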

I have started investigating but some urgent tasks popped up, so I had to step away for a few days.

vapid gazelle
#

There are lots of posts with this problem. We also have it.

pine olive
iron saffron
#

Solving this bug would solve many problems.

pine olive
iron saffron
iron saffron
crisp hatch
#

Appreciate your patience and help!

#

These flakey bugs are the hardest to debug

pine olive
#

Any updates? I am still struggling with this and I can't even find my own support thread anymore 😦

iron saffron
#

@pine olive I'm also really looking forward to releasing Mate <@&634618551491100692>. This is a really critical bug. We are already considering switching to Firebase or other services. But Appwrite would be a handy tool for us, since a large amount of our business logic is already written on it.

pine olive
sudden fossil
karmic dove
#

In my experience from many times messing with Functions, it’s typically an uncaught error somewhere in my code tbh that causes this. Or too many promises when I was awaiting many at the same time
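For the "too many promises at the same time" case, one common workaround is to cap how many are in flight instead of passing everything to a single Promise.all. A sketch, with `mapWithLimit` as a hypothetical helper name:

```javascript
// Runs `task` over `items` with at most `limit` promises in flight at once.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    // Each worker pulls the next unclaimed index until none are left.
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await task(items[index], index);
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// Usage: const docs = await mapWithLimit(ids, 5, (id) => fetchOne(id));
// `fetchOne` is a placeholder for whatever call is being awaited.
```

Results keep the order of `items`, and a rejected task still rejects the whole call, so an uncaught error surfaces immediately.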

#

It seems like Appwrite’s fault because the error doesn’t tell us what’s happening, I don’t know why it has that error message, but usually I can fix it myself by debugging more and stopping execution before things using return statements to find where it happens

#

But once it works it won’t happen, at least it hasn’t to me in ~23,000 executions

iron saffron
iron saffron
karmic dove
iron saffron
iron saffron
karmic dove
#

Wait this is self hosted or cloud?

iron saffron
#

Self host and cloud

karmic dove
#

so I self host all my instances, is the issue primarily on cloud or primarily self host?

iron saffron
#

We noticed this problem with both usage scenarios.

karmic dove
#

Huh, is there a link to the reproduction of it?

#

or would you like me to try to help you debug tomorrow to see if we can find a workaround?

#

Maybe the Python SDK won’t do it, maybe it’s an issue with promises, I know it can be frustrating but there’s always an answer 🙂 worst case scenario I can help you spin up a portainer instance and host an API server or something to solve the issue so you can launch

pine olive
# karmic dove Huh, is there a link to the reproduction of it?

Can I send you my code that is having the issue? I have a simple function that either completes in milliseconds, or fails after 15 seconds. We're still in development but it's basically grabbing two documents from a collection (there are literally 2 in the entire collection), checking something and returning the value. With literally every single variable being equal through repeated tests, it fails anywhere from 20-50% of the time.

vale peak
#

Hey community 👋
I can see multiple threads about function timeouts. I'll post the update here, as this one seems the most active.

We managed to isolate and reproduce the problem. It affects both self-hosted and Cloud. Cloud is affected in rare scenarios, and self-hosted anywhere from commonly to rarely, depending on the number of CPU cores.

The solution for self-hosted will be released in a 1.5.x patch version. I would still recommend increasing the number of CPU cores on self-hosted machines to reduce the problem, or scaling horizontally, as Appwrite API containers are stateless.

For Cloud users, this week we will work on multiple solutions, each reducing the occurrence of function timeouts. The first fixes will be deployed as soon as today or tomorrow. To our knowledge, the combination of the proposed solutions should make a very big difference in both occurrence rate and recovery speed.

Those are quick short-term solutions, but we also plan to work on a long-term solution to fully avoid this problem for all users and allow even better scaling. While such a solution doesn't take too much time, a lot of preparation in our libraries needs to be done, and we want to test these changes very thoroughly. I don't have an ETA for it, but we plan to start working on it soon after the quick fixes mentioned above are deployed.


Thanks for all the reports and information you all provided, without your active feedback this wouldn't be possible ❤️ With that said, now Ill start working on the fixes so we can celebrate soon 🙌

twin horizon
karmic dove
twin horizon
#

Still, 6 cores for just that, since it's not a production-used docker yet, is really unnecessary
so basically appwrite is doing nothing, so all cores are basically just for that one function, which fails anyway

twin horizon
#

Since _APP_FUNCTIONS_CPUS is 0, and last time you said worker_per_core in the other thread (which is on 8 atm), i'm guessing you mean worker_per_core
(which to be working correctly, I guess this needs to be 12+ at the moment)

vale peak
#

For instance, Open Runtimes Proxy uses PER_CORE=100 and works perfectly fine. The problem is that Appwrite will make 1 DB connection for each worker thread, so you need to make sure your SQL server can accept as many connections as you increase to, including all workers. Best to try and fail to find the limit (with a benchmark).
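A rough way to estimate that before benchmarking, as a sketch. Multiplying cores by workers per core and by container count is an assumption about how workers are counted, and the `reserved` headroom and the limit of 151 in the usage lines are example numbers only:

```javascript
// One DB connection per worker, as described above.
// Assumption: total workers = CPU cores x workers per core x API containers.
function requiredDbConnections({ cpuCores, workersPerCore, apiContainers = 1 }) {
  return cpuCores * workersPerCore * apiContainers;
}

// Leaves `reserved` connections free for other services that use the database.
function fitsDatabase(settings, maxConnections, reserved = 10) {
  return requiredDbConnections(settings) <= maxConnections - reserved;
}

// Usage with the numbers mentioned in this thread (6 cores, 8 workers per core):
//   requiredDbConnections({ cpuCores: 6, workersPerCore: 8 })   // 48
//   fitsDatabase({ cpuCores: 6, workersPerCore: 100 }, 151)     // false
```

The estimate only narrows the range; the real limit still has to be found with a benchmark, as suggested above.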

#

✨We got some new updates ✨

I have been doing benchmarks all day today. One bottleneck was fixed and confirmed to work as expected, but it had only a little impact on this function timeout issue (but it did have an impact 🎉). It's a good thing it got fixed, as it would have become a bottleneck very soon.

Another fix was implemented today, and we expect much higher impact. I am waiting for PR reviews before I do benchmark again.

#

This second fix will have the same impact on both Cloud and self-hosted. The first fix only affected instances using Proxy for functions (Cloud + a few self-hosted instances).

twin horizon
vale peak
pine olive
vale peak
vapid gazelle
twin horizon
#

@vale peak Imagine doing a normal:

import { getUsers } from './utils/getUsers.js'

export default async({ req, res, log, error }) => {
  if (req.method === 'GET') {
    switch (req.path) {
      case '/':
        return res.send('Hello, World!')
      case '/getUsers':
        const users = await getUsers()
        if (!users) {
          return res.send('Error fetching users')
        }
        return res.send(users)
      default:
        return res.send('Not found')
    }
  }
};

With my /getUsers endpoint being:

export async function getUsers () {
  const response = await fetch(
    `${process.env.APPWRITE_API_URL}/v1/databases/65527f2aafa5338cdb57/collections/65564fa28d1942747a72/documents`,
    {
      method: 'GET',
      headers: {
        'Content-Type': 'application/json',
        'X-Appwrite-Project': `${process.env.APPWRITE_PROJECT_ID}`,
        'X-Appwrite-Response-Format': '1.4.0',
      },
    },
  )

  if (!response.ok) {
    return false
  }

  const data = await response.json()
  return data.documents
}

It's timing out, how? It's a simple fetch 1lolol

vale peak
#

--

Update time

We have recently finished the libraries needed for our long-term solution that will prevent this problem, and we will be starting the implementation in Appwrite. Fingers crossed for a smooth transition 🤞

twin horizon
pallid pewter
vale peak
#

It's a rare scenario in which one API container's threads are filled only with synchronous executions, and a request going out from an execution is randomly assigned to that same container, so it stays frozen until the execution times out.
I consider these rare because there need to be as many requests from executions as there are threads, times the number of API containers in the infrastructure.
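A toy model of the freeze described above, to make the mechanism concrete. It is not Swoole's or Appwrite's actual scheduler: `WorkerPool`, `runExecution` and `simulate` are hypothetical, one pool stands for one API container, and the timings are shrunk to milliseconds.

```javascript
// One API container with a fixed number of worker slots.
class WorkerPool {
  constructor(size) {
    this.free = size;
    this.waiting = [];
  }

  // Resolves to true once a slot is granted, or false if none frees up within `timeoutMs`.
  acquire(timeoutMs) {
    if (this.free > 0) {
      this.free -= 1;
      return Promise.resolve(true);
    }
    return new Promise((resolve) => {
      const entry = { resolve };
      entry.timer = setTimeout(() => {
        this.waiting = this.waiting.filter((w) => w !== entry);
        resolve(false);
      }, timeoutMs);
      this.waiting.push(entry);
    });
  }

  release() {
    const entry = this.waiting.shift();
    if (entry) {
      clearTimeout(entry.timer);
      entry.resolve(true); // hand the slot straight to the next waiter
    } else {
      this.free += 1;
    }
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// A synchronous execution holds one worker while its function calls back into
// the same container, which needs a second worker to answer.
async function runExecution(pool, timeoutMs) {
  if (!(await pool.acquire(timeoutMs))) return 'rejected';
  try {
    await sleep(1); // let the other executions start first
    if (!(await pool.acquire(timeoutMs))) return 'timeout';
    await sleep(1); // the nested API request is served
    pool.release();
    return 'ok';
  } finally {
    pool.release();
  }
}

async function simulate(workers, executions, timeoutMs = 50) {
  const pool = new WorkerPool(workers);
  return Promise.all(Array.from({ length: executions }, () => runExecution(pool, timeoutMs)));
}
```

In this model, when executions occupy every worker (for example `simulate(2, 2)`), the nested requests have no worker left to serve them and at least one execution ends in 'timeout'; with one spare worker (`simulate(3, 2)`) they complete.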

The solution we are attempting will let us run 10-50x as many concurrently working threads, which should mitigate those occurrences for many months. In the meantime we might shift to a coroutine approach, or a different solution we find.

pallid pewter
vale peak
#

We made great progress on a long-term solution ✨ The benchmark shows much better concurrency results. Benchmark req/sec results are worse, but that's expected as the solution switches from a multi-core to a single-core approach. With 8 containers on 8 CPU cores, results are much much better 🎉

We also noticed that the function timeout issue isn't 100% solved with this, so we continued the investigation on that front. As far as I understand, there is some bug with Node 18 specifically because Node 21 and Python 3.9 don't have the same issue. I am working actively on understanding the cause and proposing a solution to that too.

iron saffron
twin horizon
#

lol, my import function is no longer "partly failing", it is now "all failing" 2HAhaa

vale peak
#

Update time ✨

I got very good news, we spotted and managed to reproduce a scenario when a freeze is caused by Swoole, an underlying runtime that Appwrite uses. For anyone interested in technical details, you can check out a public issue: https://github.com/swoole/swoole-src/issues/5274

No spam on the issue please, lets be respectful towards Swoole maintainers 🙏 They are doing an amazing job 🔥

While waiting to hear back from them, we are considering infrastructure changes that will prevent dependent requests and, with them, the timeout issues.

For Cloud users, we will handle it all.
For self-hosted, to make those changes you will need to spin up a second instance of the Appwrite API, expose it to the runtimes network, and use it instead of your domain endpoint inside your Functions code.

twin horizon
#

also is that a permanent solution or just to temporarily fix it?

vale peak
# twin horizon also is that a permanent solution or just to temporarily fix it?

From the feedback we got on the issue, it seems like we can customize Swoole for our needs on this front. I will be trying it next week and will likely deploy it to Cloud. If we manage to resolve the timeout issues on Cloud, we will release it to open source as well. I'll fight for a 1.5.x patch version for this bugfix. If all of that goes well, simply upgrading your Appwrite version will resolve the problem. I'll keep you posted 🙌

#

Oh and yes, it will be a permanent fix. We will still explore coroutines, as they can speed up Appwrite APIs in general, but that will no longer be made urgent by this function timeout issue.

twin horizon
vale peak
# twin horizon Hey, any update? Would like to publish some sites soon and waiting on this.

Good morning 👋
We managed to customize how the underlying HTTP server distributes work, and managed to prevent the freeze, avoiding all function timeouts. As per our understanding, this may worsen performance a little, but that's perfectly fine, as we can always scale horizontally. Right now we are benchmarking the before and after, and are testing the solution on stage to ensure stability and check memory usage.

Soon, we plan to deploy this on Cloud, likely next week.

With a successful deployment on Cloud and no new reports of this issue, we will release this as a patch version in 1.5.x.

For more technical details, you can check out a PR that addresses this problem: https://github.com/appwrite/appwrite/pull/7855

twin horizon
vale peak
twin horizon
#

Thanks!

fossil pine
trim coral
fossil pine
#

why🥹

twin horizon
vale peak
# twin horizon Any update on the cloud deployment, Matej? 🙂

Good morning ☀️

We deployed an update to Cloud and are currently monitoring the effect. I am not actively working on it right now, so I sadly don't have all the details. What's very positive is that the deployment was a success: we made a huge change to the underlying logic of request processing, and everything remains stable. There are some configurations we might need to adjust before our final solution.

vale peak
fossil pine
vale peak
#

Such amazing news! 🔥 cc @sudden fossil @cosmic dew

hybrid umbra
#

so this issue is still present in self-hosted right?

faint basin
twin horizon
vale peak
# twin horizon Heyo, any update? :)

Hi 👋 We are currently in a waiting phase. From your feedback we believe the problem is resolved, but on Sentry we are seeing an almost identical number of timeouts as before. We haven't yet managed to draw a conclusion from that, and we are waiting for a bit more data before we do. I'll talk to Eldad about it by the end of this week, and once we mark it as resolved for Cloud, we will add it to self-hosted.
cc @cosmic dew, if you have any more relevant info

twin horizon
#

That's a bummer! It's almost like you "suppressed" the error? haha

pine olive
#

Any updates on this?

faint basin
twin horizon
wooden flax
faint basin
twin horizon
#

@vale peak grr

vale peak
#

A few folks are OOO, once they get back we will sync all cloud changes to open-source. There are more bug fixes, so we can release them all in 1 patch version

#

I think all could be merged by end of this week

trim coral
vale peak
iron saffron
#

@vale peak wow good job 🔥🔥🔥

twin horizon
twin horizon
#

bump !

wooden flax
#

Bump also :p

vale peak
#

Good news. Thanks everyone for pushing on this topic. You made it clear this fix is important. We considered it a temporary fix, and we still do, until we have proper concurrency support. With that said, it helped Cloud in the meantime, and it can help self-hosted instances too.

We will be releasing the fixes in 1.5.7, next patch release.

faint basin
#

@vale peak @iron saffron I think we can mark this as solved now? 🙂

twin horizon
#

Ye

iron saffron
twin horizon
faint basin
#

Test it yourself and confirm

silk monolith
#

This fixed all timeout issues for me on self-hosted.

misty lodge
vale peak
vagrant lodge
#

Also, I can confirm. Version 1.5.7 has fixed this bug.