#newtreyes-webhook-delivery
Can you provide an example of what you mean? If you're not subscribed to an event type when it is emitted, there will not be any future delivery of that event.
It seems that when a webhook endpoint's health is bad, Stripe queues events for later delivery instead of sending them right away.
It is not that the whole endpoint is turned off but only that some events are not sent immediately (only queued to be sent later on).
This was not an expected behavior. We were expecting that events were either sent immediately (even when failing) or that the whole endpoint was turned off when the endpoint health was bad.
But we were seeing some events arrive while other events were just queued.
That causes out-of-order events, with delays on some events ranging from minutes to hours.
So we wanted to read about how this works and if there is any way we can have control over this "throttling" mechanism.
Right, so we may do this in test mode when your endpoint seems particularly unhealthy (lots of errors) ahead of fully disabling it. There is no control over this behaviour as it's not an expected situation. You should ensure your endpoint is able to handle the traffic you've subscribed to, or otherwise disable it.
Interesting
So in production, this does not happen?
A production endpoint is either enabled or disabled, and it can be disabled by the user or by Stripe in cases like these.
but if disabled, it will be completely disabled.
Am I reading that correctly?
but if disabled, it will be completely disabled.
what do you mean by "completely disabled", not sure I understand
but re:
Are there any docs about how Stripe queues events when a specific webhook endpoint starts returning errors?
I don't think there's anything documented here
So we wanted to read about how this works and if there is any way we can have control over this "throttling" mechanism.
no control over this either. Basically, the best thing to do is build resilient webhook handlers that expect events out of order, and also make sure that your webhook handler is not returning errors or timing out
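For illustration, here is a minimal sketch of a handler that tolerates duplicate and out-of-order deliveries. It assumes only the `id`, `created`, and `data.object` fields of the event payload; `EventStore` and `handle_event` are hypothetical names, and the in-memory store stands in for whatever database you actually use. Note that `created` has one-second resolution, so ties are treated as "apply".

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """In-memory stand-in for a database (hypothetical, not a Stripe API)."""
    seen_event_ids: set = field(default_factory=set)
    # Maps object id -> (created timestamp of the event, object state).
    latest_by_object: dict = field(default_factory=dict)


def handle_event(store: EventStore, event: dict) -> str:
    """Apply an event unless it is a duplicate or older than what we hold."""
    event_id = event["id"]
    if event_id in store.seen_event_ids:
        # Retries can deliver the same event more than once.
        return "duplicate"
    store.seen_event_ids.add(event_id)

    obj = event["data"]["object"]
    created = event["created"]
    previous = store.latest_by_object.get(obj["id"])
    if previous is not None and previous[0] > created:
        # A newer event for this object was already applied; skip this one.
        return "stale"

    store.latest_by_object[obj["id"]] = (created, obj)
    return "applied"
```

Another option, if timestamp comparison is too coarse for your case, is to treat the event as a notification only and fetch the current state of the object from the API before acting on it.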
what do you mean by "completely disabled", not sure I understand
Deactivated because of its health. I mean, we were expecting an endpoint to be deactivated. We were not expecting some events arriving and other events not arriving until minutes / hours later.
We were not expecting some events arriving and other events not arriving until minutes / hours later.
IIRC for unhealthy webhook endpoints, the rate at which events are sent can be slowed down. Don't think there's any documented duration either (mins vs hrs). Do you have an example of a webhook event that was sent hours after creation though? Can have a look
@hybrid dock Is this the original thread relating to the screenshot you just posted in the main channel?
Yes
Event id: evt_3JbqeALNS7fsUd0U1oZN8EMq
As you can see, the event was not even attempted
but other events were sent.
Gotcha - give me a few minutes to read back!
Sorry for the wait - I'm still looking!
@hybrid dock Hey, I'm a bit confused by your explanations earlier in the thread. To my knowledge we do not do the behaviour you are describing, where we'd hold events for minutes or hours on end. Events arriving out of order is totally normal and something you have to be resilient against, but that's it.
The event id you gave, I'm not sure what you need me to look for on that one
That event was created on 2021-09-20 17:51:26 UTC and then sent immediately to your endpoint, where your server failed to respond so we retried it an hour later
ugh, and now that I say all of this, I'm clearly wrong and someone changed that logic, my bad
let me dig into this, this doesn't match my understanding of how webhook delivery works
Okayyyy, sorry for the confusion. We talked as a team and it is expected and it's been like this for a year. I have no memory of knowing this but I see I even read the announcement email internally
Based on the code and internal documentation, this behaviour is only for Test mode today. The idea is to stop overwhelming our servers with an endpoint that is mis-behaving in Test mode and failing sometimes millions of events per hour
Looking at the code, it could happen in Live mode one day if your endpoint is clearly down/mis-behaving, but it's only enabled for Test mode today because that's where the majority of the bad load was coming from.
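On the earlier advice about not returning errors or timing out: one common way to keep an endpoint healthy is to acknowledge the request quickly and do the slow work elsewhere. Below is a rough sketch of that pattern using only the standard library. `receive_webhook`, `worker`, and the in-process queue are hypothetical stand-ins (a real setup would likely use a durable queue), and signature verification is omitted here for brevity.

```python
import json
import queue
import threading

work_queue: queue.Queue = queue.Queue()
processed: list = []  # Stand-in for the side effects of real business logic.


def receive_webhook(raw_body: bytes) -> int:
    """Return an HTTP status code; only parse and enqueue, never do slow work."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return 400
    if not isinstance(event, dict):
        return 400
    work_queue.put(event)
    return 200


def worker() -> None:
    """Drain the queue in the background; a None item stops the loop."""
    while True:
        event = work_queue.get()
        try:
            if event is None:
                break
            # Placeholder for slow processing (database writes, emails, ...).
            processed.append(event.get("id"))
        finally:
            work_queue.task_done()
```

Usage would be along the lines of starting `threading.Thread(target=worker, daemon=True)` once at startup and calling `receive_webhook(request_body)` from the HTTP route, returning its status code immediately.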