#Please rate and give advice on my design of retry system for async requests


azure skiff
#

What we have:

  • a third-party API that is really unreliable
  • events, for each of which we have to get a response from this API
  • the API is async, so it works like this:
    1. A first cron job makes requests to it. The API doesn't return the real result, only a message id, which is saved to the DB
    2. A second cron job then goes over the saved message ids and tries to get the real response from the API

Problem:

  • this API might go down for something like a day, so the usual short retries don't really work. We are looking at a 3-5 day retry window lol

My solution:

  • create a table for storing info about retries
  • if a request fails (from either job), check the table: if the time of the first retry is null, or the time elapsed since it is less than the retry period, put the item back into its table to be picked up again later by the first or the second job. Otherwise, fail
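
A minimal sketch of that check, just to make the rule concrete. The names `decide`, `first_failed_at` and `RETRY_WINDOW` are made up for illustration, and the 5 day window is an assumption taken from the upper end of the 3-5 day range above:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: upper end of the 3-5 day retry window mentioned above.
RETRY_WINDOW = timedelta(days=5)


def decide(first_failed_at: Optional[datetime], now: datetime,
           window: timedelta = RETRY_WINDOW) -> str:
    """Return "requeue" while the item is inside the retry window, else "fail"."""
    if first_failed_at is None:
        # First failure for this item: the caller should also store `now`
        # as the first-retry time in the retries table.
        return "requeue"
    if now - first_failed_at < window:
        # Still inside the window: put it back for one of the cron jobs.
        return "requeue"
    # Window exhausted: give up on this item.
    return "fail"


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(decide(None, now))                      # first failure
print(decide(now - timedelta(days=1), now))   # inside the window
print(decide(now - timedelta(days=6), now))   # outside the window
```

One thing this makes visible: the decision only depends on when the first failure happened, not on how many attempts were made, so how often the cron jobs run decides how many requests the API actually receives during the window.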

The solution seems really simple, so I wonder what I might have missed. I would be grateful for your ideas and advice
Of course I'm gonna talk to my tech lead about this, but I want to eliminate "stupid mistakes" before that

spice kilnBOT
#

Here is an AI assisted attempt to answer your question 🤖. Maybe it helps! In any case, a human is on the way 👍. To continue talking to the AI, you can use /chatgpt.

#

design of a retry system for handling async requests to this API.

Rating: 8/10

Advice:

  1. Consider implementing an exponential backoff algorithm for retrying failed requests. This means that each subsequent retry will have a longer delay than the previous one, giving the API more time to recover. This can help prevent overwhelming the API with too many retries in a short period of time.

  2. Implement a maximum number of retries to prevent an infinite loop in case the API is down for an extended period of time. You can set a reasonable limit based on your application's requirements and the expected downtime of the API.

  3. Add some form of jitter to the retry delays to avoid all requests being retried at exactly the same time. This can help distribute the load on the API more evenly and reduce contention.

  4. Consider implementing a circuit breaker pattern to temporarily stop sending requests to the API if it consistently fails over a certain threshold. This can help protect your application from continuously retrying requests that are unlikely to succeed.

  5. Monitor and log failed requests and their corresponding retries for debugging purposes. This will help you identify any patterns or issues with the third-party API and adjust your retry strategy accordingly.

  6. If possible, consider caching responses from successful requests for a certain period of time to reduce reliance on the unreliable third-party API. This can help improve performance and availability of your application.

  7. Test your retry system thoroughly by simulating different failure scenarios, such as network errors or timeouts, to ensure it behaves as expected and handles failures gracefully.
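
Points 1 to 3 can be combined in one small function. The following is only a sketch under assumed values: `BASE_DELAY`, `MAX_DELAY`, `MAX_ATTEMPTS` and the function name `next_delay` are illustrative and not part of the original design. It uses the "full jitter" variant, where the delay is drawn uniformly between zero and the capped exponential value; other jitter schemes would work too.

```python
import random
from datetime import timedelta
from typing import Optional

BASE_DELAY = timedelta(minutes=10)  # assumption: delay ceiling for the first retry
MAX_DELAY = timedelta(hours=6)      # assumption: cap so waits never grow past this
MAX_ATTEMPTS = 30                   # assumption: hard limit on retries per item


def next_delay(attempt: int,
               rng: Optional[random.Random] = None) -> Optional[timedelta]:
    """Delay before retry number `attempt` (1-based), or None when exhausted."""
    if attempt > MAX_ATTEMPTS:
        return None  # point 2: stop retrying after a maximum number of attempts
    if rng is None:
        rng = random.Random()
    # Point 1: exponential growth, capped at MAX_DELAY.
    ceiling = min(MAX_DELAY.total_seconds(),
                  BASE_DELAY.total_seconds() * 2 ** (attempt - 1))
    # Point 3: full jitter, a uniform draw between 0 and the ceiling.
    return timedelta(seconds=rng.uniform(0, ceiling))


for attempt in (1, 2, 3, 10, 31):
    print(attempt, next_delay(attempt))
```

With cron-driven jobs, the returned delay would typically be stored as a "do not retry before" timestamp on the row, so each run only picks up items whose timestamp has passed.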

Overall, your design looks solid, but incorporating these suggestions will further enhance its reliability and resilience when dealing with an unreliable third-party API.

glacial walrus
#

ChatGPT did well, I was also going to suggest exponential backoff and some of those other points

azure skiff
glacial walrus
#

the way it works is, let's say you wait 10 minutes before trying again; if it fails, you wait 20 mins, then 40 mins, then 80 mins, etc.

#

since the api can come back whenever
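
To see how that doubling plays out over a multi-day window, here is a small sketch. The function name `doubling_schedule` is made up for illustration; the 10 minute base comes from the example above and the 3 day window from the original question:

```python
from datetime import timedelta
from typing import List


def doubling_schedule(base: timedelta, window: timedelta) -> List[timedelta]:
    """Delays base, 2*base, 4*base, ... whose running total fits in `window`."""
    delays: List[timedelta] = []
    elapsed = timedelta(0)
    delay = base
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay *= 2  # each wait is twice the previous one
    return delays


schedule = doubling_schedule(timedelta(minutes=10), timedelta(days=3))
print([int(d.total_seconds() // 60) for d in schedule])
# [10, 20, 40, 80, 160, 320, 640, 1280]
```

So with a 10 minute start, only 8 retries fit into 3 days, and the last wait is 1280 minutes (a bit over 21 hours). Because the API can come back at any moment, capping the delay at something like a few hours keeps the gap between "API is back" and "we noticed" from growing that large.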