Good Retry, Bad Retry: An Incident Story (medium.com)

submitted 10 months ago by beejjorgensen@lemmy.sdf.org to c/technology@lemmy.world

I've never run a big system like this, but like the lead character in the story, I always figured exponential backoff would be enough. Turns out there's more.

[-] catloaf@lemm.ee 14 points 10 months ago* (last edited 10 months ago)

tl;dr:

"Each request takes exactly one second to process, and a new request arrives every second"

That's their core issue. They were never able to process requests fast enough, and the moment there was any delay it all came down like a house of cards. If you're already running at 100%, yeah no shit you're going to have problems if anything changes even slightly.
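To make the "already running at 100%" point concrete, here is a minimal sketch of a single-worker FIFO server. It is not the article's simulation: the function name `simulate_backlog`, the one-off stall, and all parameter values are assumptions for illustration, and retries are deliberately left out.

```python
def simulate_backlog(ticks, service_time=1.0, arrival_interval=1.0,
                     stall_at=5, stall_seconds=3.0):
    """Return how long each request waits before a single worker starts it.

    Hypothetical model: requests arrive at a fixed interval, are served in
    FIFO order, and exactly one request (index `stall_at`) takes
    `stall_seconds` longer than usual.
    """
    server_free_at = 0.0
    waits = []
    for i in range(ticks):
        arrival = i * arrival_interval
        start = max(arrival, server_free_at)  # wait if the worker is busy
        cost = service_time + (stall_seconds if i == stall_at else 0.0)
        server_free_at = start + cost
        waits.append(start - arrival)
    return waits


# 100% utilization: the 3-second stall is never worked off.
saturated = simulate_backlog(20)

# 80% utilization (one request every 1.25 s): the backlog drains.
relaxed = simulate_backlog(30, arrival_interval=1.25)
```

In this model the wait at 100% utilization stays at 3 seconds for every request after the stall, while at 80% it shrinks by 0.25 seconds per request until it reaches zero. Adding retries on top would only make the saturated case worse.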

Further, it doesn't seem like retries backed off enough, or maybe should have just given up eventually.
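A rough sketch of what "back off enough and eventually give up" could look like on the client side. The names `call_with_backoff` and `GaveUp` and the default delays are made up for this example; the jitter strategy (a random delay up to an exponentially growing cap) is one common choice, not something the article prescribes.

```python
import random
import time


class GaveUp(Exception):
    """Raised when every attempt has failed (hypothetical name)."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying with capped exponential backoff and jitter."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # out of attempts: stop instead of retrying forever
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # random delay in [0, ceiling)
    raise GaveUp(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting. Note that this still retries per request, so by itself it does not cap the total retry load across many clients.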

The writing style also made it kind of hard to follow. Technical articles work better when they're written as technical writing, not like a children's story.

[-] RubberElectrons@lemmy.world 3 points 10 months ago* (last edited 10 months ago)

Hmm... I'd say that was an obvious example chosen to trigger the situation; the real point was to expose the more subtle problems with feedback loops.

What happens if the server in question was at 80% capacity and hardware faults push it to 100% utilization? Can you reconfigure your services if there's a cascading overload through enough of the system without actually adding to the system load? What do you do about the fact that these loops get ever more powerful and sudden the larger the system?

The author seemed to be suggesting that we carefully consider how to avoid open feedback loops, and build stability in. This article clued me in that stability problems can be born from "industry standard" advice if you don't carefully think about it.

[-] catloaf@lemm.ee 2 points 10 months ago

Ideally, you'd limit your resource utilization to always leave enough of a buffer that your management tools can run. But even if that's not the case, you should also be able to disable incoming traffic so that your servers stop even seeing the requests. Or you can just plain destroy and recreate with a new version.

But none of that addresses the fact that your retrying clients are basically DDoSing you. That can be mitigated by your WAF filtering requests so that only a fraction are passed to the server, as mentioned in the article, but preferably you'd just scale up to handle the load, or fix your clients to retry less frequently so that they don't DDoS you with retries. Even a large number of clients shouldn't be retrying so frequently that it overwhelms your system. Even if you're selling Taylor Swift tickets, where millions of clients are hammering you, you can scale horizontally to at least implement a queue for users so they're not hitting refresh every time they get a blank screen.
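As an illustration of "only a fraction are passed to the server", here is a toy admission filter. It is not how any particular WAF works; the class name `FractionalShedder` and the credit-counter approach are assumptions chosen to keep the example deterministic.

```python
class FractionalShedder:
    """Admit roughly `admit_fraction` of requests and reject the rest.

    Hypothetical sketch: each request adds `admit_fraction` of credit, and
    a request is admitted whenever a full unit of credit has built up.
    """

    def __init__(self, admit_fraction):
        if not 0.0 <= admit_fraction <= 1.0:
            raise ValueError("admit_fraction must be between 0 and 1")
        self.admit_fraction = admit_fraction
        self._credit = 0.0

    def admit(self):
        self._credit += self.admit_fraction
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False


# Let one request in four through while the backend recovers.
shedder = FractionalShedder(0.25)
admitted = sum(shedder.admit() for _ in range(100))
```

A real filter would also have to decide what to tell rejected clients, since a rejection that triggers an immediate retry just moves the problem.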

[-] RubberElectrons@lemmy.world 1 points 10 months ago

All of what you're saying seems correct. I think this is more of a meta discussion on how (in this case) retries, even with exponential backoff, aren't a solution by themselves when you look at the system overall. There are interesting hidden caveats to any common solution, and this is one I personally wasn't aware of.

Practically, adding a timeout budget so that the clients themselves just error out (forcing a manual refresh) sorta accomplishes the same as what you're positing.
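One way to read "timeout budget" is a single overall deadline shared by all attempts of a request, after which the client errors out. A minimal sketch under that assumption follows; `call_within_budget`, `BudgetExhausted`, and the fixed retry delay are illustrative names and choices, not taken from the article.

```python
import time


class BudgetExhausted(Exception):
    """Raised when the overall time budget is spent (hypothetical name)."""


def call_within_budget(operation, budget_seconds, retry_delay=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` until it succeeds or the total budget runs out."""
    deadline = clock() + budget_seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except Exception as exc:
            # Stop if waiting for another attempt would pass the deadline.
            if clock() + retry_delay >= deadline:
                raise BudgetExhausted(
                    f"budget of {budget_seconds}s spent after "
                    f"{attempts} attempts"
                ) from exc
            sleep(retry_delay)
```

The `clock` and `sleep` parameters are injectable so the behaviour can be checked with a fake clock. In a real client the same deadline would normally also bound each individual call, which this sketch leaves out.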

[-] RubberElectrons@lemmy.world 3 points 10 months ago

Very interesting, thanks for this article. It's funny how I notice ever more repetition of phenomena through different branches of engineering; metastable failure caused by feedback loops is possible both in mechanical and electrical engineering. Named differently though, resonance and ringing, respectively.

[-] beejjorgensen@lemmy.sdf.org 3 points 10 months ago

https://archive.ph/H3dIq

[-] saroh@lemmy.world 2 points 10 months ago* (last edited 10 months ago)

"A circuit breaker could prematurely cut off all requests to a service, even if only one shard was failing."

They only circuit-break retries?

If a single node is down, then it should not receive traffic via k8s or whatever you use to route based on liveness probe.

Why does your software need to retry anyway? I prefer not implementing live retries; stuff breaks sometimes. Tasks will retry themselves.

You can circuit break the connection to other services so that you stop contacting them if they are down. Giving them some breathing room.

The Wikipedia implementation looks simple and good enough to me: https://en.m.wikipedia.org/wiki/Circuit_breaker_design_pattern
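For readers who have not seen the pattern, a bare-bones circuit breaker around calls to another service might look like the sketch below. This is not the code from the Wikipedia page; the class and exception names, the threshold of 3 failures, and the 30-second reset are assumptions, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to contact the service."""


class CircuitBreaker:
    """Closed -> open after repeated failures -> one trial call after a pause."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; not contacting service")
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            was_open = self._opened_at is not None
            if was_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the downstream service gets no traffic from this caller at all, which is the "breathing room" described above. Keeping one breaker per shard or per node, rather than one per service, is one way to avoid cutting off healthy shards.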

[-] RagnarokOnline@programming.dev 2 points 10 months ago

Loved this read, thanks for sharing. A good illustration of how chasing an issue with a quick solution can lead to bigger issues.

this post was submitted on 06 Oct 2024

48 points (92.9% liked)
