tl;dr:
Each request takes exactly one second to process, and a new request arrives every second.
That's their core issue. They had zero headroom: the system could only just keep up when everything went perfectly, and the moment there was any delay it all came down like a house of cards. If you're already running at 100%, yeah no shit you're going to have problems if anything changes even slightly.
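To make that concrete, here's a rough sketch of the one-request-per-second setup. It's a toy tick-based model, not anything from the article: the function name, the stall parameters, and the "one tick = one second" simplification are all my own assumptions, and it leaves retries out entirely.

```python
def backlog_over_time(ticks, stall_at, stall_len,
                      service_per_tick=1, arrivals_per_tick=1):
    """Return the queue length at the end of each tick.

    Hypothetical toy model: each tick, `arrivals_per_tick` requests arrive
    and the server completes up to `service_per_tick` of them, except
    during a stall, when it completes none.
    """
    queue = 0
    history = []
    for t in range(ticks):
        queue += arrivals_per_tick
        stalled = stall_at <= t < stall_at + stall_len
        if not stalled:
            queue -= min(queue, service_per_tick)
        history.append(queue)
    return history


# At 100% utilization, a 2-tick stall leaves a backlog that never drains.
print(backlog_over_time(10, stall_at=3, stall_len=2))
# With spare capacity (2 per tick), the same stall is absorbed.
print(backlog_over_time(10, stall_at=3, stall_len=2, service_per_tick=2))
```

With arrivals equal to capacity, the backlog from the stall just sits there forever. Add retries on top of that and it only gets worse.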
Further, it doesn't seem like the retries backed off enough, or maybe they should have just given up eventually.
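For what it's worth, this is roughly what I mean by backing off and eventually giving up. It's a minimal sketch of capped exponential backoff with random jitter, not what the article's system does; `call_with_backoff`, its parameters, and the default values are all made up for illustration.

```python
import random
import time


def call_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call `op` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: the wait before each retry is a random value
    between 0 and a cap that doubles per attempt (bounded by `max_delay`).
    After the last failed attempt the exception is re-raised, i.e. we give up.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: stop adding load and surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The jitter spreads retries out so clients don't all come back at the same instant, and the attempt limit means a struggling server eventually stops getting hammered.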
The writing style also made it kind of hard to follow. Technical articles work better when they're written as technical writing, not like a children's story.