It is definitely an under-provisioning problem. But that under-provisioning is caused by the customers usually being very, very stingy about what they are willing to spend. Also, to be clear, it isn't buckling. It is doing exactly the thing it was designed to do, which is to stop writes to the DB once there is no disk space left. And well before that point, it is constantly throwing warnings to the end user. These customers usually ignore those warnings until they reach the stop-writes state.
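For what it's worth, the warn-then-stop behavior described above boils down to a watermark check on disk usage. A minimal sketch, with made-up threshold numbers (our product's actual thresholds differ):

```python
def disk_state(used_fraction, warn=0.80, stop=0.95):
    """Classify disk pressure the way a warn-then-stop-writes
    mechanism does. Thresholds here are illustrative only."""
    if used_fraction >= stop:
        return "stop-writes"   # refuse writes to protect the database
    if used_fraction >= warn:
        return "warn"          # nag the customer while there's still headroom
    return "ok"
```

The point is that "stop-writes" isn't a crash; it's the last rung of an escalation ladder the customer has been climbing past warnings the whole time.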
In fact, we just had to give an RCA to the c-suite detailing why we had not scaled a customer when we should have, but we have a paper trail of them refusing the pricing and refusing to engage.
We get the same errors, and we usually reach out via email to each of these customers to help project where their data is heading and scale appropriately. More frequently, though, they are adding data at such a fast clip that not responding for two hours would land them directly in the stop-writes state.
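The projection we do by hand is essentially linear extrapolation of disk growth. A rough sketch, assuming steady ingest (an assumption; real workloads are much burstier, which is exactly why the two-hour window bites):

```python
def hours_until_stop_writes(used_gb, capacity_gb, growth_gb_per_hour,
                            stop_fraction=0.95):
    """Estimate hours until the DB hits its stop-writes threshold,
    assuming linear growth. stop_fraction is illustrative."""
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing; no deadline
    headroom_gb = capacity_gb * stop_fraction - used_gb
    return max(headroom_gb / growth_gb_per_hour, 0.0)

# e.g. 850 GB used of 1 TB, ingesting 25 GB/hour:
# 100 GB of headroom left, so roughly 4 hours to act.
```

With numbers like that, an unanswered email is the difference between a planned scale-up and an outage ticket.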
This has led us to guessing where our customers are going to end up, oftentimes being completely wrong and eating the cost of scaling multiple times.
Workload spikes are the entire reason our database technology exists. That's the main thing we market ourselves as being able to handle (provided you gave the DB enough disk and the workload isn't sustained long enough to fill the disks).
There is definitely an automation problem. Unfortunately, this particular line of our managed services cannot be automated. We work with special customers with special requirements, usually Fortune 100 companies that have extensive change-control processes, custom security implementations, and sometimes no access to their environment at all unless they flip a switch.
To me it just seems to all go back to management/c-suite trying to sell a fantasy version of our product and setting us up for failure.
Admin of tucson.social here - when UM signed up at tucson.social, he made some crucial mistakes that made him easy to identify as a bot. Unfortunately, since this affects my security posture, I'm not keen on publicly posting what they were, as he still makes the same mistakes.
However, let me add this - there are multiple steps in a registration flow that we should be validating come from the same IP. All too many bot farms centralize certain aspects of their operations and use the same IP every time for only certain parts of a given flow.
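As a concrete illustration, the check amounts to comparing the client IP recorded at each step of the flow. The step names below are hypothetical, and a mismatch should be treated as one signal among many rather than an automatic block, since legitimate users on mobile networks change IPs mid-flow too:

```python
def flow_ip_consistent(step_ips):
    """Return True when every recorded step of a registration flow
    came from the same client IP.

    step_ips maps step name -> observed IP string. A bot farm that
    centralizes, say, email verification will show a different IP
    for that one step than for the rest of the flow."""
    return len(set(step_ips.values())) == 1
```

A mismatch localized to the same step across many registrations is the interesting pattern, because it points at a centralized piece of the farm's pipeline.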
I'll also add that many admins are either stupid about site security, or actively complicit in the bot problem.