this post was submitted on 01 Nov 2023
I think the issue is that customers can escalate directly to SRE.
SRE is supposed to work on the health and reliability of the service. It does sound like there is a reliability issue when loading large datasets. But this should be project work, not incident response work.
Is your service violating your internal SLOs when this happens?
Where I work, customers escalate to a support team, which tries to work with them. It's only after the support team decides it's a product issue that it makes it to SRE. Even then, 90% of the time, the support staff will file a ticket to be handled during business hours rather than page SRE.
If this autoscaling delay is expected, I'd try to do two things:
- Produce better error messages, so that the customer can know what's happening and hopefully not need to escalate.
- Work with the rest of the company (typically the Product or Support teams if you have them) to make sure customers understand these limitations.
Edit: Oh, also, don't let customers page you for known limitations. Design a better process around this.
And if it's that bad, SRE should invest in project work to make the autoscaling less painful.
Edit: Your service should return some kind of client error (i.e. exempt from SLO) in this situation. In gRPC, that would probably be RESOURCE_EXHAUSTED, and the error message should be something like "Yo your DB is out of disk, chill out while we fetch more disk. To avoid these errors in the future, pre-scale before large writes."
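To make that concrete, here's a rough sketch of the idea in plain Python, not real gRPC code. The `Code` enum only mirrors the gRPC status codes of the same names, and `RpcError`, `check_disk`, `availability`, and the `SLO_EXEMPT` set are all hypothetical names I made up for illustration. In an actual Python gRPC servicer you'd probably call `context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED, message)` instead of raising your own exception. Treating RESOURCE_EXHAUSTED as exempt is an assumption about how your SLO is defined, so check that against your own policy.

```python
from enum import Enum


class Code(Enum):
    # Stand-ins for the gRPC status codes of the same names and numbers.
    OK = 0
    RESOURCE_EXHAUSTED = 8
    INTERNAL = 13
    UNAVAILABLE = 14


# Assumption: codes that signal a known, documented limitation are treated
# as client errors and left out of the SLO calculation.
SLO_EXEMPT = {Code.RESOURCE_EXHAUSTED}


class RpcError(Exception):
    """Hypothetical error type carrying a status code and a message."""

    def __init__(self, code: Code, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


def check_disk(free_bytes: int, write_bytes: int) -> None:
    """Reject a write that does not fit, with an actionable message."""
    if write_bytes > free_bytes:
        raise RpcError(
            Code.RESOURCE_EXHAUSTED,
            "Database is out of disk; more is being provisioned. "
            "Retry later, and pre-scale before large writes to avoid this.",
        )


def availability(codes: list[Code]) -> float:
    """Fraction of successful requests, ignoring SLO-exempt codes."""
    counted = [c for c in codes if c not in SLO_EXEMPT]
    if not counted:
        return 1.0  # nothing counted, so nothing violated
    return sum(c is Code.OK for c in counted) / len(counted)
```

The point of splitting it this way is that the disk-full case tells the customer what to do next, and it stops dragging down the number that decides whether someone gets paged.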