submitted 2 months ago* (last edited 2 months ago) by ptz@dubvee.org to c/programmer_humor@programming.dev

Github seems to be down.

Edit: After I made this, their status page finally updated to indicate an issue.

Update - We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.

[-] thesmokingman@programming.dev 72 points 2 months ago

This is a common problem. Same thing happens with AWS outages too. Business people get to manually flip the switches here. It’s completely divorced from proper monitoring. An internal alert triggers, engineers start looking at it, and only when someone approves publishing the outage does it actually appear on the status page. Outages for places like GitHub and AWS are tied to SLAs that are tied to payouts or discounts for huge customers so there’s an immense incentive to not declare an outage even though everything is on fire. I have yelled at AWS, GitHub, Azure, and a few smaller vendors for this exact bullshit. One time we had a Textract outage for over six hours before AWS finally decided to declare one. We were fucking screaming at our TAM by the end because no one in our collective networks could use it but they refused to declare an outage.
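The SLA incentive is easy to see with back-of-the-envelope numbers. A rough sketch (the tiers and figures below are hypothetical, loosely modeled on typical cloud SLAs, not any provider's actual terms):

```python
# Hypothetical SLA credit tiers (not any real provider's terms).
CREDIT_TIERS = [
    (99.99, 0),    # >= 99.99% uptime: no credit owed
    (99.0, 10),    # 99.0% - 99.99%: 10% service credit
    (95.0, 25),    # 95.0% - 99.0%: 25% service credit
    (0.0, 100),    # below 95.0%: 100% service credit
]

def credit_percent(declared_downtime_minutes: float,
                   minutes_in_month: float = 30 * 24 * 60) -> int:
    """Return the service-credit percentage owed for a month, based only
    on *declared* downtime -- an undeclared outage costs nothing."""
    uptime = 100 * (1 - declared_downtime_minutes / minutes_in_month)
    for threshold, credit in CREDIT_TIERS:
        if uptime >= threshold:
            return credit
    return 100

# The same six-hour outage, declared vs. not declared:
print(credit_percent(0))       # 0  -- outage never declared, no credit owed
print(credit_percent(6 * 60))  # 10 -- six declared hours => ~99.17% uptime
```

Since credits only accrue against *declared* downtime, every minute the status page stays green is money not owed.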

[-] Lemminary@lemmy.world 27 points 2 months ago

It's manual?? Holy shit, that explains some previous hair pulling.

[-] thesmokingman@programming.dev 20 points 2 months ago

To be clear, usually there’s an approval gate. Something is generated automatically but a product or business person has to actually approve the alert going out. Behind the scenes everyone internal knows shit is on fire (unless they have shitty monitoring, metrics, and alerting which is true for a lot of places but not major cloud or SaaS providers).

[-] RegalPotoo@lemmy.world 14 points 2 months ago

Or, alternatively, comms management is important and formally declaring an incident is an important part of outage response - going from "hey Bob, something isn't looking right, can you check when you get a sec" to "ok, shit's broken, everyone put down what you're working on and help with this. Jim is in charge of coordinating the technical people so we don't make things worse, and should feed updates to Mike, who is going to handle comms to non-technical internal people and to externals" takes management input.

[-] azertyfun@sh.itjust.works 19 points 2 months ago

You're describing proper incident response but I fail to see what that has to do with the status page. They have core metrics that they could display on that status page without a human being involved.

IMO a customer-friendly status page would automatically display elevated error rates as "suspected outage" or whatever. Then management can add more detail and/or say "confirmed outage". In fact that's how the reddit status page works (or at least used to work), it even shows little graphs with error rates and processing backlogs.

There are reasons why these automated systems don't exist, but none of these reasons align with user interests.

[-] thesmokingman@programming.dev 8 points 2 months ago

Speaking from 10+ YoE developing metrics, dashboards, uptime, all that shit and another 5+ on top of that at an exec level managing all that, this is bullshit. There is a disconnect between the automated systems that tell us something is down and the people that want to tell the outside world something is down. If you are a small company, there’s a decent chance you’ve launched your product without proper alerting and monitoring so you have to manually manage outages. If you are GitHub or AWS size, you know exactly when shit hits the fan because you have contracts that depend on that and you’re going to need some justification for downtime. Assuming a healthy environment, you’re doing a blameless postmortem but you’ve done millions of those at that scale and part of resolving them is ensuring you know before it happens again. Internally you know when there is an outage; exposing that externally is always about making yourself look good not customer experience.

What you’re describing is the incident management process. That also doesn’t require management input because you’re not going to wait for some fucking suit to respond to a Slack message. Your alarms have severities that give you agency. Again, small businesses sure you might not, but at large scale, especially with anyone holding anything like a SOC2, you have procedures in place and you’re stopping the bleeding. You will have some level of leadership that steps in and translates what the individual contributors are doing to business speak; that doesn’t prevent you from telling your customers shit is fucked up.

The only time a company actually needs to properly evaluate what’s going on before announcing is a security incident. There’s a huge difference between “my honeypot blew up” and “the database in this region is fucked so customers can’t write anything to it; they probably can’t use our product.” My honeypot blowing up might be an indication I’m fucked or that the attackers blew up the honeypot instead of anything else. Can’t send traffic to a region? Literally no reason the customer would be able to so why am I not telling them?

I read your response as either someone who knows nothing about the field or someone on the business side who doesn’t actually understand how single panes of glass work. If that’s not the case, I apologize. This is a huge pet peeve for basically anyone in the SRE/DevOps space who consumes these shitty status pages.

[-] kautau@lemmy.world 12 points 2 months ago

Yeah, my second stop if a status page is green is always https://downdetector.com/ since it's user-generated.

[-] Midnitte@beehaw.org 22 points 2 months ago

So... you crashed github? 😋

[-] SanicHegehog@lemm.ee 8 points 2 months ago

Before I die I want to be called out by name in a GitHub RCA.

[-] ptz@dubvee.org 8 points 2 months ago
[-] Hemi03@lemmy.blahaj.zone 15 points 2 months ago

It's all red, how did they fuck up that badly?

[-] CountVon@sh.itjust.works 25 points 2 months ago

From https://www.githubstatus.com/ (emphasis mine):

We suspect the impact is due to a database infrastructure related change that we are working on rolling back.

If you fuck up the database, you fuck up errythang.

[-] LostXOR@fedia.io 17 points 2 months ago

Some intern is having a bad day right now.

[-] Hemi03@lemmy.blahaj.zone 13 points 2 months ago

he pressed the big red selfdestrukt button.

[-] ptz@dubvee.org 10 points 2 months ago

Yeah, everything's dropping. I first noticed it when my push failed.

[-] zzx@lemmy.world 9 points 2 months ago

I'm always on-call whenever shit blows up... Nothing I can do; GitHub is literally hard down.

[-] 30p87@feddit.org 1 points 2 months ago

https://git.30p87.de with 90% uptime (I reboot the server 10% of the time due to bleeding-edge arch testing kernel updates)

[-] ptz@dubvee.org 5 points 2 months ago

Oh, I run my own Gogs server internally, but for collaboration, we all use Github.

[-] BlueKey@fedia.io 3 points 2 months ago

Funny how I just hit a downtime when trying to explore your GL.

[-] 30p87@feddit.org 1 points 2 months ago

Then you hit the 10% of the time where I reboot for kernel updates lmao

this post was submitted on 14 Aug 2024
256 points (98.9% liked)