Cloudflare outage post mortem (blog.cloudflare.com)

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
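
One common defence against the failure described above is to validate a generated file before swapping it in, and to keep the last known-good version when validation fails. The following is a minimal sketch of that idea only. `MAX_FEATURES`, `load_features`, and the sample entries are hypothetical names and values chosen for illustration; this is not Cloudflare's code or its actual limit.

```python
# Hypothetical sketch: names and the limit are illustrative assumptions.
MAX_FEATURES = 4  # assumed cap on the number of feature entries


def load_features(lines, last_good):
    """Return the new feature list, or last_good if the file fails validation."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > MAX_FEATURES:
        # Oversized file: reject it and keep running on the previous version.
        return last_good
    if len(set(features)) != len(features):
        # Duplicate entries suggest a bad upstream query; reject as well.
        return last_good
    return features


last_good = ["f1", "f2", "f3"]
doubled = ["f1", "f2", "f3", "f1", "f2", "f3"]  # same entries emitted twice
active = load_features(doubled, last_good)
print(active)
```

Here the doubled file exceeds the assumed cap, so the loader keeps serving with `last_good` instead of failing.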

top 13 comments
[-] echodot@feddit.uk 2 points 4 months ago

So I work in the IT department of a pretty large company. One of the things that we do on a regular basis is staged updates, so we'll get a small number of computers and we'll update the software on them to the latest version or whatever. Then we leave it for about a week, and if the world doesn't end we update the software onto the next group and then the next and then the next until everything is upgraded. We don't just slap it onto production infrastructure and then go to the pub.

But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.

[-] floquant@lemmy.dbzer0.com 1 points 4 months ago

Their motivation is that the file has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can't just let it run for a week.

[-] echodot@feddit.uk 1 points 4 months ago

There are technical solutions to this. You update half your servers, and then if they die you just disconnect them from the network while you fix them and then have your own unaffected servers take up the load. Now yes, this doesn't get a fix out quickly, but if your update kills your entire system, you're not going to get the fix out quickly anyway.
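
The half-and-half rollout described above could be sketched roughly as follows. This is an illustration under stated assumptions: `staged_rollout`, `apply_update`, and `is_healthy` are hypothetical names, and real health checks and traffic draining are far more involved.

```python
def staged_rollout(servers, apply_update, is_healthy):
    """Update the first half; touch the second half only if the first stays healthy."""
    half = len(servers) // 2
    first, second = servers[:half], servers[half:]
    for server in first:
        apply_update(server)
    failed = [s for s in first if not is_healthy(s)]
    if failed:
        # Pull the failed servers out of rotation; the untouched half carries the load.
        healthy_first = [s for s in first if s not in failed]
        return {"serving": second + healthy_first, "pulled": failed}
    for server in second:
        apply_update(server)
    return {"serving": list(servers), "pulled": []}


# Simulate an update that breaks every server it touches.
updated = set()
result = staged_rollout(["a", "b", "c", "d"], updated.add, lambda s: s not in updated)
print(result)
```

With the simulated bad update, only the first half is touched and pulled, and the second half keeps serving on the old version.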

[-] JcbAzPx@lemmy.world 1 points 4 months ago

This is just the beginning of the coming vibe code apocalypse.

[-] melsaskca@lemmy.ca 1 points 4 months ago

We are going to see a lot more of this type of bullshit now that there are no standards anymore. Fuck everything else and make that money people!

[-] mech@feddit.org 1 points 4 months ago

A permissions change in one database can bring down half the Internet now.

[-] dan@upvote.au 0 points 4 months ago

When are people going to realise that routing a huge chunk of the internet through one private company is a bad idea? The entire point of the internet is that it's a decentralized network of networks.

[-] echodot@feddit.uk 1 points 4 months ago

I hate it, but there really isn't much in the way of an alternative. Which is why they're dominant: they're the only game in town.

[-] edgemaster72@lemmy.world 0 points 4 months ago
[-] Whimsical418@aussie.zone 1 points 4 months ago

Wasn’t it CrowdStrike? Close enough though

[-] MonkderVierte@lemmy.zip 0 points 4 months ago* (last edited 4 months ago)

Meaning, internal error, like the other two prior.

Almost like one big provider with 99.9999% availability is worse than 10 with maybe 99.9%

[-] Jason2357@lemmy.ca 1 points 4 months ago

Except, if you choose the wrong one of those 10 and your company is the only one down for a day, you get fire-bombed. If "TEH INTERNETS ARE DOWN" and your website is down for a day, no one even calls you.

[-] jj4211@lemmy.world 1 points 4 months ago

Note that this outage by itself, based on their chart, was kicking out errors over the span of about 8 hours. This one outage would have almost entirely blown their downtime allowance under 99.9% availability criteria.

If one big provider actually provided 99.9999%, that would be 30 seconds of all outages over a typical year. Not even long enough for people to generally be sure there was an 'outage' as a user. That wouldn't be bad at all.
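
The arithmetic behind those two figures can be checked with a short calculation. This sketch assumes a 365-day year and treats availability as a simple fraction of total time; `downtime_budget_seconds` is a name made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000, ignoring leap years


def downtime_budget_seconds(availability_percent):
    """Allowed downtime per year, in seconds, for a given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


three_nines_hours = downtime_budget_seconds(99.9) / 3600
six_nines_seconds = downtime_budget_seconds(99.9999)
print(f"99.9%:    {three_nines_hours:.2f} hours/year")
print(f"99.9999%: {six_nines_seconds:.1f} seconds/year")
```

Under these assumptions, 99.9% allows about 8.76 hours of downtime a year, so a roughly 8-hour outage uses nearly all of it, while 99.9999% allows about 31.5 seconds.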

this post was submitted on 19 Nov 2025
9 points (100.0% liked)