Cloudflare outage post mortem
(blog.cloudflare.com)
So I work in the IT department of a pretty large company. One of the things that we do on a regular basis is staged updates, so we'll get a small number of computers and we'll update the software on them to the latest version or whatever. Then we leave it for about a week, and if the world doesn't end we update the software onto the next group and then the next and then the next until everything is upgraded. We don't just slap it onto production infrastructure and then go to the pub.
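For anyone who wants that process spelled out, here is a minimal sketch of a staged rollout. It is only an illustration, not any particular company's tooling: `staged_rollout`, `apply_update`, `is_healthy`, and `soak` are all hypothetical names, and the "leave it for about a week" step is represented by whatever the caller passes as `soak`.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak: Callable[[], None],
) -> List[Sequence[str]]:
    """Update one group ("ring") of machines at a time.

    All four parameters are hypothetical hooks supplied by the caller.
    Returns the rings that were updated and stayed healthy; the rollout
    halts at the first ring that contains an unhealthy machine.
    """
    completed: List[Sequence[str]] = []
    for ring in rings:
        for host in ring:
            apply_update(host)
        soak()  # e.g. wait a week before judging the result
        if not all(is_healthy(host) for host in ring):
            break  # the world ended for this ring: do not touch later rings
        completed.append(ring)
    return completed
```

The point of the structure is that a bad update only ever reaches the ring currently being tested; later rings are never touched once a health check fails.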
But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
Their motivation is that that file has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can't just let it run for a week.
There are technical solutions to this. You update half your servers, and if they die you disconnect them from the network while you fix them and let your other, unaffected servers take up the load. Yes, this doesn't get a fix out as quickly, but if your update kills your entire system, you're not going to get the fix out quickly anyway.
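A rough sketch of that idea, under stated assumptions: `update_half`, `apply_update`, and `is_healthy` are hypothetical names, and "disconnect from the network" is modelled simply as leaving a server out of the in-service list. This is not how Cloudflare or any specific load balancer does it.

```python
from typing import Callable, List, Sequence, Tuple


def update_half(
    servers: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str], bool]:
    """Update the first half of the fleet and pull out anything that breaks.

    Returns (in_service, broken, proceed):
      in_service: servers that should keep taking traffic
      broken:     updated servers that failed the health check
      proceed:    True only if it looks safe to update the second half
    """
    half = len(servers) // 2
    first_half = list(servers[:half])

    for server in first_half:
        apply_update(server)

    broken = [s for s in first_half if not is_healthy(s)]
    # Untouched servers (and healthy updated ones) carry the load.
    in_service = [s for s in servers if s not in broken]
    return in_service, broken, not broken
```

If `proceed` comes back `False`, the second half is still running the old version and still serving traffic, which is the whole argument: the fix is slower, but the fleet is not down.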