Cloudflare outage post mortem

homura1650@lemmy.world · 5 hours ago

Echo Dot@feddit.uk · 3 hours ago

So I work in the IT department of a pretty large company. One of the things we do on a regular basis is staged updates: we take a small number of computers and update the software on them to the latest version. Then we leave it for about a week, and if the world doesn’t end we roll the update out to the next group, then the next, and so on until everything is upgraded. We don’t just slap it onto production infrastructure and then go to the pub.

But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
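For anyone who wants the staged-update idea spelled out, here is a minimal sketch in Python. It is only an illustration of the pattern described in the comment above, not anything Cloudflare or any particular IT department runs. The names `staged_rollout`, `update`, `healthy` and `soak` are all hypothetical, and the host names in the usage example are made up.

```python
from typing import Callable, Sequence


def staged_rollout(
    groups: Sequence[Sequence[str]],
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
    soak: Callable[[], None],
) -> list[str]:
    """Update one group at a time and stop at the first unhealthy group.

    Returns the hosts whose group passed its health check before any halt.
    All four parameters are hypothetical hooks supplied by the caller.
    """
    done: list[str] = []
    for group in groups:
        for host in group:
            update(host)
        soak()  # waiting period: a week in the comment above, but it could be shorter
        if not all(healthy(host) for host in group):
            return done  # halt: later groups never receive the update
        done.extend(group)
    return done


# Usage with stand-in hooks: the second group contains a host that fails.
groups = [["pilot-1"], ["office-1", "office-2"], ["prod-1", "prod-2"]]
bad_hosts = {"office-2"}
result = staged_rollout(
    groups,
    update=lambda host: None,
    healthy=lambda host: host not in bad_hosts,
    soak=lambda: None,
)
print(result)  # ['pilot-1']
```

The point of the design is that a bad update only ever reaches the group it broke; the production group at the end of the list is never touched.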

floquant@lemmy.dbzer0.com · 1 hour ago

Their motivation is that that file has to change rapidly to respond to threats. If a new botnet pops up and starts generating a lot of malicious traffic, they can’t just let it run for a week.

unexposedhazard@discuss.tchncs.de · edit-2 7 minutes ago

How about an hour? 10 minutes? Would have prevented this. I very much doubt that their service is so unstable and flimsy that they need to respond to stuff on such short notice. It would be worthless to their customers if that were true.

Restarting and running some automated tests on a server should not take more than 5 minutes.
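A rough sketch of what such a short gate could look like, in Python. This is an assumption-laden illustration: `canary_gate`, `apply_to_canary` and the checks are hypothetical names, the 300-second default just mirrors the 5 minutes mentioned above, and nothing here reflects how Cloudflare actually deploys.

```python
import time
from typing import Callable, Sequence


def canary_gate(
    apply_to_canary: Callable[[], None],
    checks: Sequence[Callable[[], bool]],
    budget_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Apply a change to one canary server and run automated checks on it.

    Returns True only if every check passes within the time budget.
    A failed check or an overrun both block the wider rollout.
    """
    start = clock()
    apply_to_canary()
    for check in checks:
        if clock() - start > budget_seconds:
            return False  # took too long: treat as a failure
        if not check():
            return False  # canary is unhealthy: do not roll out further
    return True


# Usage with stand-in checks (hypothetical): one passes, one fails.
ok = canary_gate(lambda: None, [lambda: True, lambda: True])
blocked = canary_gate(lambda: None, [lambda: True, lambda: False])
print(ok, blocked)  # True False
```

The `clock` parameter is only there so the time budget can be tested without actually waiting.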

codemankey@programming.dev · 2 hours ago

My assumption is that the pattern you describe is doable at certain scales and with certain combinations of technologies. But doing this across a distributed system with as many nodes, and as many different kinds of nodes, as Cloudflare has, while still having a system that can be updated quickly (responding to DDoS attacks, for example), is a lot harder.

If you really feel like you have a better solution, please contact them and consult for them. The internet would thank you for it.

Cloudflare outage on November 18, 2025