Are there really a billion systems in the world that run Crowdstrike? That seems implausible. Is it just hyperbole?
Probably includes a bunch of virtual machines.
Yeah, our VMs completely died at work. Had to set up temporary stuff on hardware we had lying around today. Was kinda fun, but stressful haha.
Could you just revert VMs to a snapshot before the update? Or do you not take periodic snapshots? You could probably also mount the VM's drive on the host and delete the relevant file that way.
Yes, you can just go into safe mode on an affected machine and delete the offending file. The problem is that it took a couple of hours before that resolution was found, and it has to be done by hand on every VM. I can’t just run an Ansible playbook against hundreds of non-booted VMs. Then, in the case of servers, there might be a specific startup order: certain things might have to be started before other things, and further fixing might be required given that every VM hard crashed. At minimum it took many companies 6-12 hours to get back up and running, and for many more it could take days.
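For anyone curious what the "mount the drive and delete the file" route looks like, here is a rough Python sketch. It assumes the guest's Windows volume is already mounted somewhere on the host, and it assumes the directory and file pattern from CrowdStrike's public workaround guidance (`Windows\System32\drivers\CrowdStrike`, files matching `C-00000291*.sys`). The function name `remove_channel_files` and the `dry_run` flag are made up for illustration. It does nothing about BitLocker, snapshots, or startup order, so treat it as a sketch rather than a remediation tool.

```python
from pathlib import Path

# Assumption: location and pattern taken from CrowdStrike's public workaround
# guidance, expressed relative to the root of the mounted Windows volume.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_channel_files(mounted_root, dry_run=True):
    """Find (and optionally delete) matching files under a mounted volume.

    Returns the list of matching paths. With dry_run=True nothing is deleted.
    """
    target_dir = Path(mounted_root) / CROWDSTRIKE_DIR
    if not target_dir.is_dir():
        # Volume not mounted here, or the sensor is not installed on it.
        return []
    matches = sorted(target_dir.glob(PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Running it with `dry_run=True` first shows what would be removed. The per-VM mounting and the reboot afterwards are still the slow, manual part.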
Makes sense - thanks for the details.
I doubt it's too much of a stretch, since even here in Australia, we've had multiple airlines, news stations, banks, supermarkets and many others, including the aluminium extrusion business my father works at, all go down. Scale this to hundreds of countries with populations tenfold of ours, and it puts it into perspective that there may even be more than a billion machines affected.
Despite how it may seem on Lemmy, most people have not yet actually switched to Linux. This stat is legit.
I know that Windows is everywhere, I just don't know the percentage of Windows computers that run Crowdstrike.
Keep in mind, it's not just clients, but servers too. A friend of mine works for a decently sized company that has about 1600 (virtual) servers internationally. And yes, all of them were affected.
Yes
Sounds pretty plausible to me. An organization doesn’t have to be very big to get into the hundreds or thousands of devices on a network when you account for servers and VMs.
A company with 40 employees all accessing an RDS server using a company laptop is looking at 85+ devices already.
Whoda thunk automatic updates to critical infrastructure was a good idea? Just hope healthcare life support was not affected.
Many compliance frameworks require security utilities to receive automatic updates. It's pretty essential for effective endpoint protection considering how fast new threats spread.
The problem is not the automated update, it's why it wasn't caught in testing and how the update managed to break the entire OS.
It is pretty easy to imagine separate streams of updates that affect each other negatively.
CrowdStrike does its own 0-day updates, Microsoft does its own 0-day updates. There is probably limited if any testing at that critical intersection.
If Microsoft 100% controlled the release stream, otoh, there'd be a much better chance to have caught it. The responsibility would probably lie with MS in such a case.
(edit: not saying that this is what happened, hence the conditionals)
I don't think that is what happened in this situation though. I think the issue was caused exclusively by a CrowdStrike update, but I haven't read anything official that really breaks this down.
Some comments yesterday were claiming the offending file was several KB of just 0s. All signs are pointing to a massive fuckup from an individual company.
Which makes me wonder, did the company even test it at all on their own machines first?
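If the "file full of zeros" claim is accurate (I haven't seen it confirmed), even a trivial sanity check before shipping would flag it. A minimal Python sketch of that kind of check is below. The name `looks_blank` is made up, and this is obviously not a description of CrowdStrike's actual release pipeline.

```python
def looks_blank(path, chunk_size=64 * 1024):
    """Return True if the file is empty or contains only zero bytes."""
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        while chunk := f.read(chunk_size):
            if chunk.count(0) != len(chunk):
                return False  # found at least one non-zero byte
    return True
```

A release script could refuse to publish any file for which this returns True. It would not catch a malformed but non-zero file, which is why real testing on real machines still matters.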
Hospital stuff was affected. Most engineers are smart enough to not connect critical equipment to the Internet, though.
I’m not in the US, but my other medical peers who are mentioned that Epic (the software most hospitals use to manage patient records) was not affected, but Dragon (the software by Nuance that we doctors use for dictation so we don’t have to type notes) was down. Someone I know complained that they had to “type notes like a medieval peasant.” But I’m glad that the critical infrastructure was up and running. At my former hospital, we used to always maintain physical records simultaneously for all our current inpatients that only the medical team responsible for those specific patients had access to, just to be on the safe side.
That's actually a very smart idea, keeping physical records of every inpatient. Wonder why the AI companies don't do transcription of medical notes, instead of trying to add AI features to my washer/dryer combo. Just seems like a very practical use of the tech.
This is pretty much correct. I work in an Epic shop and we had about 150 servers to remediate and some number of workstations (I’m not sure how many). While Epic may not have been impacted, it is a highly integrated system, and when things are failing around it, that can have an impact on care delivery. For example, if a provider places a stat lab order in Epic, that lab order gets transmitted to an integration middleware which then routes it to the lab system. If the integration middleware or the lab system is down, then the provider has no idea the stat order went into a black hole.
I’m an Epic analyst - while Epic was fine, many of our third party integrations shit the bed. Cardiology (where I work) was mostly unaffected aside from Omnicell being down, but the laboratory was massively fucked due to all the integrations they have. Multiple teams were quite busy, I just got to talk to them about it eventually.
There is no learning, companies just move to a different antivirus. The new hotness, the cycle repeats over and over until the new antivirus does this same shit. Look at McAfee in 2010; in fact the CEO of CrowdStrike was the CTO of McAfee then. That easily took down millions of Windows XP machines.
Combing over its Wikipedia article, this company already had a series of other issues.
Sucks for anyone who ever relied on them. Oh look at that, they've been acquiring other security startups and companies. Perhaps that should be looked into as well?
There is learning here.
As companies, we put faith in an external entity with goals not identical to our own: a lot of faith, and a lot of control.
That company had the power to destroy our businesses, cripple travel and medicine and our courts, and delay daily work that could include some timely and critical tasks.
This is not CrowdStrike's fault; for the bad code yes, but for the indirect effects of that no. We knew - please tell me we had the brains god gave a gnat and we knew - that putting so much control in the hands of outsiders not concerned with or aware of our detailed needs and priorities was a negligent and foolish thing to do.
The lesson is to do our jobs: we need to ensure we have the ability to make the decisions with which we're entrusted, and that the authority we hold means our decisions, once accepted, are not threatened by a negligent mistake so boneheaded it's all but the whim of a simpleton. We cannot choose to manage our part of our organization effectively, no matter how (un)important that organization or part is, and then share control with a force that we've seen can run roughshod over it.
It's exactly like the leopards eating our face, except people didn't see they were leopards. No one blames the leopards, as they're just conforming to their nature, eventually.
And no one should blame this company for a small mistake, just because we let the jaws get so close to our faces that we became complacent.
Have you never worked in corporate IT or something? Of course we should blame Crowdstrike, that way we don't get a sev 1 on our scorecard.
It's funny that corporate IT will be one of the groups getting the blame in this case, despite it being in most cases not their decision that a company lacks separate test and production environments. The executives who decided that usually get off scot-free.
Hahah, no doubt, while popping in and out of the outage call repeating the phrases "Can I get an update?", "Is there an ETA on recovery?" and "We need to get this back online".
That company had the power to destroy our businesses, cripple travel and medicine and our courts, and delay daily work that could include some timely and critical tasks.
Unless you have the ability and capacity to develop your own ISA/CPU architecture, firmware, OS, and every tool you use from the ground up, you will always be, at some point, “relying on others’ stuff” which can break on you at a moment’s notice.
That could be Intel, or Microsoft, or OpenSSH, or CrowdStrike^0. Very, very, very few organizations can exist in the modern computing world without relying on others’ code/hardware (outside smaller embedded systems, the main two that come to mind that could are IBM and Apple).
I do wish that consumers had held Microsoft more to account over the last few decades to properly use the Intel Protection Rings (if the CrowdStrike driver were able to run in Ring 1, then it’s possible the OS could have isolated it and prevented a BSOD, but instead it runs in Ring 0 with the kernel and has access to damage anything and everything) — but that horse appears to be long out of the gate (enough so that X86S proposes only having Ring 0 and Ring 3 for future processors).
But back to my basic thesis: saying “it’s your fault for relying on other people’s code” is unhelpful and overly reductive, as in the modern day it’s virtually impossible to avoid doing so. Even fully auditing your stacks is prohibitive. There is a good argument to be made about not living in a compute monoculture^1, and lots of good arguments against ever using Windows^2 (especially in the cloud), but those aren’t the arguments you’re making. Saying “this is your fault for relying on other people’s stuff” is unhelpful, and I somehow doubt you designed your own ISA, CPU architecture, firmware, OS, network stack, and application code to post your comment.
——-
^0 — Indeed, all four of these organizations/projects have let us down like this; Intel with Spectre/Meltdown, Microsoft with the 28 day 32-bit Windows reboot bug, and OpenSSH just announced regreSSHion.
^1 — My organization was hit by the Falcon Sensor outage — our app tier layers running on Linux and developer machines running on macOS were unaffected, but our DBMS is still a legacy MS SQL box, so the outage hammered our stack pretty badly. We’ve fortunately been well funded to remove our dependency on MS SQL (and Windows in general), but that’s a multi-year effort that won’t pay off for some time yet.
^2 — my Windows hate is well documented elsewhere.