CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says

@MicroWave · 6 months ago

@[email protected] · 6 months ago

All I know is that I had to personally fix 450 servers myself, and that doesn’t include the workstations that are probably still broken and will need to be fixed on Monday

😮‍💨

@[email protected] · 6 months ago

Is there any automation available for this? Do you fix them sequentially or can you parallelize the process? How long did it take to fix 450?

Real clustermess, but curious what fixing it looks like for the boots on the ground.

@[email protected] · edit-2 6 months ago

Thankfully I had cached credentials and our servers aren’t bitlocker’d. The majority of the servers had iLO consoles, but not all. Most of the servers are on virtual hosts, so once I got the failover cluster back it wasn’t that hard, just working my way through them. But the hardware servers without iLO required physically plugging in a monitor and keyboard to fix, which is time consuming. 10 of them took a couple of hours.

I worked 11+ hours straight. No breaks or lunch. That got our production domain up and the backup system back on. The dev and test domains are probably half working. My boss was responsible for those and he’s not very efficient.

So for the most part I was able to do most of the work from my admin pc in my office.

For the majority of them, I’d use the Windows recovery menu that they were stuck at to make them boot into safe mode with network support (in case my cached credentials weren’t up-to-date). Then start a cmd and type out that famous command

Del c:\windows\system32\drivers\crowdstrike\c-00000291*.sys

I’d auto-complete the folders with tab and type the 5 zeros … Probably gonna have that file in my memory forever
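For anyone who would rather script that step than type it, here is a minimal Python sketch of the same cleanup. It assumes the script runs somewhere that can already see the system volume and has Python available (safe mode or a recovery image, which is an assumption and not something described above). The function names and the `dry_run` flag are hypothetical; only the directory and the file pattern come from the command above.

```python
import fnmatch
from pathlib import Path

# Directory and pattern are taken from the command quoted above.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_FILE_PATTERN = "c-00000291*.sys"


def find_bad_channel_files(driver_dir):
    """Return files whose name matches the pattern, ignoring case like Windows does."""
    return sorted(
        p for p in Path(driver_dir).iterdir()
        if p.is_file() and fnmatch.fnmatchcase(p.name.lower(), BAD_FILE_PATTERN)
    )


def remove_bad_channel_files(driver_dir, dry_run=True):
    """Delete matching files unless dry_run is set; return the list of matches."""
    matches = find_bad_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__" and DRIVER_DIR.is_dir():
    # Dry run first: only report what would be deleted.
    for path in remove_bad_channel_files(DRIVER_DIR, dry_run=True):
        print("would delete", path)
```

The dry run is the default on purpose: on a machine you are fixing by hand, it is worth seeing the match list before anything is removed. This does nothing about the hard part described above, which is getting a console on the machine in the first place.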

Edit: one painful self-inflicted problem was that my password is a 25-character random LastPass-generated one. IDK how I managed it, but I never typed it wrong. Yay for small wins

@[email protected] · 6 months ago

You need to boot into emergency mode and replace a file. Afaik it’s not very automatable.

@Jtee · 6 months ago

Especially if you have bitlocker enabled. Can’t boot to safe mode without entering the key, which typically only IT has access to.

@[email protected] · 6 months ago

You can give the key to the user and force a replacement on the next DC connection, but getting people to enter a 48-digit recovery key over the phone… Not automatable anyway.

@HeyJoe · 6 months ago

Servers would probably be way easier than workstations if you ask me. If they were virtual, just bring up the remote console and you can do it all remotely. Even if they were physical I would hope they have an IP KVM attached to each server so they can also remotely access them as well. 450 sucks but at least they theoretically could have done every one of them without going anywhere.

There are also options to do workstations as well, but almost nobody ever uses those services so those probably need to be touched one by one.

@prashanthvsdvn · 6 months ago

I read this in a passing YouTube comment, but I think it would theoretically be possible to set up an iPXE boot server that serves a Windows PE environment and deploys the fix from there. Then all you have to do on the affected machines is point the boot option at the iPXE server you set up. Not fully sure whether it’s feasible, though.

@LavenderDay3544 · 6 months ago

deleted by creator

@[email protected] · 6 months ago

Because my expertise is Windows and that’s the environment I get paid to administer. We have Linux servers too but they didn’t have any of these problems. BUT they have had their own issues in the past and finding Linux system admins isn’t really as easy as you might expect. Running your own Linux system at home is not the same as running a 175TB CEPH Node

@danc4498 · 6 months ago

I wonder how much this cost people & businesses.

For instance, people’s flights were canceled because of this resulting in them having to stay in hotels overnight. I’m sure there’s many other examples.

@TexasDrunk · 6 months ago

For businesses, a lot of them are hiring IT companies (consultants, MSPs, VARs, and whoever the hell else they can get) at a couple to a few hundred bucks an hour per person to get boots on the ground to fix it. Some of them have everyone below the C levels with any sort of technical background doing entry level work so there’s also lost opportunity cost.

I was in that industry for a long time and still have a lot of colleagues there. There’s a guy I know making almost $200k/yr out there at desks trying to help fix it. He moved into an SRE role years ago so that’s languishing this week while he’s going desk to desk and office to office with support staff and IT contractors.

At least two large companies have an API where they’re paying for a pile of compute and currently have a small fraction of use. Companies are paying to use those APIs but can’t.

I don’t know if there’s a good way to actually figure out how much this is costing because there are so many variables. But you can bet there are a few people at the top funneling that money directly to themselves, never to be seen again.

@danc4498 · 6 months ago

That’s kind of what I was thinking. There are countless ways this costs money. And not an insignificant amount either.

Also, I work in IT and have been on vacation. So sad I am missing all this!

@TexasDrunk · 6 months ago

Something I didn’t think about but has since come to my attention (group chat is getting spicy) is that there are a lot of mid level IT folks on salary who are getting the absolute dog shit worked out of them right now without seeing an extra dime. So the costs are beyond monetary.

@[email protected] · 6 months ago

8.5M worldwide? I was expecting higher numbers, interesting

@ArtVandelay · 6 months ago

Even if 8.5m is correct, with many being servers, the total people affected is much much higher.

NegativeNull · 6 months ago

The downstream effects are likely much much greater. If an auth server/DB server/API server/etc (for example) got taken down, the failure cascades
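A rough way to picture that cascade is a walk over a dependency graph. The sketch below is only an illustration: the service names and the topology are invented, and real systems have retries, caches and failovers that this ignores.

```python
from collections import deque


def downstream_failures(dependents, failed):
    """Return every service that transitively depends on an initially failed one."""
    affected = set(failed)
    queue = deque(failed)
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical topology: each key lists the services that depend on it.
DEPENDENTS = {
    "auth": ["api", "admin-portal"],
    "db": ["api"],
    "api": ["booking-site", "mobile-app"],
}

print(sorted(downstream_failures(DEPENDENTS, ["auth"])))
# ['admin-portal', 'api', 'auth', 'booking-site', 'mobile-app']
```

One crashed auth server counts as a single device in the 8.5M figure, yet in this toy graph it takes four other services down with it.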

@teejay · 6 months ago

The idea that any such servers would be running windows… shudder

@PlutoniumAcid · 6 months ago

In the corpo that I work in, we had about 3000 servers down, plus probably twice as many workstations including laptops of remote workers. Yeah, fun!

mel · 6 months ago

For some of these systems, like medical equipment that should be as secure as possible, I don’t understand why they are not running OpenBSD… And more broadly, most of the world depending on one OS and its environment is only a path to disasters (this one, WannaCry, spying by three-letter agencies…)

@markr · 6 months ago

There are a lot of misunderstandings about what happened. First, the ‘update’ was to a data file used by the CrowdStrike kernel components (specifically ‘Falcon’). While this file has a ‘.sys’ name, it is not a driver; it provides threat definition data. It is read by the Falcon driver(s), not loaded as an executable.

Microsoft doesn’t update this file; CrowdStrike user mode services do that, and they do it very frequently as part of their real-time threat detection and mitigation.

The updates are essential. There is no opportunity for IT to manage or test these updates other than blocking them via external firewalls.

The Falcon kernel components apparently do not protect against a corrupted data file, or the corruption in this case evaded that protection. This is such an obvious vulnerability that I am leaning toward a deliberate manipulation of the data file to exploit a discovered vulnerability in their handling of a malformed data file. I have no evidence for that other than that resilience against malformed data input is very basic software engineering and CrowdStrike is a very sophisticated system.
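To make "resilience against malformed data input" concrete, here is a small Python sketch of the general idea: check a data file's structure before using it, and fall back to the last known good copy if the check fails. The file layout (signature, length, CRC32) and every name in it are invented for illustration; nothing here reflects how CrowdStrike's files or drivers are actually built.

```python
import struct
import zlib

MAGIC = b"DEFS"                  # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sII")  # signature, payload length, CRC32 of payload


class MalformedDefinitions(ValueError):
    """Raised when a definitions blob fails any structural check."""


def pack_definitions(payload: bytes) -> bytes:
    """Build a blob in the hypothetical format (what the publisher would do)."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def parse_definitions(blob: bytes) -> bytes:
    """Validate the header and checksum before handing the payload to anyone."""
    if len(blob) < HEADER.size:
        raise MalformedDefinitions("file shorter than header")
    magic, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise MalformedDefinitions("bad signature")
    if length != len(payload):
        raise MalformedDefinitions("length mismatch")
    if zlib.crc32(payload) != crc:
        raise MalformedDefinitions("checksum mismatch")
    return payload


def load_with_fallback(new_blob: bytes, last_known_good: bytes) -> bytes:
    """Prefer the new file, but keep running on the old one if it is malformed."""
    try:
        return parse_definitions(new_blob)
    except MalformedDefinitions:
        return parse_definitions(last_known_good)
```

A checksum only catches corruption in transit or on disk. A file that is well-formed but semantically wrong would pass these checks, so the consumer still has to bounds-check whatever it reads out of the payload.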

I’m more interested in how the file got corrupted before distribution.

@PlutoniumAcid · 6 months ago

Yeah, how the hell did this failure pass testing, is what I want to know!

@[email protected] · 6 months ago

That’s the neat thing, Crowdstrike bypassed the rigorous testing process to get Kernel software updates signed by Microsoft by having the part that was tested and signed by Microsoft load another update file. Still unclear how Crowdstrike missed it before releasing it though.

This is a pretty good breakdown of what happened by a retired Windows dev, including how software operates between kernel and user zones. The breakdown of what he thinks happened starts at about 6:40.

AutoTL;DR · 6 months ago

This is the best summary I could come up with:

Microsoft says it estimates that 8.5m computers around the world were disabled by the global IT outage. It’s the first time that a number has been put on the incident, which is still causing problems around the world. The glitch came from a cyber security company called CrowdStrike which sent out a corrupted software update to its huge number of customers. Microsoft, which is helping customers recover, said in a blog post: “we currently estimate that CrowdStrike’s update affected 8.5 million Windows devices.”

The post by David Weston, vice-president, enterprise and OS at the firm, says this number is less than 1% of all Windows machines worldwide, but that “the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services”. The company can be very accurate on how many devices were disabled by the outage as it has performance telemetry from many of them via their internet connections. The tech giant - which was keen to point out that this was not an issue with its software - says the incident highlights how important it is for companies such as CrowdStrike to use quality control checks on updates before sending them out. “It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist,” Mr Weston said. The fall out from the IT glitch has been enormous and was already one of the worst cyber-incidents in history. The number given by Microsoft means it is probably the largest ever cyber-event, eclipsing all previous hacks and outages. The closest to this is the WannaCry cyber-attack in 2017 that is estimated to have impacted around 300,000 computers in 150 countries.

There was a similar costly and disruptive attack called NotPetya a month later. There was also a major six-hour outage in 2021 at Meta, which runs Instagram, Facebook and WhatsApp.

But that was largely contained to the social media giant and some linked partners. The massive outage has also prompted warnings by cyber-security experts and agencies around the world about a wave of opportunistic hacking attempts linked to the IT outage. Cyber agencies in the UK and Australia are warning people to be vigilant to fake emails, calls and websites that pretend to be official. And CrowdStrike head George Kurtz encouraged users to make sure they were speaking to official representatives from the company before downloading fixes.

“We know that adversaries and bad actors will try to exploit events like this,” he said in a blog post. Whenever there is a major news event, especially one linked to technology, hackers respond by tweaking their existing methods to take into account the fear and uncertainty. According to researchers at Secureworks, there has already been a sharp rise in CrowdStrike-themed domain registrations – hackers registering new websites made to look official and potentially trick IT managers or members of the public into downloading malicious software or handing over private details. Cyber security agencies around the world have urged IT responders to only use CrowdStrike’s website to source information and help. The advice is mainly for IT managers who are the ones being affected by this as they try to get their organisations back online. But individuals too might be targeted, so experts are warning people to be hyper-vigilant and only act on information from the official CrowdStrike channels.

The original article contains 551 words, the summary contains 552 words. Saved -0%. I’m a bot and I’m open source!

dentoid · 6 months ago

Upvoted just for the tagline “reduced article from 551 to 552 words” 😁 Wacky bot

Resol van Lemmy · 6 months ago

Y2K, delayed 24 years, 7 months, and 19 days.

What worries me even more is that something pretty similar could happen to 32-bit devices in 2038.
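For reference, the 2038 problem comes from storing Unix time (seconds since 1970-01-01 UTC) in a signed 32-bit integer. A quick Python sketch of where that counter runs out and what it wraps to; the `wrap_int32` helper is just an illustration of two's-complement overflow, not any real API.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def wrap_int32(value: int) -> int:
    """Simulate storing a value in a signed 32-bit integer (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


# Last second a signed 32-bit time_t can represent.
last_safe = EPOCH + timedelta(seconds=INT32_MAX)
# One second later the counter wraps to the most negative value.
after_wrap = EPOCH + timedelta(seconds=wrap_int32(INT32_MAX + 1))

print(last_safe)   # 2038-01-19 03:14:07+00:00
print(after_wrap)  # 1901-12-13 20:45:52+00:00
```

Systems that already use a 64-bit time value are not affected by this particular rollover.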

@SeattleRain · 6 months ago

It’s the cyber 9/11 they always worried about.

@[email protected] · 6 months ago

In case you needed another reason to switch to Linux.

Windows is so unreliable that even Microsoft runs Linux internally.

Blaster M · 6 months ago

When this happened to Linux and MacOS users of Crowdstrike some time ago, no one cared.

Rentlar · edit-2 6 months ago

https://forums.rockylinux.org/t/crowdstrike-freezing-rockylinux-after-9-4-upgrade/14041

The bug seems to have only affected certain Linux kernels and versions. Of course no one cared, because it didn’t simultaneously take out hospital systems and airline systems worldwide to an extent that you’d only think you’d see in movies.

Linux has comparative advantages from being so diverse. Since there are so many different update channels it would be hard to pull off such a large outage, intentionally or unintentionally. Yet, if we imagine a totally equivalent scenario of a CrowdStrike update causing kernel panics in most Linux distributions, this is what could be done:

Ubuntu, Redhat, and other organizations who make money from supporting and ensuring reliability of their customers’ systems, would be on the case to find a working configuration, as soon as they find out it’s not an isolated incident or user error.
If one finds a solution, it will likely quickly be shared to other organizations and adapted.
The error logs, and inner workings of the kernel and where it fails are clearly available to admins, customer support personnel and tech nerds, so they aren’t fully at the mercy of the maintainers of the proprietary blobs (both Microsoft and Crowdstrike, for Windows, but only Crowdstrike for Linux) to determine the cause and potential solutions that would be available.
The Linux internet-facing component updates can be rolled back and inspected/installed separately to the Crowdstrike updates. The buggy update to Microsoft Azure and from Crowdstrike happening together on the same day muddied the waters as to what exactly went wrong in the first several hours of the outage.
There’s more flexibility to adjust the behaviour of the kernel itself, even in a scenario where CrowdStrike was dragging its feet. Emergency kernel patches could just be set to ignore panics caused by the faulty configuration files identified, at least as a potential temporary fix.

@[email protected] · edit-2 6 months ago

deleted by creator

@[email protected] · 6 months ago

“Don’t encrypt your drives containing sensitive company data” is a hard sell.

r00ty · 6 months ago

I think there’s a good argument for bitlocker on laptops.

It’s much less of a sell for servers and workstations in what should be secure locations.

Having said that, where I work they just enabled enforced windows hello pin with only numeric pins with minimum 6 digits. Seems like a pretty good way to entirely negate the protection bitlocker provides. But hey ho.

John Richard · 6 months ago

CrowdStrike will ultimately have contract terms that put responsibility on the companies, and truth be told the companies should be able to handle this situation with relative ease. Maybe the discussion here should be on the fragility of Windows and why Linux is a better option.

Avid Amoeba · 6 months ago

Linux could have easily been bricked in a similar fashion by pushing a bad kernel or kernel module update that wasn’t tested enough. Not saying it’s the same as Windows, but this particular scenario where someone can push a system component just like that can fuck up both.

John Richard · 6 months ago

Yes it can, but a kernel update is a completely different scenario, and managed individually by companies as part of their upgrades. It is usually tested and rolled out incrementally.

Furthermore, Linux doesn’t blue screen. I know some scenarios where Linux has issues, but I can count on one finger the amount of times I’ve had an update cause issues booting… and that was because I was using some newer encryption settings as part of systemd.

However, it would take all my fingers & toes, and then some, to count the number of blue screens I’ve gotten with Windows… and I don’t think I’m alone in that regard.

@[email protected] · 6 months ago

Linux doesn’t blue screen, no. A kernel panic is a black screen.

@[email protected] · 6 months ago

And you’re running corporate kernel level security software on your encrypted Linux server?

John Richard · edit-2 6 months ago

I guess it depends on what you consider corporate kernel level security. Would that include AppArmor, SELinux, and other tools that are open-source but used in some of the most secure corporate and government environments? Or are you asking if I’m running proprietary untrusted code on a Linux server with access to the system kernel?

@[email protected] · edit-2 6 months ago

deleted by creator

@[email protected] · 6 months ago

Tell me you’ve never administered at scale without telling me you’ve never administered at scale.

@[email protected] · 6 months ago

Bruh, disk encryption is not optional in many environments and dealing with unbootable LUKS Linux is pretty much on par with an unbootable Bitlocker Windows machine.

@[email protected] · 6 months ago

In this case, it’s really not a Linux/windows thing except by the most tenuous reasoning.

A corrupted piece of kernel level software is going to cause issues in any OS.
CrowdStrike itself has actually caused kernel panics on Linux before, albeit less because of a corrupted driver and more because of programming choices interacting with kernel behavior. (Two bugs: you shouldn’t have done that, and it shouldn’t have let you).

Tenuously, Linux is a better choice because it doesn’t need this type of software as much. It’s easier and more efficient to do packet inspection via dedicated firewall for infrastructure, and the other parts are already handled by automation and reporting tools you already use.
You still need something in this category if you need to solve the exact problem of “realtime network and filesystem event monitoring on each host”, but Linux makes it easier to get right up to that point without diving into the kernel.
Also vendors managing auto update is just less of a thing on Linux, so it’s more the cultural norm to manage updates in a way that’s conducive to staggering that would have caught this.
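The "staggering that would have caught this" can be sketched as a ring-based rollout: update a small canary group first and stop if too many hosts fail. This is a minimal illustration, not how any particular vendor or patch tool works; the ring names, the `apply_update` callback and the 5% threshold are all assumptions.

```python
def staged_rollout(rings, apply_update, max_failure_rate=0.05):
    """Apply an update one ring at a time and halt when a ring fails too often.

    rings: list of lists of host names, smallest (canary) ring first.
    apply_update: callable(host) -> True if the host came back healthy.
    Returns (touched_hosts, halted).
    """
    touched = []
    for ring in rings:
        failures = 0
        for host in ring:
            touched.append(host)
            if not apply_update(host):
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            # Halt here: later rings never receive the update.
            return touched, True
    return touched, False


# Hypothetical fleet: 2 canaries, 10 pilot hosts, 100 production hosts.
RINGS = [
    [f"canary-{i}" for i in range(2)],
    [f"pilot-{i}" for i in range(10)],
    [f"prod-{i}" for i in range(100)],
]

touched, halted = staged_rollout(RINGS, apply_update=lambda host: False)
print(len(touched), halted)  # 2 True
```

With an update that breaks every host, only the two canaries are touched and the other 110 machines keep running the old version.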

Contract-wise, I’m less confident that CrowdStrike has favorable terms.
It’s usually consumers who are saddled with atrocious terms because they have neither the power nor the interest to dig into the specifics too far.
Businesses, particularly ones that need or are interested in this category of software, inevitably have lawyers to go over contract terms in much more detail and much more ability to refuse terms and have it matter to the vendor. United airlines isn’t going to accept the contract terms of caveat emptor.

John Richard · 6 months ago

You assume that businesses operate in good faith, that they thoroughly review contracts to ensure they are fair and in the best interests of all their employees. Do you really think Greg, a VP of Cloud Solutions who makes 500k a year, who gets his IT advice on the golf course from AWS, Microsoft, & Oracle reps, who gets wined & dined almost weekly by these reps and handed a speaking spot at re:Invent, and who believes Gartner when it says spending $5 million a month on cloud hosting and $90/TB on egress traffic is normal, has the company’s best interests in mind?

I’ve seen companies pay millions for things they never used, or that weren’t ever provided by the vendor. You go to your managers, and say… “hey, why are we paying for this?” and suddenly you’re the bad guy. I’d love for you to prove me wrong. I’ve found pieces of progress before, within isolated teams when a manager wanted to actually accomplish something. It never lasts though… it’s like being an ice cube in a glass full of warm water.

@[email protected] · 6 months ago

There’s a big difference between “buying stuff you don’t need”, and “not having legal review a contract”, or “accepting terms that include no liability”.
Buying stuff you don’t need is in the authority of a VP seeing as their job is to make choices. Bypassing legal review and accounting diligence controls typically isn’t at any company big enough to matter.
I trust your hypothetical VP to not want to get fired from his nice job by skipping the paperwork for a done deal.

Do you honestly think that Amazon just didn’t read the contract? Microsoft? Google? The US government?

They’re getting sued, and they’re gonna have to pay some money. Cynicism is one thing, but taking it to the degree of believing that people are signing unread contracts that waive liability for direct, attributable damage caused by unprofessional negligence is just asinine.

@[email protected] · edit-2 6 months ago

Terms which should be void as this update was pushed to systems that explicitly disabled automatic updates.

Companies were literally raped by Crowdstrike.

/edit Sauce (bottom paragraph)

John Richard · 6 months ago

Companies were not raped by CrowdStrike. They were raped by their own ineptitude.

Nowhere have I seen evidence that these updates were disabled and still got pushed. I’m not saying it is impossible, but it’s unlikely if they followed any common sense and best practices. Usually, you’d be monitoring traffic and asking yourself why it is still checking for updates despite being disabled before deploying it to your entire IT infrastructure.

I see a lot of bad faith arguments here against CrowdStrike. I agree that they messed up, but it pales in comparison in my book to how messed up these companies are for not doing any basic planning around IT infrastructure & automation to be able to recover quickly.