Fault in CrowdStrike caused airports, businesses and healthcare services to languish in ‘largest outage in history’

Services began to come back online on Friday evening after an IT failure that wreaked havoc worldwide. But full recovery could take weeks, experts have said, after airports, healthcare services and businesses were hit by the “largest outage in history”.

Flights and hospital appointments were cancelled, payroll systems seized up and TV channels went off air after a botched software upgrade hit Microsoft’s Windows operating system.

The faulty update came from the US cybersecurity company CrowdStrike, and left workers facing a “blue screen of death” as their computers failed to start. Experts said every affected PC may have to be fixed manually, but as of Friday night some services had started to recover.

As recovery continues, experts say the outage has underscored concerns that many organizations are not well prepared to implement contingency plans when a single point of failure, such as an IT system or a piece of software within it, goes down. Such outages will happen again, they warn, until more redundancy is built into networks and organizations introduce better backups.

  • @TheDemonBuer
    46 points · 4 months ago

    Here’s an idea: don’t give one company kernel-level access to the OS of millions of PCs that are necessary to keep whole industries functioning.

    • @ansiz
      28 points · 4 months ago

      I mean, Microsoft itself regularly shits the bed with updates, even with Defender updates. It’s the nature of security: these tools have to have that kind of access to stop legit malware. That’s why these kinds of outages happen every few years. This one just got too much coverage from the banking and airline issues. And I’m sure future outages will continue to get similar coverage.

      But the CrowdStrike CEO was also at McAfee in 2010 when they shit the bed and shut down millions of XP machines, so it seems like he needs a different career…

      • @[email protected]
        10 points · 4 months ago

        The problem is the monoculture. We are fucking addicted to convenience and efficiency at all costs.

        A diverse ecosystem, even if it’s a bit more work to manage, is much more resilient, and wouldn’t have produced a catastrophe like this.

        Our technology is great, but our processes suck. Standardization. Just-in-time. These ideas create incredibly fragile organizations. Humanity is so short-sighted. We are screwed.

        • @krashmo
          5 points · 4 months ago

          That seems like a pretty hardcore doomer view for an event that didn’t really do much in the grand scheme of things. I wouldn’t have even known it happened if it wasn’t all over the internet, and I work in tech to boot.

        • @[email protected]
          3 points · 4 months ago

          Time is money. Training all of the staff needed to manage not just one system in multiple areas, but multiple systems in multiple areas, is a horrible idea. Sure, for a one-off issue like this it would save your bacon. But how often does this really happen?

      • billwashere
        1 point · 4 months ago

        I’m not sure you can blame the CEO. As much as I despise C-level execs, this seems like a failure at a much lower level. Now, whether this is a culture failure is a different story, because to me that DOES come from the CEO, or at least from that level.

      • JackbyDev
        1 point · 4 months ago (edited)

        This happened to me in December 2022/January 2023. Pretty similar problem, except a regular Windows update caused it. Weirdly, it didn’t affect everyone (and I’m not on any sort of beta channels). Installing KB5021233 kept causing BSOD 0xc000021a.

        After installing KB5021233, there might be a mismatch between the file versions of hidparse.sys in C:\Windows\System32 and C:\Windows\System32\drivers (assuming Windows is installed to your C: drive), which might cause signature validation to fail when cleanup occurs.
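
        If anyone wants to check whether a machine has that mismatch, here’s a rough sketch. It compares file hashes rather than the actual version metadata, and it assumes the standard install paths on C:, so treat it as illustrative rather than authoritative:

        ```python
        # Rough check for the hidparse.sys mismatch described above.
        # Hashing is a stand-in for comparing file-version metadata;
        # paths assume a standard Windows install on C:.
        import hashlib
        from pathlib import Path

        SYSTEM32 = Path(r"C:\Windows\System32\hidparse.sys")
        DRIVERS = Path(r"C:\Windows\System32\drivers\hidparse.sys")

        def sha256(path: Path) -> str:
            return hashlib.sha256(path.read_bytes()).hexdigest()

        if not SYSTEM32.exists() or not DRIVERS.exists():
            print("one of the hidparse.sys copies is missing - nothing to compare")
        elif sha256(SYSTEM32) != sha256(DRIVERS):
            print("hidparse.sys copies differ - this machine may hit the 0xc000021a BSOD")
        else:
            print("hidparse.sys copies match")
        ```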

      • @emax_gomax
        0 points · 4 months ago

        How difficult would it be for companies to have staged releases or oversee upgrades themselves? I mostly just use Linux, but upgrading is a relatively painless process, and logging into remote machines to trigger an update is no harder. Why is this something an independent party should be able to do without end-user discretion?
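
        For what it’s worth, the staged-release idea isn’t exotic. Here’s a toy sketch of a ring-based rollout; the ring names, sizes and the health check are all made up for illustration:

        ```python
        # Toy sketch of a ring-based (staged) rollout: push an update to a
        # small slice of the fleet first, check health, then widen.
        RINGS = [
            ("canary", 0.01),  # 1% of hosts get the update first
            ("early", 0.10),   # then 10%
            ("broad", 1.00),   # then everyone
        ]

        def install(host: str, update: str) -> None:
            """Pretend to push the update to one host (real code would call your tooling)."""

        def healthy(batch: list[str]) -> bool:
            """Placeholder health check; in reality, watch crash/boot telemetry for the batch."""
            return True

        def staged_rollout(hosts: list[str], update: str) -> bool:
            updated = 0
            for ring, fraction in RINGS:
                target = int(len(hosts) * fraction)
                batch = hosts[updated:target]
                for host in batch:
                    install(host, update)
                if not healthy(batch):
                    print(f"halting rollout in ring '{ring}' and pulling the update")
                    return False
                updated = target
                print(f"ring '{ring}' ok: {updated}/{len(hosts)} hosts updated")
            return True

        staged_rollout([f"host-{i}" for i in range(100)], "sensor-7.11")
        ```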

    • @[email protected]
      14 points · 4 months ago (edited)

      So we should have five different cyber security solutions at any given site? That wheezing is the sound of every IT person on the planet queuing up to swing a sock full of nickels at you.

      Crowdstrike was near ubiquitous because it was the best tool out there. And plenty of threats were prevented because of it.

      The answer isn’t to force every single site to manage everything themselves. It is to increase oversight of CI/CD models.

      • HubertManne
        12 points · 4 months ago

        I read his comment as being more about the kernel-level access than about it being one company.

        • @[email protected]
          3 points · 4 months ago

          Like it or not, that is the most effective way to collect the data these solutions need.

          This isn’t Riot’s anti-cheat, where the effectiveness is questionable. Crowdstrike was demonstrably amazing at its job.

          • Riskable
            8 points · 4 months ago (edited)

            Crowdstrike has clients that run on macOS and Linux. Only the Windows version requires kernel-level access. I believe it has something to do with the absolute shitshow that is the Windows security model, but it might also be because it runs a 31-year-old filesystem that still doesn’t allow one process to read another process’s files while they’re open.

            • @[email protected]
              2 points · 4 months ago

              There have been issues with Linux and Mac clients in the past. Not to this scale but market share is very much a factor.

              Kernel access is a mess, but it is also important to understand that even less privileged software can cause problems.

              I do firmly believe more hardware should run Linux but it is also important to understand the support burden. But, regardless, that is a different conversation.

              • @[email protected]
                1 point · 4 months ago

                Less privileged software can also cause problems, but you can limit the scope in which those problems can occur.

      • @TheDemonBuer
        3 points · 4 months ago

        Crowdstrike was near ubiquitous because it was the best tool out there.

        I understand the reason for it, but that ubiquity comes with potential dangers, as we saw on Friday. But, no, I don’t think the solution is “five different cyber security solutions” at every site. However, different cyber security solutions for different industries might not be such a bad idea. Or, I suppose the root of the problem might be the ubiquity of the OS. Should every PC be running the same jack of all trades but master of none OS?

        • @[email protected]
          3 points · 4 months ago

          Again, all you are doing is increasing complexity and punting it to support staff who are likely unqualified to even know what Crowdstrike did.

          This was one of those rare cases of capitalism working. There are many options. There was one that was miles ahead of all the others, and it dominated.

            • @sandalbucket
              1 point · 4 months ago

              I want to spin up a separate thread here if that’s okay.

              Please give me an example of any EDR solution produced through “public ownership structures”. I don’t think such a thing exists, but I welcome being proven wrong.

            • @sandalbucket
              1 point · 4 months ago

              Private ownership and investment of capital created Crowdstrike as a profit-seeking venture. It also created MS Defender, SentinelOne, Trellix, Carbon Black, etc. Competition in the marketplace (and there was/is lots of competition) forced these products to be as good as they could be, and/or to self-stratify into pricing tiers. Crowdstrike, being the best (and most expensive), is the most widely used. Note that not every enterprise requires that level of security, so while CS is widely used, it is not ubiquitous. This outage could have been significantly worse.

    • AtHeartEngineer
      2 points · 4 months ago

      Also the obligatory: “don’t run infrastructure on Microsoft products, run Linux”

  • @[email protected]
    24 points · 4 months ago

    I’m actually pretty excited to go to work on Monday.

    We have spent the past few years hardening our security and simplifying our critical systems. One way of doing that was to move as much off Microsoft as possible.

    And since I’ve been on vacation for the past week, I’m either going to walk into a nightmare shit show or everyone is going to be cheering that we are fully operational since we don’t depend on Microsoft.

    • @krashmo
      3 points · 4 months ago

      I don’t get how you’re not sure what you will find if you don’t use Windows.

      • @Frozengyro
        4 points · 4 months ago

        They haven’t moved everything off Windows, just as much as they could.

      • @[email protected]
        3 points · 4 months ago (edited)

        I’m not on that team. For all I know, our password manager might be on a random Windows server. Or some middleware.

        In a major company where each team does their own thing and communicates through endpoints, it’s impossible to know every configuration.

      • @[email protected]
        2 points · 4 months ago

        Was thinking about this (a week later).

        Following up: our partner/affiliate sites were down. Each partner connects to us to submit data, and half of them were government contracts that went down. It didn’t affect our systems, but it affected how we provide services to them.

        So it was a mild shit show.

        • @krashmo
          2 points · 4 months ago

          Well, that’s not so bad, all things considered. Glad it wasn’t worse, anyway.

    • @arin
      2 points · 4 months ago

      Crowdstrike shit the bed a few months ago with Linux systems.

    • @[email protected]
      1 point · 4 months ago

      Good luck. I hope that even if it is chaos when you get there, it’s at least popcorn-tier chaos.

  • Flying Squid
    13 points · 4 months ago

    C-suite to experts: Are the future risks short term or long term? Specifically longer term than my golden parachute?

    • @hedgehogging_the_bed
      5 points · 4 months ago

      This is why “they are the biggest” isn’t a good reason to pick a vendor. If all these companies had been using different providers, or even different OSes, it wouldn’t have hit so many systems simultaneously. This is a result of too much consolidation at all levels, and one issue with the Microsoft OS monopoly.

      • @[email protected]
        4 points · 4 months ago

        The issue, in this case, is more about Crowdstrike’s broad usage than Microsoft’s. The update that crippled everything was to the Crowdstrike Falcon Sensor software, not to the OS.

        Funnily enough, they had a similar issue with an update to the Linux version of the software a few months ago; it didn’t have these broad-reaching consequences, largely because of the smaller Linux user base. Which means this is starting to look like a pattern, and there are going to need to be some serious process changes at Crowdstrike to prevent things like this in the future.

        Anybody’s guess if those changes happen or not.

        • @[email protected]
          2 points · 4 months ago

          The real surprise to me is not the software, company or OS issues, but rather so many companies just blindly pushing untested updates to their prod environments. This was, and will continue to be, a risk with anything they trust so implicitly. Feels like the security folks just totally failed Dev 101.
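
          Even a trivial pre-release gate would catch the dumbest failures. Here’s a toy sketch of what “don’t ship it untested” can mean in practice; the file format and the checks are invented for illustration, not how any vendor actually packages updates:

          ```python
          # Toy pre-release gate: refuse to publish an update artifact unless
          # it parses and passes basic sanity checks. The format and rules
          # here are invented for illustration.
          import json
          import sys
          from pathlib import Path

          def validate(artifact: Path) -> list[str]:
              problems = []
              raw = artifact.read_bytes()
              if not raw.strip(b"\x00").strip():
                  problems.append("artifact is empty or all zero bytes")
                  return problems
              try:
                  config = json.loads(raw)
              except ValueError as exc:
                  problems.append(f"artifact does not parse: {exc}")
                  return problems
              if not isinstance(config, dict) or not config.get("rules"):
                  problems.append("artifact contains no rules")
              return problems

          if __name__ == "__main__":
              issues = validate(Path(sys.argv[1]))
              if issues:
                  print("refusing to publish:", *issues, sep="\n  - ")
                  sys.exit(1)
              print("artifact looks sane, ok to promote to the next ring")
          ```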

          • @[email protected]
            1 point · 4 months ago

            I know that in at least some of the BSOD cases, it was an automatic update that wasn’t possible to delay. An acquaintance of mine told me that they had previously complained to their IT support about the disruption of auto-updates at inopportune times, but IT said it’s out of their hands for security updates because of regulatory requirements.

          • @Frozengyro
            1 point · 4 months ago

            Or the security folks are doing the best they can with a shoestring budget.

    • @credo
      2 points · 4 months ago

      After. Action. Review.

      • AtHeartEngineer
        1 point · 4 months ago

        Should be done even when things go right.

  • @Z3k3
    4 points · 4 months ago

    This would have been a fun MIR had my systems been impacted.

  • billwashere
    3 points · 4 months ago

    Still say not allowing untested updates in a production environment fixes this. I don’t care if it’s a README file, don’t update without testing.