CrowdStrike downtime apparently caused by update that replaced a file with 42kb of zeroes

Aatube · edited 7 months ago

@CeeBee_Eh · edited 7 months ago

The following:

An internal backup of previous configs
Encrypted copies
Massive warnings in the system that the currently loaded config has failed its integrity check

There’s a load of other checks that could be employed. This is literally no different than securing the OS itself.
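To make the list above concrete, here is a minimal sketch of "validate, keep a known-good backup, warn loudly". It is only an illustration under my own assumptions, not how CrowdStrike's sensor actually works: the function names, the file paths, and the optional `expected_sha256` argument are all hypothetical, and a real kernel driver would do this in a very different environment.

```python
import hashlib
import logging
from pathlib import Path
from typing import Optional

log = logging.getLogger("config_loader")  # hypothetical logger name


def looks_valid(path: Path, expected_sha256: Optional[str] = None) -> bool:
    """Reject files that are missing, empty, all zero bytes, or hash-mismatched."""
    if not path.is_file():
        return False
    data = path.read_bytes()
    if not data or data.count(0) == len(data):
        return False  # e.g. a file that is nothing but zeroes
    if expected_sha256 is not None:
        return hashlib.sha256(data).hexdigest() == expected_sha256
    return True


def load_config(current: Path, backup: Path,
                expected_sha256: Optional[str] = None) -> bytes:
    """Load the current config, falling back to the last known-good copy."""
    if looks_valid(current, expected_sha256):
        data = current.read_bytes()
        backup.write_bytes(data)  # remember this version as known-good
        return data
    # Loud warning instead of a crash: the operator must see this.
    log.critical("config %s failed integrity check, falling back to %s",
                 current, backup)
    if looks_valid(backup):
        return backup.read_bytes()
    raise RuntimeError("no valid config available")
```

With something like this, an update that swaps the file for zeroes triggers a critical log entry and the previous config keeps running, rather than the loader choking on garbage.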

This is essentially a solved problem, but even then it’s impossible to make any system 100% secure. As the person you replied to said: “this is poor code”

Edit: just to add, failure of the system to boot should NEVER be the desired outcome, especially when the party implementing that is a 3rd party service. The people who set up these servers expect them to be running so that everything else works. There is nothing to gain from a non-booting critical system and literally EVERYTHING to lose. If it’s critical then it must be operational.

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 7 months ago

The 3rd party service is AV. You do not want to boot a potentially compromised or insecure system that is unable to start its AV properly, and have it potentially access other critical systems. That’s a recipe for a perhaps more local but also more painful disaster. It makes sense that a critical enterprise system does not boot if something is off. No AV means the system is a security risk and should not boot and connect to other critical/sensitive systems, period.

These sorts of errors should be alleviated through backup systems and prevented by not auto-updating these sorts of systems.

Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

@CeeBee_Eh · 6 months ago

> Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

You have that backwards. I work as a dev and system admin for a medium-sized company. You absolutely do not want any server to ever not boot. You absolutely want to know immediately that there’s an issue that needs to be addressed ASAP, but a loss of service generally means loss of revenue and, even worse, a loss of reputation. If your server is briefly at a lower protection level, that’s not an issue unless you’re actively being targeted and attacked. And if that is the case, being notified of the issue lets people deal with it immediately.

𝙲𝚑𝚊𝚒𝚛𝚖𝚊𝚗 𝙼𝚎𝚘𝚠 · 6 months ago

A single server not booting should not usually lead to a loss of service as you should always run some sort of redundancy.

I’m a dev for a medium-sized PSP that, because of who our customers are, does occasionally get targeted by malicious actors, including state actors. We build our services to be highly available, e.g. a server not booting would automatically trigger a failover to another one, and if that fails several alerts will go off so that the sysadmins can investigate.
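Roughly, the failover logic described above looks like the sketch below. This is a simplified illustration, not our actual setup: `pick_healthy`, the `is_healthy` probe, and the `alert` callback are hypothetical names, and in practice this job is usually done by a load balancer or orchestrator rather than hand-written code.

```python
from typing import Callable, Iterable, Optional


def pick_healthy(servers: Iterable[str],
                 is_healthy: Callable[[str], bool],
                 alert: Callable[[str], None]) -> Optional[str]:
    """Return the first server that passes its health check.

    Every failed check raises an alert, and a final alert fires
    when no server at all is available.
    """
    for server in servers:
        if is_healthy(server):
            return server
        alert(f"{server} failed health check, trying next")
    alert("no healthy server available")
    return None


if __name__ == "__main__":
    # Hypothetical state: app-1 did not boot, app-2 is fine.
    state = {"app-1": False, "app-2": True}
    chosen = pick_healthy(["app-1", "app-2"], lambda s: state[s], print)
    print("routing traffic to", chosen)
```

The point is that one machine refusing to boot costs an alert and a failover, not the service.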

Temporary loss of service does lead to reputational damage, but if contained most of our customers tend to be understanding. However, if a malicious actor could gain entry to our systems the damage could be incredibly severe (depending on what they manage to access of course), so much so that we prefer the service to stop rather than continue in a potentially compromised state. What’s worse: service disrupted for an hour or tons of personal data leaked?

Of course, your threat model might be different and a compromised server might not lead to severe damage. But Crowdstrike/Microsoft/whatever may not know that, and thus opt for the most “secure” option, which is to stop the boot process.

christian_taillon (@christian_tail)