It’s been an interesting morning

@[email protected] · 1 year ago

@[email protected] · 1 year ago

You are paying AWS to not have one big server, so you get high availability and dynamic load balancing as instances come and go.

I agree it’s not cheaper than being on prem. But you get a much higher quality solution.

Today at work, they decided to upgrade from an ancient Ubuntu version to a more recent one. Since they don’t use AWS properly, they treat servers as pets. So instead of creating a new instance, they upgraded Ubuntu in place on the existing one. This led to GRUB failing, and now they are troubleshooting how to mount disks etc.

All of this could easily be avoided by using the cloud properly.

@ElectricCattleman · 1 year ago

That could be avoided by using on prem properly, too. People are very capable of making bad infrastructure whether on prem or cloud.

@[email protected] · 1 year ago

Yep. Virtualization is not a unique selling point of the cloud, despite the benefits of it seeming to be one of the largest selling points.

@[email protected] · edit-2 1 year ago

I used to work on an on-premise object storage system where we required double digits of “nines” availability. High availability is not rocket science. Most scenarios are covered by having 2 or 3 machines.

I’d also wager that using the cloud properly is a different skillset than properly managing or upgrading a Linux system, not necessarily a cheaper or better one from a company point of view.

@[email protected] · 1 year ago

where we required double digits of “nines” availability

Do you mean 99% or 99.99999999%? Because 99.99999999% is absurd. Even Google doesn’t go near that for internal targets. That’s about 3 milliseconds of downtime per year. If a network hiccup causes 30s of downtime, you’ve blown through more than nine thousand years of error budget. If you’re talking durability, that’s another matter, but availability?

For ten-nines availability to make any sense, any dependent system would also have to have ten nines availability, and any calling system would have to have close to ten nines availability or it’s not worth ten nines on the called system.
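To put rough numbers on the dependency point: if a request only succeeds when every component in its path is up, and failures are treated as independent (an assumption, and a generous one), the availabilities multiply. The function name and the example figures below are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a path that needs every listed component up at once.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


# Hypothetical path: a ten-nines service reached over a 99.99% network
# by a caller that is itself 99.9% available.
path = [0.9999999999, 0.9999, 0.999]
print(f"end-to-end availability: {serial_availability(path):.6f}")
```

The weakest link dominates the result, which is why ten nines on one component buys almost nothing unless everything around it is in the same range.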

If the traffic ever goes over TCP/IP, or even just over Ethernet wires, never mind the public internet, ten nines sounds like overkill. Maybe if it stays within a mainframe computer, but you’d have to carefully audit that mainframe to ensure that every component involved also has approx ten nines.

If you mean 2 nines availability, that’s not high availability at all. That’s nearly 4 days of downtime a year. That’s enough that you don’t necessarily need a standby system, you just need to be able to repair the main one within a few hours if it goes down.

@[email protected] · edit-2 1 year ago

Sorry, yes, that was durability. I got it mixed up in my head. Availability had lower targets.

But I stand by the gist of my argument - you can achieve a lot with a live/live system, or a 3 node system with a master election, or…

High availability doesn’t have to equate to high cost or complexity, if you can take it into account when designing the system.
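A back-of-the-envelope sketch of why 2 or 3 machines go a long way: if the service is up as long as at least one node is up, and node failures are independent, the chance of all nodes being down at once shrinks quickly. Both conditions are assumptions, and the 99% per-node figure is only an example. Shared networks, bad deploys and the failover mechanism itself tend to break the independence assumption in practice.

```python
def redundant_availability(node_availability, nodes):
    """Availability when any single healthy node can serve traffic.

    Assumes independent node failures and instant, perfect failover.
    """
    all_down = (1 - node_availability) ** nodes
    return 1 - all_down


for nodes in (1, 2, 3):
    print(nodes, f"{redundant_availability(0.99, nodes):.6f}")
```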

@[email protected] · 1 year ago

you can achieve a lot with a live/live system, or a 3 node system with a master election, or…

“A lot”, sure, but not, say, 5 nines. 99.9% (8 hours of downtime per year) is reasonable. That’s enough time to fire up an instance in another location if that turns out to be necessary.

99.99% (50 minutes of downtime per year) is harder. It means you need automatic systems doing the switchover, geographical separation, people on call 24/7 to diagnose and fix any issue in minutes.

99.999% is only 5 minutes of downtime per year. At that rate, you can’t even afford for someone on call to respond. You do still want them on call to verify the automated systems did the work, but you need to rely on automated systems fully handling any possible emergency. The system needs to fail over perfectly without any human intervention. For that, a 3 node system isn’t enough. You need geographical redundancy, as well as redundancy within each geographic region. You need to be able to do software upgrades without affecting that redundancy, so you need at least a secondary 3-node system so that you can do a blue/green deployment, testing out handing over traffic to the new system with the ability to instantly roll back if something doesn’t work.

Each “nine” you add reduces the “error budget” by a factor of 10, so as you start getting above 4 or 5 nines, you really do start to need specialized engineering which tends to come with high cost and complexity.
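The factor-of-10 rule is easy to check with a few lines. This is only a sketch: it uses a 365-day year and treats the budget as total downtime per year, and the helper names are made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 in a non-leap year


def downtime_seconds_per_year(nines):
    """Yearly downtime allowed by an availability target of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds):
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.3g} seconds"


for nines in range(2, 11):
    print(f"{nines} nines: {humanize(downtime_seconds_per_year(nines))}")
```

Running it shows the budget dropping from days at 2 nines to milliseconds at 10 nines.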

For a typical Lemmy instance, 3 nines is probably good enough. 2 nines might even be acceptable if people aren’t paying. But, for something like Netflix, 8 hours of downtime per year is far too much. For something like a high frequency trading platform, 8 nines might not even be enough. For them, the custom engineering and obscene cost of chasing 7+ nines is worth it because every second of downtime could cost millions.

@[email protected] · 1 year ago

Agreed, but for many services 2 or 3 nines is acceptable.

For the cloud storage system I worked on it wasn’t, and that had different setups for different customers, from a simple 3 node system (the smallest setup, mostly for customers trialing the solution) to a 3 geo setup which has at least 9 nodes in 3 different datacenters.

For the financial system, we run a live/live/live setup, where we’re running a cluster with each of 3 different cloud operators, and the client is expected to know all of them and do failover. That obviously requires a little more complexity on the client side, but in many cases developers or organisations control both anyway.
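Client-side failover of that kind can be sketched roughly like this. It is not the actual client described above: the endpoint names, the `send` callback and the error handling are all placeholders for illustration.

```python
def call_with_failover(endpoints, send):
    """Try each endpoint in order and return the first successful response.

    `send` is a caller-supplied function (hypothetical here) that takes an
    endpoint and either returns a response or raises an exception.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except Exception as exc:  # deliberately broad for a sketch
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


# Stand-in transport: pretend the first operator is down.
def fake_send(endpoint):
    if endpoint == "cluster-a.example":
        raise ConnectionError("unreachable")
    return f"ok from {endpoint}"


clusters = ["cluster-a.example", "cluster-b.example", "cluster-c.example"]
print(call_with_failover(clusters, fake_send))
```

A real client would also need timeouts, some form of backoff, and a way to make retried requests idempotent, none of which is shown here.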

Netflix is obviously at another scale, I can’t comment on what their needs are, or how their solution looks, but I think it’s fair to say they are an exceptional case.

@computergeek125 · edit-2 1 year ago

We have on prem and do all our upgrades by burning the OS and moving the data, with the exception of the hypervisor OS (which has a pretty resilient bulk self upgrade built in, and we have a burn-the-OS plan documented for if they do crash). Even system file corruption of a random pet server? New VM and reattach the data disk. Need high availability? Throw F5 or HAProxy at the problem (assuming L7 protocol support).

Both cloud and on prem can work equally well when done right. The most important part is to understand that both have different types of cost (human, machine, developer) and to make the right choice based on your/your customer’s needs and any applicable laws or regulations about data locality. And yeah, sometimes one will be better for someone and not someone else.

Seven figures of cloud engineering can’t solve stupid, but neither can seven figures of datacenter. This isn’t some Sith/Jedi concept where you have hard definitions of dark and light or good and evil - though sometimes both will see each other as the enemy, and they are in a way competitors.