Companies are often insane. I’m working in one who has this one guy build a super complicated architecture, because he don’t know aws. So instead of just using a message queue on aws, he is building Java programs and tons of software and containers to try and send messages in a reliable way. Costs the company huge money, but they don’t care, since he is some old timer who has been there for like 10 years and everyone let’s him do what he wants.
It’s a different form of lock-in since it’s just his creation. When he leaves, all of this will be very hard to maintain and the company will probably rebuild it all on aws.
I have been bringing this up but they say that it’s too late to change direction now (they are afraid to upset the guy).
But I’m looking on the bright side. I get to learn a lot of stuff I otherwise I wouldnt if this was a single managed aws service. I’m bringing in terraform and instead of just putting a message queue there, I need to spin up entire architectures to run his ec2 instances with all the apps and everything required to make things work.
Takes months… So for me it’s fun. I don’t have to pay for it. But companies are crazy. :)
Yeah great but what about when he dies and nobody else knows how it works? I’ve had to deal with that more than once (creator of Blackboard and creator of IP Office Contact Center, when they died so did the product)
I personally always try to engineer away from cloud services. They cost you ridiculous amounts of money and all you need is documentation afterwards. Then it can be easier and faster than AWS or GC
Got to agree with @[email protected] here, although it depends on the scope of your service or project.
Cloud services are good at getting you up and running quickly, but they are very, very expensive to scale up.
I work for a financial services company, and we are paying 7 digit monthly AWS bills for an amount of work that could realistically be done with one really big dedicated server. And now we’re required to support multiple cloud providers by some of our customers, we’ve spent a TON of effort trying to untangle from SQS/SNS and other AWS specific technologies.
Clouds like to tell you:
Using the cloud is cheaper than running your own server
Using cloud services requires less manpower / labour to maintain and manage
It’s easier to get up and running and scale up later using cloud services
The last item is true, but the first two are only true if you are running a small service. Scaling up on a cloud is not cost effective, and maintaining a complicated cloud architecture can be FAR more complicated than managing a similar centralized architecture.
I worked in operations for a large company that had their own 50,000 sq ft data center with 2000 physical servers, uncountable virtual servers, backup tape robots, etc… Their cooling bill would like to disagree with your assessment about scaling. I was unpacking new servers regularly because, when you own you own servers, not only do you have to buy them, but you have to house them (so much rented space), run them, fix them, cool them, and replace them.
Don’t get me wrong, I’ve also seen the AWS bill for another large company I worked for and that was staggering. But, we were a smaller tech team and didn’t require a separate ops group specifically to maintain the physical servers.
You are paying aws to not have one big server, so you get high availability and dynamic load balancing as instances come and go.
I agree its not cheaper than being on prem. But it’s much higher quality solutions.
Today at work, they decided to upgrade from ancient Ubuntu version to a more recent version. Since they don’t use aws properly, they treat servers as pets. So to upgrade Ubuntu, they actually upgraded Ubuntu on the instance instead of creating a new one. This led to grub failing and now they are troubleshooting how to mount disks etc.
All of this could easily be avoided by using the cloud properly.
I used to work on an on premise object storage system before, where we required double digits of “nines” availability. High availability is not rocket science. Most scenarios are covered by having 2 or 3 machines.
I’d also wager that using the cloud properly is a different skillset than properly managing or upgrading a Linux system, not necessarily a cheaper or better one from a company point of view.
where we required double digits of “nines” availability
Do you mean 99% or 99.99999999%? Because 99.99999999% is absurd. Even Google doesn’t go near that for internal targets. That’s 1/3 of a second per year of downtime. If a network hiccup causes 30s of downtime, you’ve blown through a century of error budget. If you’re talking durability, that’s another matter, but availability?
For ten-nines availability to make any sense, any dependent system would also have to have ten nines availability, and any calling system would have to have close to ten nines availability or it’s not worth ten nines on the called system.
If the traffic ever goes over TCP/IP, not even if it ever goes over the public internet, if it ever goes over Ethernet wires, ten nines sounds like overkill. Maybe if it stays within a mainframe computer, but you’d have to carefully audit that mainframe to ensure that every component involved also has approx ten nines.
If you mean 2 nines availability, that’s not high availability at all. That’s nearly 4 days of downtime a year. That’s enough that you don’t necessarily need a standby system, you just need to be able to repair the main one within a few hours if it goes down.
you can achieve a lot with a live/live system, or a 3 node system with a master election, or…
“A lot”, sure, but not say 5 nines. 99.9% (8 hours of downtime per year), is reasonable. That’s enough time to fire up an instance in another location if that turns out to be necessary.
99.99% (50 minutes of downtime per year) is harder. It means you need automatic systems doing the switchover, geographical separation, people on call 24/7 to diagnose and fix any issue in minutes.
99.999% is only 5 minutes of downtime per year. At that rate, you can’t even afford for someone on call to respond. You do still want them on call to verify the automated systems did the work, but you need to rely on automated systems fully handling any possible emergency. The system needs to fail over perfectly without any human intervention. For that, a 3 node system isn’t enough. You need geographical redundancy, as well as redundancy within each geographic region. You need to be able to do software upgrades without affecting that redundancy, so you need at least a secondary 3-node system so that you can do a blue/green deployment, testing out handing over traffic to the new system with the ability to instantly roll back if something doesn’t work.
Each “nine” you add reduces the “error budget” by a factor of 10, so as you start getting above 4/5 nines, you really do start to need specialized engineering which tends to come with high cost and complexity.
For a typical Lemmy instance, 3 nines is probably good enough. 2 nines might even be acceptable if people aren’t paying. But, for something like Netflix, 8 hours of downtime per year is far too much. For something like a high frequency trading platform, 8 nines might not even be enough. For them, the custom engineering and obscene cost of chasing 7+ nines is worth it because every second of downtime could cost millions.
We have on prem and do all our upgrades by burn the OS and move the data, with the exception of the hypervisor OS (which has a pretty resilient bulk self upgrade built in, and we have a burn-the-OS plan documented for if they do crash). Even system file corruption of a random pet server? New VM and reattach the data disk. Need high availability? Throw F5 or HAProxy at the problem (assuming L7 protocol support).
Both cloud and on prem can work equally when done right. The most important part is to understand that both have different types of cost (human, machine, developer) and to make the right choice based your/your customer’s needs and any applicable laws or regulations about data locality. And yeah, sometimes one will be better for someone and not someone else.
Seven figures of cloud engineering can’t solve stupid, but neither can seven figures of datacenter. This isn’t some Sith/Jedi concept where you have hard definitions of dark and light or good and evil - though sometimes both will see each other as the enemy, and they are in a way competitors.
I would argue that most cloud native services existed in their standalone forms way before public clouds made their own versions. For example there are loads of message queue systems that are just as easy to incorporate and are cloud agnostic, some of them are FOSS. Sure you can reinvent the wheel but in most cases something like RabbitMQ will work OK depending on the use case. Having cloud vendor lock in is where cost catches up with you. Complexity is arbitrary since there are ways to make anything overcomplicated.
RabbitMQ is more expensive on AWS than e.g. SNS/SQS. It’s not a coincidence, you’re trading lock-in for a cheaper price.
The increased complexity comes from the fact you will need some components which exist in either managed, but vendor lock-in form, or you need to spin them up / managed yourself.
Right, paying for managed services whether cloud native or not is pretty much the same thing, it hurts in the pocket. Spinning up your own RabbitMQ on a VM is both cheap and cloud agnostic, especially if sized right.
One time I rewrote an Azure function to make it slightly more efficient. The cost savings were ~$50k /yr. Cloud services have their place but it is amazing how quickly the costs can spiral out of control.
What the company likes about the old timer is that because he has been there for 10 years, he will likely be there for the next 10 years to support the complicated system he is creating now. If a younger team member creates something using a modern approach, there is the risk they will leave in a years time and no one knows how the system works.
No one knows how to use a well documented, publicly available service? No, I’d argue that no one knows how to use a private, internal only, custom solution.
That because you’re an engineer (I assume). The people signing off on these kinds of projects don’t know enough themselves, so they go to someone they trust (the old timers) to help them make the decision. The old timers don’t keep up with new tech, so we keep reinventing the wheel.
“keeping up with new tech” is often just re-inventing the wheel. If it isn’t broke, and can still be maintained, then why break it because you like the flavor of the week?
In this case, we’re talking about the OPs example of someone implementing a complex message passing architecture in Java instead of using an off the shelf solution. There are devs with 20+ years at the same company who don’t know the basics of networking/cloud, because they haven’t improved their technical skills much in those 20 years and instead focused on corporate politics. Those are the people who tend to gets asked for advice from upper management.
So he’ll rip an even bigger hole, when he is retiring because the company never bothered to get a new solution running. Then they get a hydra of legacy code that is poorly documented and probably using some old hacks based on even older forum posts, nowhere to be found again.
Oh god. I do a lot of PowerShell scripting at my place, and less than half my team is proficient in it. My co-workers who are almost never write comments in their scripts. Meanwhile, if it’s anything that will live longer than ~5 manual runs, I spemd more time on comments and documentation than scripting.
That effort is valued, but I’m shocked that my team isn’t more aware of the need for documentation. We literally experienced the “bus factor” situation a few years ago.
We have someone at my company who has been here for 30 years that gets to do whatever he wants basically - but what he builds is great. He doesn’t even have a CS degree or anything related, he started as a paralegal who wanted to make his life easier, and has built several iterations of the software that the entire company uses. He’s now my boss, running the data engineering and science department and I gotta say that he’s genuinely great. The only bad things I’ve run across that he’s built are things that he explicitly told management were meant to be just a quick bandaid fix to a problem to buy time for a full fledged solution… and they kept it as the full fledged solution. The stuff still works, it’s just awful to make updates or change to
Companies are often insane. I’m working in one who has this one guy build a super complicated architecture, because he don’t know aws. So instead of just using a message queue on aws, he is building Java programs and tons of software and containers to try and send messages in a reliable way. Costs the company huge money, but they don’t care, since he is some old timer who has been there for like 10 years and everyone let’s him do what he wants.
No vendor look-in with his solution though.
It’s a different form of lock-in since it’s just his creation. When he leaves, all of this will be very hard to maintain and the company will probably rebuild it all on aws.
I have been bringing this up but they say that it’s too late to change direction now (they are afraid to upset the guy).
But I’m looking on the bright side. I get to learn a lot of stuff I otherwise I wouldnt if this was a single managed aws service. I’m bringing in terraform and instead of just putting a message queue there, I need to spin up entire architectures to run his ec2 instances with all the apps and everything required to make things work.
Takes months… So for me it’s fun. I don’t have to pay for it. But companies are crazy. :)
Yeah great but what about when he dies and nobody else knows how it works? I’ve had to deal with that more than once (creator of Blackboard and creator of IP Office Contact Center, when they died so did the product)
I personally always try to engineer away from cloud services. They cost you ridiculous amounts of money and all you need is documentation afterwards. Then it can be easier and faster than AWS or GC
You’re the guy 1984 was talking about…
Got to agree with @[email protected] here, although it depends on the scope of your service or project.
Cloud services are good at getting you up and running quickly, but they are very, very expensive to scale up.
I work for a financial services company, and we are paying 7 digit monthly AWS bills for an amount of work that could realistically be done with one really big dedicated server. And now we’re required to support multiple cloud providers by some of our customers, we’ve spent a TON of effort trying to untangle from SQS/SNS and other AWS specific technologies.
Clouds like to tell you:
The last item is true, but the first two are only true if you are running a small service. Scaling up on a cloud is not cost effective, and maintaining a complicated cloud architecture can be FAR more complicated than managing a similar centralized architecture.
I worked in operations for a large company that had their own 50,000 sq ft data center with 2000 physical servers, uncountable virtual servers, backup tape robots, etc… Their cooling bill would like to disagree with your assessment about scaling. I was unpacking new servers regularly because, when you own you own servers, not only do you have to buy them, but you have to house them (so much rented space), run them, fix them, cool them, and replace them.
Don’t get me wrong, I’ve also seen the AWS bill for another large company I worked for and that was staggering. But, we were a smaller tech team and didn’t require a separate ops group specifically to maintain the physical servers.
If you really need the scale of 2000 physical machines, you’re at a scale and complexity level where it’s going to be expensive no matter what.
And I think if you need that kind of resources, you’ll still be cheaper of DIY.
You are paying aws to not have one big server, so you get high availability and dynamic load balancing as instances come and go.
I agree its not cheaper than being on prem. But it’s much higher quality solutions.
Today at work, they decided to upgrade from ancient Ubuntu version to a more recent version. Since they don’t use aws properly, they treat servers as pets. So to upgrade Ubuntu, they actually upgraded Ubuntu on the instance instead of creating a new one. This led to grub failing and now they are troubleshooting how to mount disks etc.
All of this could easily be avoided by using the cloud properly.
That could be avoided by using on prem properly, too. People are very capable of making bad infrastructure whether on prem or cloud.
Yep. Virtualization is not a unique selling point of the cloud, despite the benefits of it seeming to be one of the largest selling points.
I used to work on an on premise object storage system before, where we required double digits of “nines” availability. High availability is not rocket science. Most scenarios are covered by having 2 or 3 machines.
I’d also wager that using the cloud properly is a different skillset than properly managing or upgrading a Linux system, not necessarily a cheaper or better one from a company point of view.
Do you mean 99% or 99.99999999%? Because 99.99999999% is absurd. Even Google doesn’t go near that for internal targets. That’s 1/3 of a second per year of downtime. If a network hiccup causes 30s of downtime, you’ve blown through a century of error budget. If you’re talking durability, that’s another matter, but availability?
For ten-nines availability to make any sense, any dependent system would also have to have ten nines availability, and any calling system would have to have close to ten nines availability or it’s not worth ten nines on the called system.
If the traffic ever goes over TCP/IP, not even if it ever goes over the public internet, if it ever goes over Ethernet wires, ten nines sounds like overkill. Maybe if it stays within a mainframe computer, but you’d have to carefully audit that mainframe to ensure that every component involved also has approx ten nines.
If you mean 2 nines availability, that’s not high availability at all. That’s nearly 4 days of downtime a year. That’s enough that you don’t necessarily need a standby system, you just need to be able to repair the main one within a few hours if it goes down.
Sorry, yes, that was durability. I got it mixed up in my head. Availability had lower targets.
But I stand by the gist of my argument - you can achieve a lot with a live/live system, or a 3 node system with a master election, or…
High availability doesn’t have to equate high cost or complexity, if you can take it into account when designing the system.
“A lot”, sure, but not say 5 nines. 99.9% (8 hours of downtime per year), is reasonable. That’s enough time to fire up an instance in another location if that turns out to be necessary.
99.99% (50 minutes of downtime per year) is harder. It means you need automatic systems doing the switchover, geographical separation, people on call 24/7 to diagnose and fix any issue in minutes.
99.999% is only 5 minutes of downtime per year. At that rate, you can’t even afford for someone on call to respond. You do still want them on call to verify the automated systems did the work, but you need to rely on automated systems fully handling any possible emergency. The system needs to fail over perfectly without any human intervention. For that, a 3 node system isn’t enough. You need geographical redundancy, as well as redundancy within each geographic region. You need to be able to do software upgrades without affecting that redundancy, so you need at least a secondary 3-node system so that you can do a blue/green deployment, testing out handing over traffic to the new system with the ability to instantly roll back if something doesn’t work.
Each “nine” you add reduces the “error budget” by a factor of 10, so as you start getting above 4/5 nines, you really do start to need specialized engineering which tends to come with high cost and complexity.
For a typical Lemmy instance, 3 nines is probably good enough. 2 nines might even be acceptable if people aren’t paying. But, for something like Netflix, 8 hours of downtime per year is far too much. For something like a high frequency trading platform, 8 nines might not even be enough. For them, the custom engineering and obscene cost of chasing 7+ nines is worth it because every second of downtime could cost millions.
We have on prem and do all our upgrades by burn the OS and move the data, with the exception of the hypervisor OS (which has a pretty resilient bulk self upgrade built in, and we have a burn-the-OS plan documented for if they do crash). Even system file corruption of a random pet server? New VM and reattach the data disk. Need high availability? Throw F5 or HAProxy at the problem (assuming L7 protocol support).
Both cloud and on prem can work equally when done right. The most important part is to understand that both have different types of cost (human, machine, developer) and to make the right choice based your/your customer’s needs and any applicable laws or regulations about data locality. And yeah, sometimes one will be better for someone and not someone else.
Seven figures of cloud engineering can’t solve stupid, but neither can seven figures of datacenter. This isn’t some Sith/Jedi concept where you have hard definitions of dark and light or good and evil - though sometimes both will see each other as the enemy, and they are in a way competitors.
Yup, if your solution is not cloud agnostic you’ve fucked up.
Being cloud-agnostic also means additional cost/complexity.
Sometimes the only way to win the game is by not playing it.
I would argue that most cloud native services existed in their standalone forms way before public clouds made their own versions. For example there are loads of message queue systems that are just as easy to incorporate and are cloud agnostic, some of them are FOSS. Sure you can reinvent the wheel but in most cases something like RabbitMQ will work OK depending on the use case. Having cloud vendor lock in is where cost catches up with you. Complexity is arbitrary since there are ways to make anything overcomplicated.
RabbitMQ is more expensive on AWS than e.g. SNS/SQS. It’s not a coincidence, you’re trading lock-in for a cheaper price.
The increased complexity comes from the fact you will need some components which exist in either managed, but vendor lock-in form, or you need to spin them up / managed yourself.
Right, paying for managed services whether cloud native or not is pretty much the same thing, it hurts in the pocket. Spinning up your own RabbitMQ on a VM is both cheap and cloud agnostic, especially if sized right.
I didn’t look at the username, so this came across as an underserved Orwell-referencing insult. Lol
Accusing him of being O’Brian or something.
One time I rewrote an Azure function to make it slightly more efficient. The cost savings were ~$50k /yr. Cloud services have their place but it is amazing how quickly the costs can spiral out of control.
Nah, unless you have a super popular app or suck down bandwidth I rarely find the costs of AWS exceed the labor of a good sysadmin.
Plus, if the cloud service goes down, you don’t need to worry about your service being out as well
😬
Didn’t you know, anyone that stays at a company more than 18 months is old…
Hey, I just hit 18 months, almost to the day! …but was at the previous job 23 years lol. Good to know I’m back to old timer status.
What the company likes about the old timer is that because he has been there for 10 years, he will likely be there for the next 10 years to support the complicated system he is creating now. If a younger team member creates something using a modern approach, there is the risk they will leave in a years time and no one knows how the system works.
No one knows how to use a well documented, publicly available service? No, I’d argue that no one knows how to use a private, internal only, custom solution.
That because you’re an engineer (I assume). The people signing off on these kinds of projects don’t know enough themselves, so they go to someone they trust (the old timers) to help them make the decision. The old timers don’t keep up with new tech, so we keep reinventing the wheel.
“keeping up with new tech” is often just re-inventing the wheel. If it isn’t broke, and can still be maintained, then why break it because you like the flavor of the week?
In this case, we’re talking about the OPs example of someone implementing a complex message passing architecture in Java instead of using an off the shelf solution. There are devs with 20+ years at the same company who don’t know the basics of networking/cloud, because they haven’t improved their technical skills much in those 20 years and instead focused on corporate politics. Those are the people who tend to gets asked for advice from upper management.
So he’ll rip an even bigger hole, when he is retiring because the company never bothered to get a new solution running. Then they get a hydra of legacy code that is poorly documented and probably using some old hacks based on even older forum posts, nowhere to be found again.
Oh god. I do a lot of PowerShell scripting at my place, and less than half my team is proficient in it. My co-workers who are almost never write comments in their scripts. Meanwhile, if it’s anything that will live longer than ~5 manual runs, I spemd more time on comments and documentation than scripting.
That effort is valued, but I’m shocked that my team isn’t more aware of the need for documentation. We literally experienced the “bus factor” situation a few years ago.
We have someone at my company who has been here for 30 years that gets to do whatever he wants basically - but what he builds is great. He doesn’t even have a CS degree or anything related, he started as a paralegal who wanted to make his life easier, and has built several iterations of the software that the entire company uses. He’s now my boss, running the data engineering and science department and I gotta say that he’s genuinely great. The only bad things I’ve run across that he’s built are things that he explicitly told management were meant to be just a quick bandaid fix to a problem to buy time for a full fledged solution… and they kept it as the full fledged solution. The stuff still works, it’s just awful to make updates or change to
There is nothing more permanent than a temporary solution.
That describes two out of the four jobs I’ve had in my career lol
It pays the bills though so what can I say, the tech I actually want to work with is what personal projects are for lol
That is probably what used to be required. Have you told him about the message queue?
old timer = being paid little
That doesn’t mean he or his fuck ups are free.
A bad architecture means slower development, more bugs, less reliability. All of which cost money.
Quick and dirty as they like to say