lemmy.ml is overloaded, use other instances instead

Nutomic@lemmy.ml · edit-2 2 years ago

lemmy.ml is overloaded, use other instances instead

PriorProject · 2 years ago

I think you probably underestimate how far one can get with “vertical” scaling. Here’s the dockerfile: https://raw.githubusercontent.com/LemmyNet/lemmy/release/v0.17/docker/prod/docker-compose.yml

It includes 4 different containers… so there’s a way to scale out to 4 machines right away. Maybe not every container is doing an equal amount of work… but there’s some amount of immediately available machine-splitting.
I’m no expert, but I believe that at least the lemmy and lemmy-ui containers are stateless. If so, they’re horizontally scalable already.
Postgres then would likely be the main bottleneck. But postgres offers read-replicas, so again the write-load and the read-load can be hosted on separate machines. And if there’s enough read-load, you can have many replicas.

Other comments from the admins have shown that lemmy.ml today is running on a single eight-core box and it’s currently hosting 30k registered users and over 1k active. So how much more compute capacity can we throw at “vertical” scaling on the current software architecture?

Just by going to a bigger single box, we can get 128 cores with no problem, a 16x bump in capacity. Does that get us to at least to 300k registered + 10k active?
Splitting the containers onto 4 separate machines. Does that get us 2x more?
Adding PG read-replicas and additional lemmy/lemm-ui containers would allow us to expand our instance footprint to maybe 6 physical machines should get us another 2x or more in performance.

Conservatively, that’s 100x the computing capacity of the current hardware and could potentially support 1m registered users and 50k active. Now, I don’t REALLY expect this to be possible today, there will be many software bottlenecks found along the way to scaling a single instance this large. But my point is that there’s already a medium amount of horizontal scalability built into lemmy, and if the software doesn’t fall over for algorithmic reasons (which is will at first), the current infrastructure architecture allows quite a lot of growth. There’s plenty of time between now and a federation of million user instances to adopt a truly distributed storage backend if needed.

aksdb@feddit.de · 2 years ago

Doesn’t solve the availability issues, though. I know of no seriously hosted system that doesn’t have at least two replicas in different availability zones. I don’t expect any hobby instance to offer any kind of availability guarantee. But if we want to have one or two central instances that the typical reddit user can flock to, this would IMO be essential to have.

Also, in my experience it is FAR cheaper to have a few low to mid range systems for vertical scaling, than to throw a high end machine at it for vertical scaling. If you look the the pricing, the monthly costs for vertical scaling goes up exponentially once you want much more RAM and CPU cores (and storage, and so on).

Being able to scale horizontally solves both issues: hardware is cheaper and reliability is higher.

That lemmy is so damn efficient would then simply mean, that we can achieve excessively good results with low resources, where Reddit would already struggly and needs to put much more machines in place. That would be a nice “business” advantage.

PriorProject · edit-2 2 years ago

Doesn’t solve the availability issues, though. I know of no seriously hosted system that doesn’t have at least two replicas in different availability zones.

I’m not sure why you think the setup I’ve described can’t have coverage in multiple availability zones. If the lemmy and lemmy-ui containers are stateless as I suspect, you can autoscale them. Pictrs is new to me, not sure there… but it appears to support object-storage which would likely make it stateless and the object-storage can replicate to multiple-az’s. Postgres read-replicas can be placed in multiple az’s as well. The only component that presents an issue is the Postgres write-leader, and failovers there can be done in minutes. Many many popular sites run with an infrastructure like this and achieve excellent uptimes.

I do get the power of horizontal scalability, I specialize in distributed databases. But they come at a cost in flexibility relative to something like Postgres… and we’re very far from “needing” horizontally scaling database writes here. Everything else looks like it can be scaled horizontally if someone wants to take on the headache of doing so.

aksdb@feddit.de · 2 years ago

Well, one could try to swap postgres for cockroachdb. But a ticket in github that asked for clustering support was closed with being out of scope. So might be lemmy is not stateless. Haven’t checked the code yet, though.

PriorProject · edit-2 2 years ago

If cockroach is truly PG compatible, lemmy admins can swap it in without developer support. I suspect Cockroach constrains some SQL features and has poor performance on others, but that or AWS Aurora are things you can experiment with without dev support if you’re passionate about the proving out the value of scale-out.

The statement that spawned my response though was this:

I think lemmy will be bitten in the ass by not having considered clustering/horizontal scaling from the start. Federation alone as a scaling mechanism is only feasible for “nerds”. But if the network wants to grow, we will need a few scale-able large hosted instances.

I still don’t think it’s true that we need horizontal scaling to support sufficiently large instances. The amount of vertical and horizontal scaling ability built into Lemmy today is both useful, and likely to outstrip the current ability of its code to scale a single instance. Any algorithms that scale super-linearly with respect to comment-count, post-count, user-count, or community-count, will fail just as hard with distributed backends as they do with an RDBMS. And as you note, PG-compatible distributed systems provide a potential lower-engineering-cost on-ramp to distributed systems once the codebase is efficient-enough to warrant such a transition to scale further. I suspect I’ve contributed everything of use I have to this thread though, and don’t expect to respond further.

aksdb@feddit.de · 2 years ago

Thank you for your thorough explanations and input. It definitely gave me a few things to think about. And if I have some spare time I might even try to spin up lemmy in some local k8s to see how it reacts to being scaled up and down.

iouapizza@infosec.pub · 2 years ago

As someone not versed in DBs and scaling for web architecture, this was a super fun read through, appreciate the comment chains from both users.