Lemmy Devs: could you shed light on the range of scalability issues we're about to see with the reddit influx?

teoria · edited 2 years ago


@[email protected] · 2 years ago

There are a few main problems with performance which we are currently fixing:

Websocket is inefficient. We are removing it and switching to an HTTP API. https://github.com/LemmyNet/lemmy-ui/pull/1081
There are some slow SQL queries which we need to improve. https://github.com/LemmyNet/lemmy/issues/2877
There is no caching. This PR adds it, but will only work after rewriting to HTTP. https://github.com/LemmyNet/lemmy-ansible/pull/75

Federation isn't causing any performance problems; it's all due to database reads from local users.

@PriorProject · edited 2 years ago

You may or may not get Lemmy devs weighing in here (Edit: Nutomic did respond). It’s a VERY busy time for them, and they’re probably focused on fixing imminent scaling issues rather than explaining them to newcomers like us. But to provide some context from another newcomer who is trying to pay attention:

1. Lemmy was very small until very recently. The biggest instance is lemmy.ml, which according to the stats on its homepage has ~30k registered users and ~2k active (which is probably a high water mark… in previous days when I looked it was more like 1k).
2. As a result of (1), it’s a fair bet that there are some serious inefficiencies in the codebase that just never mattered before now. It will take some time to unwind these. A good example of this is in https://github.com/LemmyNet/lemmy/issues/2877, where you can see Lemmy devs and Lemmy.ml admins cooperating with a Postgres expert who is helping them find some low-hanging performance fruit, and the Lemmy team is getting a chance to ask some performance-related questions they’ve never been able to get access to an expert for. There’s probably a lot more work like this to do in order to scale Lemmy to work well with instances 10x bigger and beyond.
3. There may be distributed/federated performance issues as the network grows as well, but Lemmy uses ActivityPub like Mastodon, which already has a much bigger network. I’m inclined to think they’ll be ok in this regard, but you never know… it’s possible they’re abusing the protocol in some way that will need to be fixed to scale to bigger networks of federated servers.
4. In terms of hardware, lemmy.ml runs on a very modest 8-core VM from OVH: https://lemmy.world/comment/1350. Obviously there’s a LOT more that could be done to get more capacity powering lemmy.ml. Much bigger single servers exist, though not in the lineup of VM offerings from their current provider, which means there are no more “easy” upgrades available to them where they let the cloud provider do the migration work. I tried to break down infra upgrade possibilities in https://lemmy.world/comment/3583. In short, it would be straightforward to expand a Lemmy install to 5-10 machines if you were serious about it. But due to (1) and (2), it’s probably not productive to do so. Algorithmic inefficiencies in the codebase would probably swamp any amount of hardware somewhere between 1.5x and 5x the user/post/comment counts of what lemmy.ml runs today.
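
To make the hardware point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the work a server has to do grows as the load raised to some exponent. The function name and the exponents are made up for this example; nothing here is measured from Lemmy.

```python
def supported_load_multiplier(hardware_factor: float, cost_exponent: float) -> float:
    """Return how much more load fits on `hardware_factor` times the capacity,
    assuming the work to be done grows as load ** cost_exponent.

    Both parameters are illustrative assumptions, not Lemmy measurements.
    """
    if hardware_factor <= 0 or cost_exponent <= 0:
        raise ValueError("both arguments must be positive")
    # capacity ~ load ** exponent  =>  load ~ capacity ** (1 / exponent)
    return hardware_factor ** (1.0 / cost_exponent)


if __name__ == "__main__":
    for machines in (5, 10):
        for exponent in (1.0, 2.0):
            growth = supported_load_multiplier(machines, exponent)
            print(f"{machines}x hardware, cost ~ n^{exponent:g}: {growth:.1f}x load")
```

If some hot path really did scale quadratically, 5 to 10 times the hardware would buy only about 2.2 to 3.2 times the load, which is why fixing the inefficiencies first looks like the better investment. With linear cost the extra hardware would pay off in full.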

There’s a lot of speculation in this comment. I haven’t run or perf-tested a sizeable Lemmy instance. I’m not familiar with the codebase. But I am a software engineer and I know a lot about scaling infra, software, and teams… and the above feel like reasonably informed guesses and speculation in the absence of disagreement from someone more informed than I.

RoundSparrow · edited 2 years ago

Is lemmy scalable?

Preface: I’m new to the project and still trying to install my first server, and the instructions so far are leading into problems. I’ve been a professional technical editor since the early 1990s, so I’m trying to document the problems over on https://lemmy.ml/post/1160483 about a from-scratch Ubuntu 22.04.2 LTS install.

Based on what I have seen, I have big concerns about scale. I think using a SQL database for a messaging system like this, with all these granular real-time data elements like vote counts and dynamic sorting orders, is going to have serious scaling problems without some concept of caching results in a simple format (like a plain old disk file or NoSQL database) before returning results to the webapp clients or API. I am struggling with my human languages lately, but I did start to raise some topics over on community [email protected]

I understand the desire for a real-time database that has exact up-to-date information on every vote count and every single latest comment, but I think server operators are going to run into big problems with Lemmy code as it works now once their data starts to get into the thousands of postings per day, tens of thousands of user profiles being looked up to render those postings, and millions of comments. When there isn’t enough RAM for those SQL index keys, things will dramatically change in performance.
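
As a minimal sketch of the kind of result caching described above (not how Lemmy does or will do it), the Python below serves a slightly stale copy of an expensive result instead of recomputing it on every request. The class name, the 5-second lifetime, and `load_front_page` are all hypothetical stand-ins.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache: a value is reused until it is ttl_seconds old."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]              # fresh enough: skip the expensive call
        value = compute()                # missing or stale: hit the backing store
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    calls = 0

    def load_front_page():
        # Hypothetical stand-in for an expensive sorted-listing query.
        global calls
        calls += 1
        return ["post-1", "post-2"]

    cache = TTLCache(ttl_seconds=5.0)
    for _ in range(1000):
        cache.get_or_compute("front_page", load_front_page)
    print(f"backing store was queried {calls} time(s) for 1000 requests")
```

The trade-off is exactly the one raised above: readers may see vote counts or orderings that are a few seconds old, in exchange for far fewer database reads.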

teoria · 2 years ago

Well, I expect we’ll hear soon from the horse’s mouth (/u/nutomic?) or maybe from large instance admins like /u/alyaza or /u/QuentinCallaghan as to what it entails and how we can actually expect that to behave

RoundSparrow · edited 2 years ago

I have some suspicions that the /communities page is doing a live database query to get counts of subscribers and monthly visits (on every hit to the page) and this is putting a heavy load on the SQL database.

Almost every time I hit this page in the past 48 hours I see > 4 second delay before I get the first page of results. I’m also getting internal errors/failures on loading the page. Typical error is a single line of output: “404: FetchError: request to http://lemmy:8536/api/v3/site”
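
For illustration only, here is the difference between counting rows on every page hit and reading a counter that is maintained at write time, sketched in Python with an in-memory SQLite database. The table and column names are invented for this example and are not Lemmy's schema; whether the /communities page actually works either way is not established in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscription (community_id INTEGER, user_id INTEGER);
    CREATE TABLE community_counts (
        community_id INTEGER PRIMARY KEY,
        subscribers  INTEGER NOT NULL DEFAULT 0
    );
""")


def subscribe(community_id: int, user_id: int) -> None:
    """Record a subscription and bump the stored counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO subscription VALUES (?, ?)",
                     (community_id, user_id))
        conn.execute("INSERT OR IGNORE INTO community_counts "
                     "(community_id, subscribers) VALUES (?, 0)",
                     (community_id,))
        conn.execute("UPDATE community_counts SET subscribers = subscribers + 1 "
                     "WHERE community_id = ?", (community_id,))


def live_count(community_id: int) -> int:
    """Count subscription rows on every call (the suspected slow path)."""
    row = conn.execute("SELECT COUNT(*) FROM subscription WHERE community_id = ?",
                       (community_id,)).fetchone()
    return row[0]


def stored_count(community_id: int) -> int:
    """Read the precomputed counter: a single primary-key lookup."""
    row = conn.execute("SELECT subscribers FROM community_counts "
                       "WHERE community_id = ?", (community_id,)).fetchone()
    return row[0] if row else 0


if __name__ == "__main__":
    for user in range(3):
        subscribe(1, user)
    print(live_count(1), stored_count(1))
```

The stored counter moves a little work onto each write so that the listing page only needs a cheap lookup per community.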

@[email protected] · 2 years ago

Lemmy and Lemmy-ui are pretty much stateless and thus horizontally scalable. Pictrs and Postgres take a bit more work.

I’m not a dev though. Just my own experience hosting my own instance.
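
A minimal Python sketch of why stateless services scale out easily: since no request depends on which copy handled the previous one, a balancer can simply rotate through identical backends. The backend names are hypothetical; only the port number 8536 comes from the error message quoted earlier in the thread.

```python
import itertools
from typing import List


class RoundRobinBalancer:
    """Hand out backends in rotation; valid only if any backend can serve any request."""

    def __init__(self, backends: List[str]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))

    def pick(self) -> str:
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical pool of identical Lemmy backend processes.
    balancer = RoundRobinBalancer(["lemmy-1:8536", "lemmy-2:8536", "lemmy-3:8536"])
    for _ in range(5):
        print(balancer.pick())
```

Services that hold their own state, such as the database and the image store, cannot be rotated like this without extra coordination, which is the "bit more work" mentioned above.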

@PriorProject · 2 years ago

Recent versions of pict-rs support object storage. I haven’t tried, and that doesn’t inherently mean that pict-rs scales horizontally, but I’m hopeful that multiple instances could connect to the same blob storage safely? Then one could use minio or just S3 or whatever which do scale horizontally (or cloudily).

Also, I read somewhere that Lemmy uses an older version of pict-rs and that some modest effort is needed to upgrade and get access to the blob storage feature.

@[email protected] · 2 years ago

I’ve tested running multiple instances of pict-rs using distributed storage (Moosefs) for the uploaded file directory. I ran into several weird issues, mainly that images would not load or would end up broken. Refreshing the page and hitting a different pict-rs instance worked for some images but broke others. So now I run a single pict-rs instance, still on distributed storage, and everything works. This is on the 3x branch. I believe the 4x branch is still in rc stage.

@PriorProject · 2 years ago

“I’ve tested running multiple instances of pict-rs using distributed storage [and] ran into several weird issues.”

Well that’s a bummer.

Still, a Lemmy install is composed of at least 4 discrete processes. I have to think that putting lemmy, lemmy-ui, pict-rs (backed by object storage to provide scalable I/O and let the pict-rs host focus on CPU), and PG (optionally with additional read replicas) on separate boxes would result in a hardware platform that outscales the current codebase for a year or two while they clean up perf issues that crop up with large user/post/comment/community counts.
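
The read-replica idea can be sketched as a tiny router in Python: writes go to the primary, plain reads rotate across replicas. This is a deliberately naive illustration with hypothetical host names. A real setup would also have to handle replication lag and statements that start with SELECT but still write or lock rows, and nothing here reflects how Lemmy talks to Postgres.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Send SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary: str, replicas: List[str]):
        self.primary = primary
        # With no replicas configured, every statement goes to the primary.
        self._replicas = itertools.cycle(list(replicas)) if replicas else None

    def route(self, sql: str) -> str:
        words = sql.split()
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    # Hypothetical host names.
    router = ReadWriteRouter("pg-primary", ["pg-replica-1", "pg-replica-2"])
    print(router.route("SELECT * FROM post"))
    print(router.route("INSERT INTO post_like VALUES (1, 2)"))
```

Because the load described earlier in the thread is dominated by reads from local users, this is the part of the database workload that replicas could absorb.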

@[email protected] · 2 years ago

That’s basically what I’m doing. A singleton instance each for Pict-rs and Postgres. Multiple compute nodes for Lemmy and Lemmy-ui being accessed over load balancers. All of the data is placed on distributed storage so I can quickly fail over Pict-rs or Postgres if needed. It’s enough for now, though I don’t think it could handle Reddit levels of traffic.

@[email protected] · 2 years ago

We’ll find out soon I think, especially as communities start to really freak out.