Spawned my own instance of lemmy: now I've got a lot of questions about federation

@[email protected] · 2 years ago

chiisana · 2 years ago

The “Header is expired” issue is a big part of the current federation problem. And whether you know it or not, like it or not, you’ve just made the matter worse. You’re not to blame, though. I’ve done it too, along with many other people self-hosting their own instances.

The way federation currently works, each write action must be federated outward to every federated instance. A comment reply, such as this one, must be federated outward by the hosting instance. An instance receiving a federation event must also discard messages that are older than 10 seconds.

Here lies the problem… popular instances like lemmy.world and lemmy.ml have thousands of users and thousands of federated servers. Yesterday, when I checked, lemmy.world had 3600 users per day and 2200+ federated servers. If there’s a really popular post on a very popular community, and 10% of the users comment on it? The lemmy.world server must send 360 x 2200 = 792,000 (700K+) outbound federation event messages. Each one of these is sent over HTTPS via TCP, so they can’t all be sent at the same time; the messages are put into a queue, and the federation workers send them out. Because HTTPS runs over TCP, it is not fire and forget: each worker must wait for the packets to be acknowledged. If an instance owner gets bored because they’re not getting all the messages and shuts down? Now the worker needs to wait for that connection to error out, thereby delaying messages further down the queue. If it had to wait more than 10 seconds? Everyone down the queue will just get expired messages because the event is already outdated.
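To make those numbers concrete, here is a rough Python sketch of the fan-out and of how one dead peer stalls everything queued behind it. This is not Lemmy code: `fanout`, `simulate_worker`, the single-worker queue, and the 50 ms / 30 s send times are all assumptions made up for illustration, and every message is assumed to have been created at time zero.

```python
EXPIRY_SECONDS = 10.0  # the expiry window described above


def fanout(commenters: int, federated_servers: int) -> int:
    """Outbound messages if every comment is sent to every federated server."""
    return commenters * federated_servers


def simulate_worker(send_times, expiry=EXPIRY_SECONDS):
    """Drain a queue in order with one worker.

    send_times[i] is how long (in seconds) delivery i blocks the worker,
    e.g. waiting for a response or for a dead peer to time out.
    Returns, per message, whether it was delivered inside the expiry window.
    """
    clock = 0.0
    delivered_in_time = []
    for cost in send_times:
        clock += cost  # the worker cannot move on until this send finishes
        delivered_in_time.append(clock <= expiry)
    return delivered_in_time


print(fanout(360, 2200))  # 792000

# Three healthy peers, one dead peer that takes 30 s to time out (assumed),
# then two more healthy peers stuck behind it.
print(simulate_worker([0.05, 0.05, 0.05, 30.0, 0.05, 0.05]))
# [True, True, True, False, False, False]
```

The last two peers are perfectly healthy, but their messages are already expired by the time the worker gets to them.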

So now that you’ve already created an instance and are adding to the load of the network, just like me, what can you do? Keep your server online in a fast data center. Use Cloudflare to reduce latency. That way, at least your server isn’t going to introduce too much latency to other servers down the queue. Hopefully the devs figure out something to make the process better. I’ve already proposed a more scalable notification fleet architecture change on GitHub. Let’s see if they can implement that or change other requirements on the system.

@[email protected] · edit-2 2 years ago

Can you link your proposed change? I am interested

EDIT: Regarding the queue and the 10-second expiration: is this part of the ActivityPub protocol, or was it chosen by Lemmy’s devs when implementing the protocol?

chiisana · 2 years ago

Also, this is my GitHub issue. The proposal would allow even larger servers to exist once the 5-minute extension limit is exhausted.

https://github.com/LemmyNet/lemmy/issues/3230

HTTP_404_NotFound · 2 years ago

As an alternative, I opened up a discussion for a better purposed architecture design.

https://github.com/LemmyNet/lemmy/issues/3245

chiisana · 2 years ago

I believe the 10-second expiration is a Lemmy thing. The ActivityPub protocol doesn’t dictate a 10-second expiry. There is currently a change in activitypub-federation-rust (line 70); I don’t know when that will be released, but it may help.

@[email protected] · 2 years ago

As someone who has just enough knowledge to know how big the task of creating a performant way to propagate updates through the federation is, I really hope there are some smart people working on a solution. That is the biggest advantage reddit has over lemmy: Known and centralized hardware standards. Lemmy needs to find a way to make propagation work when half of all instances are hosted at home on consumer-grade hardware.

z3bra · 2 years ago

Isn’t there a mechanism to remove servers that keep timing out? Or a way to unregister your instance? Otherwise the model could never scale properly, as servers get retired every now and then, even within the same instance.

kopper [they/them] · 2 years ago

This commit changes the timeout to 1 day. I assume 0.18 will ship it, though I haven’t checked.

chiisana · 2 years ago

If there is an option to adjust/disable it, I wasn’t able to find it.

HTTP_404_NotFound · 2 years ago

“And whether you know it or not, like it or not, you’ve just made the matter worse.”

What is making the matter worse is everyone crowding together on lemmy.ml and lemmy.world. This causes those particular servers to be vastly overloaded.

If, say, people created communities ELSEWHERE, the load could be spread around.

I’m not saying the architecture of federation isn’t a problem; it is indeed a huge problem. But in the interim, this can be helped by people spreading out.

@TitanLaGrange · edit-2 2 years ago

“Each one of these is sent over HTTPS via TCP”

Do you happen to know how the server-to-server connections are managed? I’m not too familiar with it, but it seems like HTTP/3 might provide some benefits for server-to-server communication.

Also, regarding queuing federation messages, I’m curious whether packages like Kafka or Pulsar have been considered. They aren’t typically used over HTTP, but it doesn’t seem like it would be too hard to adapt, and the stream retention policy could be set to allow consumers to pick up older records as they have capacity, which would avoid the issue of servers getting out of sync. The consumer would know the queue offset for each stream it was consuming and could pick up records as it has capacity, provided it doesn’t fall so far behind that the records expire. Publishers could provide separate topics for different message types to allow consumers to prioritize activity types (for example, prioritizing receiving replies over up/down votes). Servers could also potentially use cluster replication (MirrorMaker) to handle moving activity records from one server to another (again, HTTP-only would be an issue here), and each server could then consume the federation activity messages locally from its own queue.
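As a rough illustration of the offset idea, here is a tiny in-memory sketch in Python. It does not use the Kafka or Pulsar client APIs at all; `ActivityLog`, `publish`, `poll`, and `expire` are hypothetical names, and the one-hour retention is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ActivityLog:
    """Append-only log with time-based retention (hypothetical, in-memory)."""
    retention_seconds: float
    records: list = field(default_factory=list)  # (offset, timestamp, payload)
    next_offset: int = 0

    def publish(self, payload: str, now: float) -> int:
        """Append a record and return the offset it was stored at."""
        offset = self.next_offset
        self.records.append((offset, now, payload))
        self.next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        """Drop records older than the retention window."""
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def poll(self, offset: int, max_records: int):
        """Return up to max_records payloads at or after `offset`,
        plus the offset the consumer should ask for next."""
        batch = [r for r in self.records if r[0] >= offset][:max_records]
        if not batch:
            return [], offset
        return [r[2] for r in batch], batch[-1][0] + 1


log = ActivityLog(retention_seconds=3600)
for i in range(5):
    log.publish(f"comment-{i}", now=float(i))

offset = 0
batch, offset = log.poll(offset, max_records=2)  # a slow consumer takes 2 at a time
print(batch, offset)  # ['comment-0', 'comment-1'] 2
batch, offset = log.poll(offset, max_records=2)
print(batch, offset)  # ['comment-2', 'comment-3'] 4
```

The consumer owns its offset, so a slow instance just catches up later instead of being handed expired messages, as long as it comes back before the retention window drops the records.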

Kafka and Pulsar both have strong scaling support, so adding capacity for federation messages should be fairly straightforward.

I’ve only used Kafka once, and I’m completely unqualified to operate an instance of any complexity, but in general my experience with it was pretty good.

@[email protected] · edit-2 2 years ago

1. Yes, because posts/comments are created there and then federated to subscribed instances, which happens on a queue. Depending on how popular that community is, there are probably a lot of other instances ahead of you in the queue, but lemmy.world is also a huge instance in general, so they might be under a lot of load.
2. Same reason as above: federation is asynchronous and happens on a queue.
3. In theory you could probably write a bot that just auto-subscribes to communities in other instances, but you’re going to have a LOT of traffic going through to your instance and you’ll basically need it to be as powerful as those instances. Sending you only the data you’ve subscribed to receive is how all federated services work; an instance just can’t reasonably get everything. However, I’ve put some thought into this, and I think a way to browse other communities without subscribing to them would be a game changer.
4. This could be a lot of things, unfortunately. Have you tried making new posts?
5. Same on my end, I think lemmy.ml is waaay overloaded at this point and I think it might also be behind CloudFlare which complicates the federation process.
6. Seems like a problem on the kbin end. You might want to create an issue on kbin’s Codeberg.
7. You’ll probably want to post that in one of the Lemmy communities or on its issue list (make sure to search first).

Perhyte · 2 years ago

“Same on my end, I think lemmy.ml is waaay overloaded at this point and I think it might also be behind CloudFlare which complicates the federation process.”

The IP addresses (both IPv4 and IPv6) I’m seeing for lemmy.ml trace to OVH, not Cloudflare.

chiisana · 2 years ago

Cloudflare does not complicate the federation process. Cloudflare proxies requests and filters out bad actors, that’s all. The federation issues we are seeing now are due to the way the ActivityPub protocol is being used. Every write interaction on the host Lemmy instance must be announced to each federated instance. So for example, there were 3.6K daily active users on lemmy.world and 2200+ linked servers when I checked yesterday. If there’s a popular post and 10% of the daily users comment on it, the lemmy.world server needs to send 360 x 2200 = 792,000 (700K+) outbound messages to the federated servers. If a message arrives more than 10 seconds after the action is performed, federated instances toss the message out as expired, so those instances don’t show the comment. This is the crux of the problem. Cloudflare has nothing to do with the issue.

@[email protected] · edit-2 2 years ago

There was a GitHub issue where an instance, when federating, was being sent to the CF “I am not a robot” prompt, which of course meant some federation requests, like subscriptions, simply weren’t going through, so CF actually does complicate federation. Additionally, if a community is pretty active on an instance, it’s going to be sending a lot of requests over and over, which could look like a “bad actor”. Not that it matters for this particular case, since lemmy.ml doesn’t use Cloudflare, as someone else pointed out. Anyway, I just stated CF as a possibility. I didn’t check myself, but the main thing I was pointing out was volume. Comments do seem to be federating eventually, though some are indeed being thrown out, probably due to the issue you pointed out.

chiisana · 2 years ago

I think there’s a lot of misunderstanding because federation is not one-directional.

Please allow me to clarify a couple of things:

The CF bot challenge only appears when activity originates from an IP or network that looks abusive according to behavior analysis across CF’s clients at large.
The CF bot challenge only appears on, and blocks, requests to servers protected by CF; it does not affect incoming messages received by instances.
Federation in the ActivityPub spec communicates activities happening on an instance outward to subscribed instances.

So with the combination of those, the reality is that if you’re self-hosting at home, using a residential IP that is likely to be flagged as abusive due to neighbors on your ISP misbehaving, you will see the bot challenge when you’re browsing. However, because your self-hosted instance is unlikely to have a lot of active communities that people subscribe to, your instance will not be sending the volume of outbound ActivityPub messages that would trigger the CF bot challenge on other instances that use CF.

On the flip side, larger instances with active communities will be sending out a lot of federation messages. As they are generally in data centers with stricter abuse prevention policies than residential ISPs (lemmy.ml is on OVH; lemmy.world is on Hetzner; Beehaw is on DigitalOcean; etc.), the outbound messages are less likely to be flagged should they be posting to another instance behind Cloudflare. In the unlikely event that they do trigger the bot challenge, subscribing instances (e.g. my own lemmy.chiisana.net, which is behind Cloudflare) can whitelist them on the Cloudflare side to let them come through with no problem.

Overall, it is a good idea to have CF in front of any public-facing instance. No matter how powerful the individual instance servers are, we’re in an age where it is very easy to take down a single service. CF is unlikely to introduce complications, and the good it provides significantly outweighs them.

Source: I’ve been running servers since the late 90s, building hugely popular web applications (used by a significant portion of the internet at some point) since the early 00s, and I have worked with CF since their super early days.

@[email protected] · 2 years ago

Thank you for the detailed response! I’ll open the issues for 6 and 7.

Matt · 2 years ago

I can only answer a couple.

This is by design, due to how federation works. Federation is literally just:

Instance A requests information from Instance B
Instance B responds to the request and sends it back
Instance A follows Instance B
Instance B now federates all future posts to Instance A

There’s nothing more complicated to it, but it does mean that instances cannot know about other instances without being told, as there is no central location that instances connect to in order to find out about all other instances.
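The four steps above can be sketched as a toy model in Python. This is only an illustration, not how Lemmy or ActivityPub is actually implemented: the `Instance` class and its `lookup`, `follow`, and `publish` methods are hypothetical, and real federation happens over signed HTTP requests rather than direct method calls.

```python
class Instance:
    """Toy model of a federated instance (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.posts = {}      # community -> list of local posts
        self.followers = {}  # community -> set of follower Instance objects
        self.known = set()   # (remote name, community) pairs we were told about
        self.inbox = []      # (community, post) pairs received via federation

    def lookup(self, remote, community):
        """Steps 1-2: ask the remote instance about a community; it answers."""
        if community in remote.posts:
            self.known.add((remote.name, community))
            return True
        return False

    def follow(self, remote, community):
        """Step 3: follow a community we have already learned about."""
        if (remote.name, community) not in self.known:
            raise LookupError("unknown community; look it up first")
        remote.followers.setdefault(community, set()).add(self)

    def publish(self, community, post):
        """Step 4: store the post and federate it to current followers only."""
        self.posts.setdefault(community, []).append(post)
        for follower in self.followers.get(community, ()):
            follower.inbox.append((community, post))


a = Instance("a.example")
b = Instance("b.example")

b.publish("selfhosted", "old post")  # nobody follows yet, so nothing is sent
a.lookup(b, "selfhosted")
a.follow(b, "selfhosted")
b.publish("selfhosted", "new post")

print(a.inbox)  # [('selfhosted', 'new post')]
```

Instance A never receives the old post in this model, and it cannot follow anything it has not looked up first.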

Only posts made after subscribing are federated to an instance; it doesn’t backfill. An option for instance admins to request a backfill would not be a bad idea, but as time goes on, backfilling an entire community could take too much data on instances.
Issue with Lemmy.ml, although when you see Subscribe Pending, you tend to still see things in your feed.

@PixxlMan · 2 years ago

So if you’re using a new instance all old posts are not visible? Really? Can it not load them on demand? Or am I misunderstanding?

Matt · 2 years ago

Yes, this is correct.

It’s possible to fetch older posts by requesting them directly but there’s no automated way to grab old posts on demand.

Marsta · 2 years ago

Can confirm that all of the above applies to my instance as well

@[email protected] · 2 years ago

Well, this is good. At least we are sure almost everything is good on our side

Marsta · 2 years ago

Hoping someone has found a solution for at least the slow comment updating. Older posts especially (48 hours) don’t seem to sync properly.

Marsta · 2 years ago

The HTTP header error disappeared after I disabled the Cloudflare proxy and added NTP.

@[email protected] · edit-2 2 years ago

I neither removed Cloudflare protection nor added NTP, and I haven’t seen any “Header is expired” error since this morning. But my instance is still out of sync, same as yesterday, as we discussed in other comments.

@[email protected] · edit-2 2 years ago

I can’t even see this post on my selfhosted instance, so that’s fun.

With regard to #6, you apparently have to search for the URL of the Kbin magazine (e.g. https://kbin.social/m/RedditMigration) and then you can subscribe as normal from that point on.

And yeah, the discovery and comments thing is super annoying. There will be completely different conversations based on where you’re viewing the post, which kind of kills the idea of all these instances federating in the first place.

That being said, it appears to be feeds coming from lemmy.world and lemmy.ml that have the majority of the problems with comment/post sync. Unfortunately that’s the selfhosted community.

@[email protected] · 2 years ago

The suggestion for kbin works flawlessly, thanks! I don’t get why Lemmy and kbin instances have a different way to do this.

Anyway, it seems like my instance is more in sync (by an order of magnitude, though I still have to verify it properly) with kbin than with lemmy.world or lemmy.ml.

@[email protected] · 2 years ago

Until 5 minutes ago I was seeing 6 comments. Now Lemmy says there are 11 comments (as I was saying on lemmy.world until minutes ago) but I only count 6 comments by hand 😂

Yep the problem is that we all see different things so we are not having a consistent conversation between us.

I still don’t know, and would like to, how ActivityPub and the federation protocol work and where the fault is (in ActivityPub or in Lemmy’s implementation). As I understand it, everything that happens on lemmy.ml or lemmy.world is sent to all the connected instances over HTTP. This means that if there are 10k connected instances, a like produces 10k HTTP requests, and this is multiplied by every action performed on these servers. Is this right?

@[email protected] · 2 years ago

It sounds like you know more about ActivityPub than I do, but that seems right. My guess is that .ml and .world are so overloaded that there’s just a massive delay coming from them. It is kind of funny that, like you mentioned above, with my Lemmy instance I have almost perfect sync with Kbin servers but problems with Lemmy.

I’m hopeful that this is like Mastodon, where things calm down in a few weeks and there’s a smaller but more sustainable userbase.

chiisana · 2 years ago

You’re close!

An instance must send outbound notifications to every federated server where at least 1 user is subscribed to the community the activity happened on.

Take /c/selfhosted on lemmy.world for example. If of the 2K+ servers only 200 had people subscribed to it, then lemmy.world server will only need to send 200 messages for an upvote. However, given the nature of this community, it’s probably closer to 2000.
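In code, that rule is just a filter over who is subscribed. A minimal sketch, assuming a hypothetical `subscriptions` mapping from instance name to the set of communities its users follow (the instance names are invented; this is not how Lemmy stores it):

```python
def recipients(subscriptions: dict[str, set[str]], community: str) -> list[str]:
    """Instances with at least one user subscribed to `community`."""
    return sorted(
        instance
        for instance, communities in subscriptions.items()
        if community in communities
    )


# Made-up example data: three linked instances, two of which follow selfhosted.
subscriptions = {
    "alpha.example": {"selfhosted", "linux"},
    "beta.example": {"cooking"},
    "gamma.example": {"selfhosted"},
}

targets = recipients(subscriptions, "selfhosted")
print(targets)       # ['alpha.example', 'gamma.example']
print(len(targets))  # 2 outbound messages for one upvote, not 3
```

So the message count per action scales with the number of subscribed instances, which for a popular community is close to the number of linked instances anyway.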

To exacerbate the problem, if people take their server offline, the lemmy.world server must wait until the socket completes or times out, because TCP underpins HTTPS.

To circumvent this, lemmy.world has bumped the federation worker count to 10K… but if every action on this community creates 2K messages, it could get filled up pretty quickly. And that’s just this community, without regard to other communities.
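A back-of-the-envelope version of that, in Python. The 10K workers and 2K messages per action come from the numbers above; the burst of 100 actions, the per-send times, and the idea that workers run in lockstep rounds are simplifying assumptions of mine, and both function names are made up.

```python
EXPIRY_MS = 10_000  # the 10-second expiry, in milliseconds


def actions_to_fill_pool(workers: int, messages_per_action: int) -> int:
    """How many simultaneous actions occupy every worker at once."""
    return workers // messages_per_action


def drain_time_ms(messages: int, workers: int, send_ms: int) -> int:
    """Time to send everything if each worker handles one message per
    round and every send takes send_ms (a simplifying assumption)."""
    rounds = (messages + workers - 1) // workers  # ceiling division
    return rounds * send_ms


print(actions_to_fill_pool(10_000, 2_000))  # 5

burst = 100 * 2_000  # assumed burst: 100 actions at 2K messages each
print(drain_time_ms(burst, 10_000, send_ms=200))    # 4000
print(drain_time_ms(burst, 10_000, send_ms=1_000))  # 20000
print(drain_time_ms(burst, 10_000, send_ms=1_000) > EXPIRY_MS)  # True
```

With fast peers the burst drains comfortably inside the window; once sends average a second each, the tail of the queue is past 10 seconds before it is even attempted.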

There is an increase coming per the link I’ve shared with you in another reply. I’m hoping it will alleviate some pain, but I’d still like to see something more scalable that can support a much larger network.

@[email protected] · 2 years ago

How does kbin implement ActivityPub? Isn’t it better from this point of view? I see that kbin’s actions are synchronized better and faster with my instance.

chiisana · 2 years ago

I don’t know how kbin implements it, as I’ve not seen or reviewed their code. There are several ways around it: perhaps a longer expiry, perhaps signing each message as it is going out, perhaps a better outbound queuing system. It is important for the Lemmy dev team to think through these approaches and implement solutions that make sense for Lemmy and enable better scaling.

@namelivia · 2 years ago

Number 3 was the most important learning I got from setting up my own self-hosted Pleroma instance. Finding new content is super hard in the fediverse if you self-host, in the case of Mastodon/Pleroma I tried federating with some Relays, but still…

@SpecGeo · 2 years ago

Don’t use lemmy.world for now if you want to experience federation. I have a separate account for this instance because of its popularity, and I use my personal instance for other instances except lemmy.ml.

@RedirectedPotato · 2 years ago

I’m only just now learning more about federation, so I’m no expert.

Is your instance set up to use NTP? From my understanding, federation requires fairly accurate time, and shared hosting normally drifts quite a bit, which might be where your “Header is expired” issue comes from. It could also have something to do with your instance not processing incoming federation stuff fast enough, I think?
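Here is a small Python sketch of why drift would matter, as far as I understand it. It assumes the receiver simply compares its own clock with the sender’s timestamp against the 10-second window mentioned elsewhere in this thread; `looks_expired` and the example numbers are made up, and this is not Lemmy’s actual check.

```python
EXPIRY_SECONDS = 10  # expiry window mentioned elsewhere in this thread


def looks_expired(transit_seconds: int, receiver_skew_seconds: int,
                  expiry: int = EXPIRY_SECONDS) -> bool:
    """Would the receiver consider the message expired?

    transit_seconds: real time between sending and receiving.
    receiver_skew_seconds: how far the receiver's clock runs ahead of the
    sender's (negative if it runs behind).
    """
    apparent_age = transit_seconds + receiver_skew_seconds
    return apparent_age > expiry


print(looks_expired(transit_seconds=2, receiver_skew_seconds=0))   # False
print(looks_expired(transit_seconds=2, receiver_skew_seconds=9))   # True
print(looks_expired(transit_seconds=2, receiver_skew_seconds=-9))  # False
```

In the second case the message really arrived in 2 seconds, but a clock running 9 seconds fast makes it look 11 seconds old.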

There is a chat space at https://matrix.to/#/#lemmy-space:matrix.org with people that run their own instances, which might help answer these questions

@marsta · 2 years ago

Thanks for the hint regarding NTP. I configured my VM to use NTP now, but the HTTP error remained.