Most communities on other instances (especially lemmy.ml, but most other instances seem affected) seem still desperately empty and dead when viewed from here, despite vibrant activity as seen directly from there.
Are there still big issues with new posts not arriving, and/or posts getting lost during the version desynchronisation?
FYI, this is due to a confluence of issues.
- We are the largest instance with the highest active user count - and by a good margin.
- We are dealing with an immature software base that was not designed to handle this load. For example, the way the ActivityPub federation queues are handled is not conducive to high-volume requests. Failed messages stay in the queue for 60 seconds before they retry once, and if that attempt fails they sit in the queue for one hour before retrying again. These queued messages sit in memory the whole time. It’s not great, and there isn’t much we can currently do to change this, other than to manually defederate from ‘dead’ servers in order to drop the number of items stuck in the queue that are never going to get a response. Not an elegant solution by any means, and one we will go back and address when future tools are in place, but we have seen significant improvement because of this.
- We have attempted contacting the Lemmy devs for some insight/assistance with this, but have not heard back yet. Much of this is in their hands.
- We were able to confirm receipt of our federation messages (from lemmy.world) with other instance admins at lemm.ee and discuss.as200950.com. As such we do know that federation is working at least to some degree, but it obviously still needs some work. As mentioned above, we have reached out to the Lemmy devs, who are the instance owners of lemmy.ml, to collaborate. I cannot confirm whether they are receiving our federation messages at this time. Hopefully in coming Lemmy releases this becomes easier to analyze without needing direct server access to both instances’ servers.
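To make the queue behaviour described above concrete, here is a minimal Python sketch of that retry schedule (retry once after 60 seconds, then wait a full hour, with the message held in memory throughout). This is not Lemmy's actual code; all names are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical illustration of the retry behaviour described above;
# not Lemmy's real implementation.
FIRST_RETRY_DELAY = 60        # seconds before the single quick retry
SECOND_RETRY_DELAY = 60 * 60  # one hour before any later attempt

@dataclass
class QueuedActivity:
    payload: dict
    attempts: int = 0
    next_attempt_at: float = 0.0  # epoch seconds

def schedule_retry(msg: QueuedActivity, now: float) -> QueuedActivity:
    """After a failed delivery, push the next attempt out: 60 s after the
    first failure, a full hour after each later one. The message object
    stays in the in-memory queue the whole time, which is why dead peers
    accumulate memory until they are defederated."""
    msg.attempts += 1
    delay = FIRST_RETRY_DELAY if msg.attempts == 1 else SECOND_RETRY_DELAY
    msg.next_attempt_at = now + delay
    return msg
```

You can see from the one-hour delay why thousands of messages destined for a dead server pile up: each one lingers in memory for at least an hour between attempts that will never succeed.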
As you can see, we are trying to juggle several different parameters here to try and provide the best experience we can, with the tools we have at our disposal. You may consider raising an issue on their GitHub about this to try to get more visibility to them from affected users.
I feel sorry for the admins. This must be stressful to deal with.
I appreciate the concern, but we are doing this because we enjoy doing it and our skill sets allow us to contribute. These are exciting times, even with the issues we are working through :) No worries.
Dunno if it is related, but everything is just worse since the upgrade than it was before. Now everything takes forever to load, responses take longer, and pictures don’t load correctly. So for me this was a downgrade in my experience here, or what the Germans call a “Verschlimmbesserung” (making it worse by trying to improve).
I bet it’s just the dramatic uptick in traffic. It takes quite a bit more money and effort to make a highly available and performant service, and most of these are side projects paid for out of a normal person’s income, which can’t compete with a corporation’s budget.
I think (hope) the slowdowns are less upgrade related and more massive uptick in activity related. But maybe I’m wrong.
that makes sense
The way I see it, and maybe I’ve got this wrong, is that federation works in an outbound fashion: every instance sends a message to all its federated peers saying “hey, this thing just happened on my side”.
But if an instance goes down and it becomes “incompatible” like what happened to us for the past week or so, I feel that everything that happened in the meantime got irremediably lost.
I think there should be a contingency plan for when an instance goes down. Something proactive along the lines of “hey, I was down, I haven’t heard from you for xxxx seconds, could you send me an update?” Depending on the activity and the timespan, that could trigger massive amounts of data being sent, and it has the potential to be abused, so safeguards would need to be put in place. But maybe that could be a solution: keep the real-time exchanges as the primary method, but do a full sync once in a while.
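The catch-up idea above could look something like this Python sketch. To be clear, no such endpoint exists in Lemmy today; every name, threshold, and parameter here is hypothetical, and the only point is the safeguard logic: only request a backfill after a long silence, and rate-limit how often any one peer can be asked.

```python
# Hypothetical sketch of the proposed "full sync" contingency; no such
# mechanism exists in Lemmy. All names and thresholds are invented.
MIN_SYNC_INTERVAL = 6 * 60 * 60  # safeguard: at most one full sync per peer per 6 h

class CatchUpClient:
    def __init__(self) -> None:
        self.last_sync_at: dict[str, float] = {}  # peer -> epoch seconds

    def should_request_sync(self, peer: str, last_heard_at: float,
                            now: float, silence_threshold: float = 3600) -> bool:
        """Ask a peer for a backfill only if it has been silent longer than
        silence_threshold AND we haven't already asked it recently
        (the abuse safeguard mentioned above)."""
        if now - last_heard_at < silence_threshold:
            return False  # real-time federation is still flowing; no sync needed
        if now - self.last_sync_at.get(peer, 0.0) < MIN_SYNC_INTERVAL:
            return False  # don't hammer the peer with repeated sync requests
        self.last_sync_at[peer] = now
        return True
```

The rate limit matters because, as noted above, a naive version of this would let a malicious instance repeatedly request huge backfills from a busy peer.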
The same thing has been happening with kbin. Anything crossed with a kbin instance is dramatically late. Like hours or more.