Lemmy Federation Architecture Change Proposal (github.com)

submitted 2 years ago by xtremeownage@lemmyonline.com to c/technology@beehaw.org


https://github.com/LemmyNet/lemmy/issues/3245

I posted far more details on the issue than I am putting here-

But, just to bring some math in- with the current full-mesh federation model, assuming 10,000 instances-

That will require nearly 50 million connections.

Each comment. Each vote. Each post, will have to be sent 50 million separate times.

In the proposed hub-spoke model, we can reduce that by over 99%, so that each post/vote/comment/etc. only has to be sent 10,000 times (plus n*(n-1)/2 times, where n = number of hub servers).
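A quick sketch of that arithmetic, assuming 10 hub servers (a number picked purely for illustration; the function names are made up too). It only counts links between instances using the two formulas above.

```python
def full_mesh_links(n: int) -> int:
    """Number of distinct instance pairs in a full mesh of n instances."""
    return n * (n - 1) // 2


def hub_spoke_links(instances: int, hubs: int) -> int:
    """One link per instance to a hub, plus a full mesh among the hubs."""
    return instances + full_mesh_links(hubs)


if __name__ == "__main__":
    n = 10_000
    mesh = full_mesh_links(n)
    hub = hub_spoke_links(n, hubs=10)  # 10 hubs is an assumed figure
    print(mesh)  # 49995000
    print(hub)   # 10045
    print(f"reduction: {1 - hub / mesh:.2%}")
```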

The current full mesh architecture will not scale. I predict, exponential growth will continue to occur.

Let's work on a solution to this problem together.


[-] bdonvr@thelemmy.club 20 points 2 years ago

But, just to bring some math in- with the current full-mesh federation model, assuming 10,000 instances-

That will require nearly 50 million connections.

Each comment. Each vote. Each post, will have to be sent 50 million separate times.

Well your whole premise is just utterly wrong.

The way federation actually works:

A user on lemmy.ml subscribes to a community on lemmy.world. Say, !funny@lemmy.world

Assume that this user is the first lemmy.ml user to do so - basically what happens is the lemmy.world community sees that a member of a never before seen instance just subscribed. !funny@lemmy.world then adds lemmy.ml to its list of instances it needs to tell whenever something happens in the community.

No matter how many users of lemmy.ml subscribe, this only happens once.

Now when a user of sh.itjust.works upvotes a post on !funny@lemmy.world, the sh.itjust.works instance then tells !funny@lemmy.world of this change. It accepts the change, then tells everyone on its list of instances that have subscribers on them.

So essentially, sh.itjust.works talks to lemmy.world, lemmy.world tells everyone else. There is no "full mesh". The instance hosting the community is the "hub", everything else is a spoke.

So if there's 10,000 instances, and they all just so happen to have at least one subscriber to some community, each change will be sent out 9,999 times. Your "50 million" premise is just completely wrong and I'm not sure where it's coming from.
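The flow described above can be sketched as a toy model. This is not Lemmy's actual code: the class and function names are invented, and it assumes the host does not echo an action back to the instance it came from.

```python
from collections import defaultdict


class CommunityHost:
    """Toy model of the instance that hosts a community (hypothetical names)."""

    def __init__(self, host: str) -> None:
        self.host = host
        # community name -> remote instances with at least one subscriber
        self.followers: dict[str, set[str]] = defaultdict(set)

    def subscribe(self, community: str, user_instance: str) -> None:
        # A set makes repeat subscriptions from the same instance a no-op.
        if user_instance != self.host:
            self.followers[community].add(user_instance)

    def receive_action(self, community: str, origin: str) -> list[str]:
        """Return the instances the host forwards the action to."""
        # Assumption: the origin already has the action, so it is skipped.
        return sorted(self.followers[community] - {origin})


def deliveries(host: CommunityHost, community: str, origin: str) -> int:
    """Total messages for one action: origin -> host, then host -> followers."""
    inbound = 0 if origin == host.host else 1
    return inbound + len(host.receive_action(community, origin))
```

With 9,999 remote instances subscribed to one community, `deliveries` comes out at 9,999 for an action from any one of them: one message in, 9,998 out.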

[-] xtremeownage@lemmyonline.com 6 points 2 years ago* (last edited 2 years ago)

It's not wrong- we just have opposite ideas here-

The 50 million, is based on the formula for a full-mesh network. Where all instances talk to each other. In the case of lemmy, this would be an absolute worst-case scenario, where every instance, is subscribed to a community on every other instance.

In your example of only 10,000 messages, you are assuming that of the 10,000 instances in existence, they are ONLY looking at a single community, on a single server.

Let's say, those 10,000 instances all decide to look at a community on another server. Now you have 20,000 connections.

Let's add another community, hosted on yet another instance. That is 30,000 connections.

TLDR;

My example, is based on worst-case scenario. (A pretty unachievable one at that!)

Your example, is based on best-case scenario.

Realistically, the actual outcome would be somewhere much closer to best-case scenario (as communities seem to lump up on the big servers). However, for planning architecture, you always assume worst-case scenario.

[-] bdonvr@thelemmy.club 21 points 2 years ago

No - you said:

Each comment. Each vote. Each post, will have to be sent 50 million separate times.

That won't ever happen. Unless there's 50 million instances. That's not worst case, it's just not a case.

There is no case in the current implementation where any one action is replicated more times than there are total instances.

And it doesn't matter what "model" you assume, each action will have to federate to each instance eventually. That count is minimally, the total number of instances.

Let's say, those 10,000 instances all decide to look at a community on another server. Now you have 20,000 connections.

Looking does nothing, each instance hosts essentially a copy of the "host instance" for each community. Only interactions (comments, likes, posts, etc) are federated.

[-] xtremeownage@lemmyonline.com 5 points 2 years ago

for fucks sake, dude, be collaborative, and not defensive. This isn't reddit, I am not out to attack your karma.

If every instance, hosts a community, and Every other instance, subscribes to every one of those communities, that would lead to a full-mesh between all instances, resulting in worst-case scenario, ie, following the formula I provided for a full-mesh topology.

That is indeed, the worst case scenario, I have provided, explained, and documented in my examples.

[-] delcake@lemmy.songsforno.one 27 points 2 years ago

In no way is the person you're responding to speaking defensively. They've discussed the reason why your extrapolation to a full-mesh connective worst-case scenario isn't based in the reality of how ActivityPub functions. But you don't seem to be willing to entertain the notion that the federation of any given action never exceeds the number of instances subscribed to the community that generated it.

Even should every instance subscribe to every community on every other instance, the recipient of a federated action doesn't turn around and rebroadcast that action back on to the network because it is not the authoritative host of that community. Therefore what this discussion is lacking is proof of where this exponential broadcast storm of federated actions comes from in your assertion.
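A small simulation of that worst case, as a sketch only. It assumes every instance hosts exactly one community, every other instance subscribes to it, and recipients never rebroadcast. The function name is made up.

```python
from itertools import permutations


def simulate(n: int) -> tuple[int, int]:
    """Every instance hosts one community; every other instance subscribes.

    Returns (distinct instance pairs that ever exchange traffic,
             largest number of deliveries caused by any single action).
    """
    instances = range(n)
    pairs: set[frozenset[int]] = set()
    max_deliveries = 0
    for host, origin in permutations(instances, 2):
        # `origin` acts on the community hosted by `host`.
        sends = [(origin, host)]
        # The host forwards once to each other subscriber; nobody rebroadcasts.
        sends += [(host, sub) for sub in instances if sub not in (host, origin)]
        pairs.update(frozenset(s) for s in sends)
        max_deliveries = max(max_deliveries, len(sends))
    return len(pairs), max_deliveries
```

For `simulate(50)` the number of pairs that talk to each other is 50*49/2 = 1225, which is the full-mesh figure, while no single action is delivered more than 49 times.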

[-] King@vlemmy.net 11 points 2 years ago

Yes, it is a "full mesh" diagram. But for each specific "federated" action, it is a simple hub and spoke distribution. The hosting server will send the federated action to each subscribed node. The nodes don't need to check in with each other for that specific action.

I too believe that Federation is going to have scaling issues. But not due to full mesh.

[-] xtremeownage@lemmyonline.com 3 points 2 years ago

I am onboard with you there-

But would you not agree that delegating and offloading those federation actions to a dedicated pool of servers would assist scalability?

That way- each instance doesn't need to maintain all of the connections?

[-] King@vlemmy.net 5 points 2 years ago

There is no need to "maintain all of the connections". The server opens a connection, sends the data, then closes the connection.

[-] xtremeownage@lemmyonline.com 1 points 2 years ago

I realize that....

Let's- set the record straight here.

Do you think the current implementation of federation works well?

[-] Fauxreigner@beehaw.org 6 points 2 years ago

Federation isn't working well, but it's not working well because the big instances aren't able to keep up with all of the inbound/outbound messages, and if a message fails, that's it. Right now there's no automated way to resync and catch up on missed activity.

[-] xtremeownage@lemmyonline.com 1 points 2 years ago* (last edited 2 years ago)

So- what if, we can delegate a proxy/hub server, for managing all of the inbound/outbound messages, to offload that from the main instance server.

ie, main instance sends/receives its messages through the proxy/hub server, the proxy/hub server then follows a pub/sub topology for sending and receiving.

(Don't imagine a centralized hub server, but, just imagine a localized proxy/hub server for your particular instance. Let's also assume it's designed so that you can support multiple hub/proxy servers, in the event one gets overloaded)

[-] Fauxreigner@beehaw.org 2 points 2 years ago

That doesn't do anything to fix the problem. If a server can only handle 5k updates per minute (a completely made up number), it doesn't matter if those 5k updates come from one server or a thousand. In theory you could cut down on outbound messages a bit if you could tell a "hub server" that post #123456 got another upvote, so please tell instances A, B, C, D, and E. But the total number of messages would increase, so even if the hub instance can handle more updates, it may eventually hit capacity again.
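That counting argument, sketched roughly. The 5,000-per-minute capacity is the made-up number from above, and the function names are invented for illustration.

```python
import math


def messages_sent_by_origin(subscribers: int, via_hub: bool) -> int:
    """Outbound messages the origin instance itself sends for one update."""
    return 1 if via_hub else subscribers


def total_messages(subscribers: int, via_hub: bool) -> int:
    """All messages on the network for one update, including origin -> hub."""
    return subscribers + 1 if via_hub else subscribers


def minutes_to_process(updates: int, capacity_per_minute: int = 5_000) -> int:
    """Time to work through a backlog; the number of senders does not appear."""
    return math.ceil(updates / capacity_per_minute)
```

For instances A through E, the origin sends 1 message instead of 5, but the network carries 6 instead of 5, and a receiver limited to 5,000 updates per minute needs the same time for a given backlog however many servers sent it.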

The core of the problem is that if an instance doesn't process an update (inbound or outbound), it doesn't ever retry, the instances are just out of sync for that post forever.

[-] xtremeownage@lemmyonline.com 1 points 2 years ago

The core of the problem is that if an instance doesn’t process an update (inbound or outbound), it doesn’t ever retry, the instances are just out of sync for that post forever.

With the pub/sub method- that should be able to be minimized.

At least, with my experience of messing with rabbitmq- A message stays in the queue, until I have told rabbitMQ, Hey, I have processed this message.

If I accept a message, and encounter an exception mid-way through, that message returns back to the queue, until it has been processed, or dead-letter logic handles it.

Granted, there is a hard-coded timeout somewhere in lemmy, where, older messages cannot be processed. That would need to be adjusted.
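A minimal in-memory sketch of that acknowledge-or-requeue behaviour. It does not use RabbitMQ or any client library; `AckQueue` and `max_attempts` are made-up names, and a real broker handles redelivery and dead-lettering through its own configuration.

```python
from collections import deque
from typing import Callable


class AckQueue:
    """In-memory stand-in for a broker queue with acknowledgements."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.pending: deque[tuple[str, int]] = deque()  # (message, failed attempts)
        self.dead_letter: list[str] = []
        self.max_attempts = max_attempts

    def publish(self, message: str) -> None:
        self.pending.append((message, 0))

    def consume(self, handler: Callable[[str], None]) -> None:
        """Deliver until the queue is empty; a message leaves only on success."""
        while self.pending:
            message, attempts = self.pending.popleft()
            try:
                handler(message)  # returning normally counts as the ack
            except Exception:
                attempts += 1
                if attempts >= self.max_attempts:
                    # Give up and park the message for manual inspection.
                    self.dead_letter.append(message)
                else:
                    # Put it back at the end of the queue for another try.
                    self.pending.append((message, attempts))
```

A handler that fails once on a message and then succeeds will still see that message processed; one that always fails ends up in `dead_letter` after `max_attempts` tries.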

[-] Fauxreigner@beehaw.org 2 points 2 years ago

If you ensure that all messages are queued until processed, with retries on failure, what's the point of the hub model? As pointed out elsewhere, the large instances would be acting as hubs already.

[-] xtremeownage@lemmyonline.com 1 points 2 years ago

Just removing that load from the main instance server, allowing it to just handle serving its local user-base.

In short- splitting the load into multiple components, rather than everything being handled by just the single instance server.

[-] Fauxreigner@beehaw.org 1 points 2 years ago

I'm just not seeing a benefit here, I think this is a solution to the wrong problem. Your proposal in theory cuts outbound updates from the big hubs, but in reality they're only updating a subset of other instances for any given update, and it doesn't do anything to help with inbound updates. And to do that, you have to solve a pretty tricky problem.

If my instance gets an update from Beehaw, I can validate that they're allowed to do so, because Beehaw has a TLS certificate that says "Yep, this is actually Beehaw." If you introduce a hub system, I need some way to determine that the hub system that's telling me "Beehaw has an update for you" is allowed to send updates on behalf of Beehaw.
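One hypothetical shape for that check, as a sketch only. It assumes each origin somehow publishes the set of hubs it authorises, which is an invented mechanism for illustration. Obtaining that mapping in a trustworthy way is exactly the tricky part, and this code does not solve it.

```python
def may_deliver_for(origin: str, sender: str, delegates: dict[str, set[str]]) -> bool:
    """Accept an update for `origin` only from origin itself or a hub it named.

    `delegates` maps an origin host to the hub hosts it has authorised
    (hypothetical data; how it is obtained and verified is out of scope).
    """
    return sender == origin or sender in delegates.get(origin, set())


# Example: Beehaw has authorised one hub (host names are placeholders).
delegates = {"beehaw.org": {"hub1.example"}}
print(may_deliver_for("beehaw.org", "hub1.example", delegates))
print(may_deliver_for("beehaw.org", "other.example", delegates))
```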

[-] xtremeownage@lemmyonline.com 1 points 2 years ago

To clarify-

After feedback/comments, I have modified the idea- this would be an optional local proxy/hub/delegation server/service, hosted by the instance owners.

https://github.com/LemmyNet/lemmy/issues/3245#issuecomment-1601585922

Ie- you can optionally scale your federation updates, independent of your main application server.

[-] cyd@vlemmy.net 1 points 2 years ago

How was syncing done in Usenet? It has a very similar decentralized model, and I don't recall there being problems of data loss due to desyncing between servers.

[-] King@vlemmy.net 2 points 2 years ago

I believe the current implementation won't scale because instances won't be able to handle every subscribed federated action. Having a hub server doesn't reduce the number of subscribed federated actions, only whom they come from.

[-] xtremeownage@lemmyonline.com 0 points 2 years ago

But- if we take that action of handling the federations, and separate it from the main application server (allowing the main instance server to focus on handling its local user-base), and architect it in a way that allows scaling the number of proxy servers up and down-

Would that not sound like a big improvement to scalability?

[-] King@vlemmy.net 2 points 2 years ago

The node still needs to receive every subscribed federated action and insert it into the local database. This has to be local to the "main application server". Your proxy servers don't reduce the number of federated actions. It only reduces the number of servers needed to communicate with.

I feel that the bottleneck will be the total number of federated actions, not which servers deliver them.

[-] bdonvr@thelemmy.club 10 points 2 years ago* (last edited 2 years ago)

Apologies if I came off as hostile.

I mean I get what you're saying - I just don't see the practical use. The centralized hub replication servers would have to basically foot a huge bill for the fediverse, and do so silently and invisibly to the end user. As it is, most instances run on goodwill or donations. A silent, invisible server is hard to gather donations for. Who would run them?

Furthermore the topology you propose is essentially what we already have. A few large instances hold most of the largest communities. I don't see that changing. This brings a fairly good balance - smaller instances pretty much only have to listen for updates from a few other instances, only the big instances are doing the hard work of notifying hundreds of others. They are already our "hubs". Small instances do hardly any hard work, the one I run for example just listens to maybe a dozen instances send updates, and occasionally sends out an update when one of my users interacts.

I suppose I just don't understand how this could be implemented in practice- or rather how it could be useful to do so. It would strictly enforce a sort of centralization that right now is only a natural consequence of user behavior, while seemingly only bringing theoretical benefits unlikely to be realized.

[-] xtremeownage@lemmyonline.com 2 points 2 years ago* (last edited 2 years ago)

The centralized hub replication servers would have to basically foot a huge bill for the fediverse, and do so silently and invisibly to the end user.

One consideration, since they are only having to basically sub/pub - the load actually might be drastically lower than expected.

Furthermore the topology you propose is essentially what we already have. A few large instances hold most of the largest communities. I don’t see that changing.

I suppose that is a valid point. The issue though- those large instances are unable to keep up with demand and load, causing lots of federation issues.

Perhaps, my idea actually wouldn't help that at all, but, using lemmy.ml as an example-

Instead of it having to send all of its updates out to every server subscribed- it can delegate that to a hub server to do it. The hub server can run a very minimal set of instructions, with enough intelligence to handle sub/pub.

Perhaps- one idea is, instead of thinking of it as a hub-server, think of it as a proxy server. Being able to delegate your instance's actions to the proxy server to reduce that load from the main server.

And, instead of the hubs/proxies being more centralized, perhaps, it's just an optional thing which you CAN do.

My line of thinking, is methods to reduce load from the main servers. This might be an idea that only benefits the handful of big servers.

To also further clarify- I DON'T have a solution to the problem. I am only intending to establish a forum to discuss if this is even a viable option, or perhaps, think of other ways to spread around the load.

[-] monobot@lemmy.ml 1 points 2 years ago

I am not certain about the scenarios you were mentioning above, but I do agree that separating the software into instance plus hub/proxy/message queue could help with handling load.

How can we scale our big instances? I don't know, maybe it is easy to put an instance on multiple servers, but it sounds to me like they are just buying a bigger one, and that will fill up fast if growth continues to happen.

I would like to hear from developers what they think, but thank you for starting conversation about scaling.

[-] russjr08@outpost.zeuslink.net 1 points 2 years ago

The issue though- those large instances are unable to keep up with demand and load, causing lots of federation issues.

I am probably missing something / being really oblivious (it's been a long day...) but wouldn't this same problem occur to the hub server in your model?

Although thinking about it a bit more, I thought I recalled seeing one of the Lemmy devs mention that the biggest issue is the SQL queries that are run for various actions (such as loading the front page) - if that is the case, I don't know if this idea would help with that.

The idea of a centralized hub server(s) also sounds like we'd be moving closer to the model of a centralized Reddit... But I guess in a way, the fact that larger instances exist in and of itself poses the same issue?

... I'm probably just rambling to myself at this point, however, I do think a message queue type of system for federating events would be a good idea, for the sake of recovering from send failures.

this post was submitted on 21 Jun 2023

43 points (100.0% liked)
