Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays. (dbzer0.com)

submitted 8 months ago by db0@lemmy.dbzer0.com to c/div0@lemmy.dbzer0.com

[-] nutomic@lemmy.ml 15 points 8 months ago* (last edited 8 months ago)

As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend and everything else on the same server and would never have run into this particular issue. And any problems you'd have would likely be noticed (and debugged) by many other instances too. Your setup is heavily customized so it is only natural that there are few people who can help with it.

Anyway, it's an interesting journey, thanks for writing down your experience and for improving the documentation!

[-] db0@lemmy.dbzer0.com 28 points 8 months ago* (last edited 8 months ago)

The official instructions do not scale, nor do they work for all situations. But besides that, the problem is not that my bad setup caused a problem. Shit happens and I didn't blame anyone but myself. The problem is that when something breaks, one has to get lucky to get support. I don't even have to prove this. I know for a fact that there are lemmy instances that were decommissioned because they followed the default setup, ran into issues, got no support and gave up.

Edit: Also, man, from one FOSS developer to another: You really have to learn to stop the instinct to say 'it broke because you did it wrong'. I know it feels unfair, but trust me, this is not the way.

[-] nutomic@lemmy.ml 3 points 8 months ago

I'm not saying you did it wrong, it's open source so of course you can use it in any way you like. But some ways have a higher risk of breaking than others.

[-] KairuByte@lemmy.dbzer0.com 11 points 8 months ago

I’m curious how you think “everything on the same box” scales? You can’t load balance, you can’t ensure resources are being used efficiently, you can’t even reboot a machine without the entire thing going dark.

[-] nutomic@lemmy.ml 5 points 8 months ago

Lemmy.ml runs on a single server and is much bigger than db0. Sure you can't get 100% availability this way but no one expects that.

[-] KairuByte@lemmy.dbzer0.com 2 points 8 months ago

Do you have a link to something describing their infrastructure?

[-] kbotc@lemmy.world 9 points 8 months ago

Tossing stuff on the same server is not great, as I don't want to pay for fast storage for my image store, but I do want fast storage for my DB. My web server should have extra CPU and network but is otherwise ephemeral. This is the same stuff people have been running for years and is microservices 101.

The correct thing to do here is build in tracing and profiling hooks. As an example, OpenTracing, so that something like Jaeger can consume the traces and show problems (it would have lit this up like a Christmas tree); Pyroscope can show changes over time in where CPU goes; and logs get shuffled off into Graylog or some other centralized service for correlation.

[-] nutomic@lemmy.ml 1 points 8 months ago

Images can be stored in S3 so that's not an issue. And Lemmy has some tracing logs as well as Prometheus stats, not sure if db0 tried looking into those.
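
As a rough illustration of what reading Prometheus stats involves, here is a minimal Python sketch that parses the Prometheus text exposition format using only the standard library. The URL in the usage comment and the metric names in the sample are hypothetical, not Lemmy's actual endpoint or metric names; check your own instance's configuration for the real ones.

```python
from urllib.request import urlopen


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {series: value}."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and "# HELP" / "# TYPE" comment lines.
            continue
        if "}" in line:
            # Series with labels: everything up to the last "}" is the series.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            # Series without labels: the name ends at the first whitespace.
            parts = line.split(None, 1)
            series = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
        fields = rest.split()
        if not fields:
            continue
        # The first field is the value; an optional timestamp may follow.
        samples[series] = float(fields[0])
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> dict[str, float]:
    """Download a metrics page and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical address: replace with wherever your metrics are exposed.
    for name, value in sorted(fetch_metrics("http://127.0.0.1:9090/metrics").items()):
        print(f"{name} = {value}")
```

Polling this periodically and comparing values over time is the job a Prometheus server normally does for you; the sketch only shows that the raw data is plain text and easy to inspect by hand.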

[-] db0@lemmy.dbzer0.com 6 points 8 months ago

I don't think I've seen mention of these anywhere, or of how to use them.

[-] taaz@biglemmowski.win 1 points 8 months ago* (last edited 8 months ago)

Edit: this comment is not written well, and is not describing the issue I wanted to actually comment on, I am tired and sorry

I will hop on to this to also point out that there actually were people willing to actively help (me included, see the original post on this community) but, to put it bluntly, we were not "invited in on the show". Let me expand on that.

The problem is, as @nutomic@lemmy.ml points out here, we don't have the slightest idea how exactly your infrastructure looks. Without that, there is only the most general stuff we can help with.

From my point of view, joining the Matrix chat later in the process, I watched you do and post stuff without knowing where it came from. I didn't have the full context of what had already been tried and crossed out, or what the current plan was.
You, @db0@lemmy.dbzer0.com, would have to stop chopping and start networking with the people. That is definitely not easy to do effectively, especially if more people join later (and they too have to be updated on the state), but we could have fast-tracked the docker/compilation stuff and ruled lemmy out sooner.

In retrospect, if we had had the full picture of how the infrastructure looks, the chance that someone would go "oh, you have split backend and database servers, check the latency" would definitely have been a lot higher, but we didn't know (hell, I actually assumed your deployment was the same as or close to the lemmy ansible one). I am aware this is easy to say after the solution has been found, but hopefully you get the networking/communication idea.
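
To make the "check the latency" idea concrete, here is a minimal Python sketch, assuming the database is reachable over TCP from the backend host. It times TCP connection setup (roughly one network round trip) and then estimates how many items per second a worker can process if each item needs several sequential database round trips. The hostname and the figure of 10 queries per item are made up for illustration; they are not taken from db0's setup or from Lemmy's code.

```python
import socket
import statistics
import time


def tcp_round_trip_ms(host: str, port: int, samples: int = 5,
                      timeout: float = 2.0) -> float:
    """Median time in milliseconds to open a TCP connection to host:port."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Connecting completes after roughly one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def max_sequential_throughput(round_trip_ms: float, queries_per_item: int) -> float:
    """Upper bound on items/second when each item needs N sequential round trips."""
    return 1000.0 / (round_trip_ms * queries_per_item)


if __name__ == "__main__":
    # Hypothetical host; 5432 is the default PostgreSQL port.
    rtt = tcp_round_trip_ms("db.internal.example", 5432)
    print(f"median round trip: {rtt:.2f} ms")
    # Assumed number of sequential queries per item, for illustration only.
    print(f"at most {max_sequential_throughput(rtt, 10):.1f} items/s per worker")
```

The arithmetic is the point: a round trip that looks harmless on its own gets multiplied by every sequential query, so a worker that keeps up easily with a local database can fall behind once the database moves to another machine.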

[-] db0@lemmy.dbzer0.com 9 points 8 months ago* (last edited 8 months ago)

Wait, hold on, how was help not accepted? I talked with everyone who replied to me and followed every suggestion. If someone asked for infra information, I gave it.

You know, it's really frustrating to open myself up and write about my experiences honestly and then have people try to say that it's actually my fault I didn't ask for help "the right way". What kind of effect do you think this might have on other potential lemmy hosters?

[-] taaz@biglemmowski.win 4 points 8 months ago* (last edited 8 months ago)

I didn't want to devalue your communication. I think I worded my previous comment very badly in that regard, and I am sorry about that. (I also really need to go to sleep so I will be blunt here.)

There is a nuance to internet communication when it comes to asking an OSS community for support, at least speaking from my own experience as someone working in tech.
Getting one or two people to actively bounce ideas off is already a big success - the quality of OSS support is often very spotty across projects, and that's understandable because people do it in their free time, which is limited (also, if the project is complex, there are often fewer people experienced with it and less total free time for support; I think this currently applies to Lemmy a lot).
With that in mind, when I come asking for support I am mostly prepared to not get any. I am prepared to have to dive into the codebase, debug, deconstruct, debug, swear, swear some more. Maybe this is just me and I've mostly had really bad luck, but I don't know.
Should the devs/owners of any OSS project be ready to provide (some) support for their product if they want it to survive? Probably yes, and how much is enough depends on the project, on you, on anyone.

What kind of effect do you think this might have on other potential lemmy hosters?

My opinion is that currently, lemmy is simply not ready for non-tech people. (And I can't really imagine it ever will be, unless there are a lot of people active in the development who are willing to help others. At least currently there are just too many moving parts that require at least some amount of technical experience. Also, lemmy is not something like a GUI application meant to be used by non-tech people: if you want to deploy your own lemmy instance, you, the admin, are the target user of that software. I am not talking about UX/UI here.)

Also, as someone else has commented here, hosting something for myself is easy, hosting for friends is just slightly harder, but hosting something for the public, with hundreds to thousands of people, makes it an order of magnitude more difficult (now you need active monitoring, durable backups, ...).

[-] db0@lemmy.dbzer0.com 6 points 8 months ago

You surely noticed that I was more than prepared to get my hands dirty during this incident. 😉

When I speak about support, I don't mean having people doing it for me.

But overall you don't seem to disagree with me that hosting your own lemmy is not for the non-technical. Which is what nutomic took issue with.

[-] taaz@biglemmowski.win 1 points 8 months ago* (last edited 8 months ago)

But overall you don’t seem to disagree with me that hosting your own lemmy is not for the non-technical. Which is what nutomic took issue with.

I read it as them taking issue with you having different infra than recommended/expected, more than with lemmy (not) being non-tech friendly. (I am going to sleep right now, I will check in tomorrow, well, later today.)

[-] Simon@lemmy.dbzer0.com 1 points 8 months ago

This is my job, so I'll counter that this isn't realistic, and in a professional situation it would probably be hosted in kubernetes which spans multiple servers and sometimes multiple regions - I don't think the devs have a readme for that.. (or maybe they do). The point being that the official docs are geared for a hobbyist to set up a node and not having separate VMs makes sense in that scenario. However I would say that it's plain that mister db0 has a much larger instance than could be considered hobbyist at this point.

this post was submitted on 08 Mar 2024
