Basically, I’m sick of these network problems, and I’m sure you are too. We’ll be migrating everything (pictrs, frontends & backends, database & webservers) to one single server at OVH.
First it was a CPU issue, so we worked around that by putting pictrs on another server, which left us just enough CPU to keep us all okay. Everything was fine until the spammers attacked. Then we couldn’t process the activities fast enough, and now we can’t catch up.
We are having constant network dropouts/lag spikes where all the networking connections get “pooled”, with a CPU steal of 15%. So we bought more vCPU and threw resources at the problem. Problem temporarily fixed, but our “NVMe” VPS, which housed our database and lemmy applications, was still showing an IOWait of 10-20% half the time. Unbeknownst to me, it was not IO related but network related.
So we moved the database off to another server, but unfortunately that caused another issue (the unintended side effects of cheap hosting?). Now we have 1 main server accepting all network traffic, which then has to contact the NVMe DB server and the pict-rs server as well, and then send all that information back to the users. This was part of the network problem.
Adding backend & frontend lemmy containers to the pict-rs server helped alleviate this, and is what you are seeing at the time of this post. Now a good 50% of the required database and web traffic is split across two servers, which keeps our servers from being completely saturated with requests.
On top of the recent nonsense, it looks like we are limited to 100Mb/s, which is roughly 12MB/s. So downloading a 20MB video via pictrs currently goes through the following flow (in this example):
- User requests image via cloudflare
- (it’s not already cached so we request it from our servers)
- Cloudflare proxies the request to our server (app1).
- Our app1 server connects to the pictrs server.
- Our app1 server downloads the file from pictrs at a maximum of 100Mb/s,
- At the same time, the app1 server is uploading the file via cloudflare to you at a maximum of 100Mb/s.
- During this point in time our connection is completely saturated and no other network queries could be handled.
This is of course an example of the network issue I found out we had after moving to the multi-server system. This is of course not a problem when you have everything on one beefy server.
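To put rough numbers on that example, here is a back-of-the-envelope sketch in Python. It assumes the link really is capped at 100Mb/s in each direction and ignores protocol overhead; the function name is made up for illustration.

```python
def transfer_seconds(size_mb: float, link_mbit_per_s: float = 100.0) -> float:
    """Seconds needed to move size_mb megabytes over a link_mbit_per_s link."""
    # 8 bits per byte: 100 Mb/s is 12.5 MB/s (rounded to 12MB/s in the post)
    link_mbyte_per_s = link_mbit_per_s / 8
    return size_mb / link_mbyte_per_s


video_mb = 20  # the 20MB video from the example
seconds = transfer_seconds(video_mb)
print(f"{video_mb}MB at 100Mb/s takes about {seconds:.1f}s")
```

So under these assumptions a single 20MB video keeps the app1 link busy for roughly a second and a half, downloading from pictrs and uploading to Cloudflare at the same time.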
Those are the broad strokes of the problems.
Thus we are completely ripping everything out and migrating to a HUGE OVH box. I say huge in capital letters because the OVH server is $108/m and has 8 vCPU, 32GB RAM, & 160GB of NVMe. This amount of RAM allows for the whole database to fit into memory. If this doesn’t help then I’d be at a loss as to what will.
Currently (assuming we kept paying for the standalone postgres server) our monthly costs would have been around $90/m. ($60/m (main) + $9/m (pictrs) + $22/m (db))
Migration plan:
The biggest downtime will be the database migration, because to ensure consistency we need to take it offline. Which is just simpler than
DB:
- stop everything
- start postgres
- take a backup (20-25 mins)
- send that backup to the new server (5-6 mins, limited to 12MB/s)
- restore (10-15 mins)
pictrs
- syncing the file store across to the new server
app(s)
- regular deployment
This is the same process I recently did here, so I have the steps already cemented in my brain. As you can see, taking a backup ends up taking longer than restoring. That’s because, when testing the restore process on our OVH box, we were nowhere near any IO/CPU limits and it was, to my amazement, seriously fast. Now we’ll have heaps of room to grow with a stable donation goal for the next 12 months.
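As a rough check on the database steps in the plan, the sketch below adds up the estimates as if each step ran strictly one after another. The step times and the 12MB/s limit come from the plan; the function names are illustrative, and the implied backup size is only an estimate derived from those numbers, not a measured figure.

```python
# (min, max) minutes per database step, taken from the migration plan
steps = {
    "backup": (20, 25),
    "transfer": (5, 6),
    "restore": (10, 15),
}


def sequential_downtime(steps):
    """Total (min, max) minutes if every step runs one after another."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


def implied_size_gb(minutes, mb_per_s=12.0):
    """Data moved in `minutes` at `mb_per_s` MB/s, in GB (1GB = 1000MB)."""
    return minutes * 60 * mb_per_s / 1000


low, high = sequential_downtime(steps)
print(f"Sequential DB downtime: {low}-{high} minutes")
print(f"Implied backup size: {implied_size_gb(5):.1f}-{implied_size_gb(6):.1f} GB")
```

This only covers the database steps; the pictrs sync and app deployment have no time estimates in the plan, so they are left out.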
See you on the other side.
Tiff
I managed to streamline the exports and syncs so we performed them concurrently, allowing us to finish in just under 40 minutes! Enjoy the new hardware!
So it begins: (Federation “Queue”)
OMG, posts load instantly now, used to take 3 to 15 seconds. I’m in US East Coast for reference.
🐊 Snap Snap!
That’s what I love to hear! 🎉
Well done!
I just had a look at the graph, it looked good until now, but now it’s up again :(
That’s when the US timezones wake up. We physically cannot accept more than 3 requests per second. Physically being the actual physical network limits (3 x 287ms = 861ms; we used to be at 930ms+. The server move got us 21ms closer!). LW generates more than 3 activities per second during US “awake” time zones. So we have a period of 8 hours where we need to catch up. Like I’ve said in our forcing federation post, there isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.
It’s just the sequential nature of Lemmy. I’m going to test a new container in the next 12 hours which removes the blocking metadata generation from the accepting of activities. That way we can guarantee at least 3 activities a second.
Realistically, that is a minor fix and it won’t help with those graphs in the long term. We will need to have parallel sending for it to ever scale.
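The arithmetic behind the 3-per-second ceiling can be sketched like this. It is a simplified model, not how Lemmy is implemented: it assumes every activity costs one full 287ms round trip and that parallel requests scale linearly. The 5 activities/s incoming rate is a made-up figure purely for illustration (the only claim above is "more than 3").

```python
def max_activities_per_second(round_trip_ms: float, parallel: int = 1) -> float:
    """Upper bound on accepted activities/s with `parallel` requests in flight."""
    return parallel * 1000.0 / round_trip_ms


def backlog_after(hours: float, incoming_per_s: float,
                  round_trip_ms: float, parallel: int = 1) -> float:
    """Activities queued up after `hours` at a constant incoming rate."""
    capacity = max_activities_per_second(round_trip_ms, parallel)
    surplus = max(0.0, incoming_per_s - capacity)
    return surplus * hours * 3600


rtt_ms = 287           # per-request round trip from the comment above
assumed_incoming = 5   # hypothetical rate, NOT a measured number

print(f"Sequential ceiling: {max_activities_per_second(rtt_ms):.2f}/s")
print(f"Backlog after 8h, sequential: {backlog_after(8, assumed_incoming, rtt_ms):.0f}")
print(f"Backlog after 8h, 10 parallel: {backlog_after(8, assumed_incoming, rtt_ms, 10):.0f}")
```

Under this model the sequential ceiling sits just under 3.5 activities per second, so any sustained rate above that builds a queue, while even a modest number of parallel requests removes the bottleneck.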
On a side note while we were on our old server and were using our forcing federation script, we had it set to 10 parallel requests. It didn’t even worry about it. I saw no increase in server load. Which is good news for the lemmyverse in general, as everyone will be able to accept the new parallel sending without needing to increase their hardware.
Tiff
Thank you for the detailed answer!
There isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.
Sorry, it’s a bit late for me on this side, but if I understand correctly, posts and comments are indeed up-to-date, but upvotes are synchronized later, is this correct?
Thank you for the work as always!
but upvotes are synchronized later
Correct. All votes are synchronised eventually.
Great news! Thank you so much for this!
until the ~~fire nation~~ spammers attacked.
Hehe
😁
Legend.
/gif legen-dary
Fingers crossed that everything goes well!
In the meantime, here’s a counter until the event that should work for any timezone
Thanks!
Idk crap about lemmy backend stuff, I’m just here for Legends of the Hidden Temple.
Glad the link worked! It’s always risky posting mp4 links. I’ll be glad once the new front end patches come through so that, by default, it shows an image of the video (iirc).
FWIW I didn’t know it was a video until you said something haha. The video did work though also when I clicked on it.
Good luck for the migration!
It starts… Soon. 😎
PS. Everyone enjoying this new wide layout?
Did anything change? If so, I didn’t notice it ha ha
I changed the default theme to be the “Compact” version. Which makes it wide screen, but if you’ve set your own then it doesn’t change it. If you open up reddthat.com in a private browser you should see it.
I’ve noticed some issues since moving to reddthat. Glad to see a fix is being worked on, keep up the good work :)
If you moved recently, you are quite unlucky with the timeframe ha ha
How are your issues now? 🧐
Things seem okay now, no weird behaviour like random logouts and communities not loading 😁
Thank you for all the hard work!
Did I ever let you know that I pissed off a CIA asset that launders Russian oligarch monies using the FBI, Filipino organized crime, the Albanians, and other US based law enforcement via FedEx, UPS, USPS, and Joann’s Fabrics?
Should I contribute more monthly to cover their probable sabotaging of reddthat?
Between you and me, you personally probably don’t need to donate more in the short term, but I’m not going to stop you! 😛
We need about A$40-50/month extra to cover everything now. We have A$77.22 set up in recurring donations on OpenCollective, and just our server bills are A$115 (converted from US$74.80), plus Domain Renewal (1.5/m Euro) and Wasabi Storage (~$8/m USD). This will be in the updated Funding post. With the money on Ko-Fi, OpenCollective and the recurring donations on OpenCollective, we have at least 12 months of runway before we run out of money. So it isn’t critical at the moment.
Thanks! 🤎
Edit: Actual prices