Basically, I’m sick of these network problems, and I’m sure you are too. We’ll be migrating everything (pictrs, frontends & backends, database & webservers) onto one single server at OVH.

First it was a CPU issue, so we worked around that by moving pictrs to another server, which left just enough CPU to keep us all okay. Everything was fine until the spammers attacked. Then we couldn’t process the activities fast enough, and now we can’t catch up.

We were having constant network dropouts and lag spikes where all the networking connections got “pooled”, with a CPU steal of 15%. So we bought more vCPU and threw resources at the problem. That fixed it temporarily, but our “NVMe” VPS, which housed our database and Lemmy applications, was still showing an IOWait of 10-20% half the time. Unbeknownst to me, that was not IO related but network related.

So we moved the database off to another server, but unfortunately that caused another issue (the unintended side effects of cheap hosting?). Now we had 1 main server accepting all network traffic, which then had to contact the NVMe DB server and the pict-rs server as well, and then send all that information back to the users. This was part of the network problem.
Adding backend & frontend Lemmy containers to the pict-rs server helped alleviate this, and that setup is what you are seeing at the time of this post. Now a good 50% of the required database and web traffic is split across two servers, which keeps our servers from being completely saturated with requests.

On top of the recent nonsense, it looks like we are limited to 100Mb/s, which is roughly 12MB/s. So downloading a 20MB video via pictrs currently goes through the following flow (in this example):

  • User requests the file via Cloudflare.
  • (It’s not already cached, so Cloudflare requests it from our servers.)
  • Cloudflare proxies the request to our server (app1).
  • Our app1 server connects to the pictrs server.
  • Our app1 server downloads the file from pictrs at a maximum of 100Mb/s.
  • At the same time, the app1 server is uploading the file via Cloudflare to you at a maximum of 100Mb/s.
  • During this time our connection is completely saturated and no other network queries can be handled.

This is, of course, just one example of the network issue I found out we had after moving to the multi-server setup. It is not a problem when you have everything on one beefy server.
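
To put rough numbers on the example above, here is a minimal sketch. It assumes the link really is capped at 100Mb/s and ignores protocol overhead; the function names are made up for illustration.

```python
# Back-of-the-envelope numbers for the 20MB example, assuming the link
# is capped at 100Mb/s (megabits per second) and nothing else is using it.

LINK_MBIT_PER_S = 100  # cap quoted in the post
FILE_MBYTE = 20        # example video size from the post


def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


def transfer_seconds(size_mbyte: float, link_mbit_per_s: float) -> float:
    """Seconds needed to move a file of the given size over the link."""
    return size_mbyte / mbit_to_mbyte(link_mbit_per_s)


link_mbyte_per_s = mbit_to_mbyte(LINK_MBIT_PER_S)
saturated_for = transfer_seconds(FILE_MBYTE, LINK_MBIT_PER_S)

print(f"{LINK_MBIT_PER_S}Mb/s is {link_mbyte_per_s}MB/s")
print(f"A {FILE_MBYTE}MB file holds the link for about {saturated_for:.1f}s")
```

So under these assumptions a single uncached 20MB request can occupy the whole link for more than a second, and every other request has to wait behind it.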


Those are the broad strokes of the problems.

Thus we are completely ripping everything out and migrating to a HUGE OVH box. I say huge in capital letters because the OVH server is $108/m and has 8 vCPU, 32GB RAM, & 160GB of NVMe. This amount of RAM allows the whole database to fit into memory. If this doesn’t help then I’d be at a loss as to what will.
Currently (assuming we kept paying for the standalone postgres server) our monthly costs would have been around $90/m: $60/m (main) + $9/m (pictrs) + $22/m (db).

Migration plan:

The biggest downtime will be the database migration, because to ensure consistency we need to take it offline. Which is just simpler than

DB:

  • stop everything
  • start postgres
  • take a backup (20-25 mins)
  • send that backup to the new server (5-6 mins, limited to 12MB/s)
  • restore (10-15 mins)
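
As a rough sanity check on the timings above, here is a small sketch that adds up the steps when they run one after another. The implied backup size is only an inference, assuming the transfer ran at the full 12MB/s the whole time; the helper names are hypothetical.

```python
# Rough downtime estimate from the step timings listed in the post.
# Each step is (name, best_case_minutes, worst_case_minutes).
STEPS = [
    ("backup", 20, 25),
    ("transfer", 5, 6),
    ("restore", 10, 15),
]

TRANSFER_MBYTE_PER_S = 12  # transfer cap quoted in the post


def total_range(steps):
    """Sum best-case and worst-case minutes for steps run sequentially."""
    best = sum(step[1] for step in steps)
    worst = sum(step[2] for step in steps)
    return best, worst


def implied_size_gb(minutes: float, mbyte_per_s: float) -> float:
    """Data moved in `minutes` at a steady rate, in GB (1 GB = 1000 MB)."""
    return minutes * 60 * mbyte_per_s / 1000


best, worst = total_range(STEPS)
smallest = implied_size_gb(5, TRANSFER_MBYTE_PER_S)
largest = implied_size_gb(6, TRANSFER_MBYTE_PER_S)

print(f"Sequential DB downtime: {best}-{worst} minutes")
print(f"Implied backup size: {smallest:.1f}-{largest:.1f} GB")
```

Run strictly in sequence, the database steps alone come to roughly 35-46 minutes, so any overlap between steps directly shortens the outage.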

pictrs:

  • syncing the file store across to the new server

app(s):

  • regular deployment

This is the same process I recently did here, so I have the steps already cemented in my brain. As you can see, taking a backup ends up taking longer than restoring. That’s because, when testing the restore process on our OVH box, we were nowhere near any IO/CPU limits and it was, to my amazement, seriously fast. Now we’ll have heaps of room to grow, with a stable donation goal for the next 12 months.

See you on the other side.

Tiff

  • Tiff (OP), 8 months ago

    I managed to streamline the exports and syncs so that we could perform them concurrently, allowing us to finish in just under 40 minutes! Enjoy the new hardware!

    So it begins: (Federation “Queue”)
    [Image: federation queue showing an upward trend, then down, then slightly back up again]

    • @[email protected], 8 months ago

      OMG, posts load instantly now; they used to take 3 to 15 seconds. I’m on the US East Coast for reference.

      • Tiff (OP), 8 months ago

        That’s when the US timezones wake up. We physically cannot accept more than 3 requests per second. “Physically” here means the actual physical limits of the network: 3 x 287ms = 861ms, where we used to be at 930ms+. (The server move got us 21ms closer!) LW generates more than 3 activities per second during the US “awake” hours, so we have a period of 8 hours where we need to catch up.

        Like I said in our forcing federation post, there isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.

        It’s just the sequential nature of Lemmy. I’m going to test a new container in the next 12 hours which removes the blocking metadata generation from the accepting of activities. That way we can guarantee at least 3 activities a second.

        Realistically, that is a minor fix and it won’t help with those graphs in the long term. We will need parallel sending for it to ever scale.

        On a side note, while we were on our old server and using our forcing federation script, we had it set to 10 parallel requests. The server wasn’t even bothered by it; I saw no increase in server load. That is good news for the lemmyverse in general, as everyone will be able to accept the new parallel sending without needing to increase their hardware.

        Tiff
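
The arithmetic in the comment above can be sketched as follows. This is only an illustration: it assumes each activity costs one full round trip before the next can be accepted, and the `parallel` parameter is a hypothetical stand-in for the parallel sending idea, not an actual Lemmy setting.

```python
import math


def max_activities_per_second(round_trip_ms: float, parallel: int = 1) -> float:
    """Upper bound on throughput when each in-flight activity waits one round trip."""
    return parallel * 1000 / round_trip_ms


NEW_RTT_MS = 287       # per-request figure quoted in the comment
OLD_RTT_MS = 930 / 3   # "930ms+" for three requests on the old server

sequential_new = max_activities_per_second(NEW_RTT_MS)
sequential_old = max_activities_per_second(OLD_RTT_MS)
parallel_new = max_activities_per_second(NEW_RTT_MS, parallel=10)

# Only whole activities complete inside one second, hence the floor.
print(f"new server, sequential: {math.floor(sequential_new)} per second")
print(f"old server, sequential: {math.floor(sequential_old)} per second")
print(f"new server, 10 in parallel: {math.floor(parallel_new)} per second")
```

Under this simplified model the round trip time, not the server hardware, sets the ceiling for sequential delivery, which is why parallel sending matters more than a faster box.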

        • @[email protected], 8 months ago

          Thank you for the detailed answer!

          There isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.

          Sorry, it’s a bit late for me on this side, but if I understand correctly, posts and comments are indeed up-to-date, but upvotes are synchronized later, is this correct?

          Thank you for the work as always!

          • Tiff (OP), 8 months ago

            but upvotes are synchronized later

            Correct. All votes are synchronised eventually.

    • Tiff (OP), 8 months ago

      Glad the link worked! It’s always risky posting mp4 links. I’ll be glad once the new front end patches come through so that, by default, it shows an image of the video (iirc).

      • @[email protected], 8 months ago

        FWIW I didn’t know it was a video until you said something haha. The video did also work when I clicked on it, though.

  • Tiff (OP), 8 months ago

    PS. Everyone enjoying this new wide layout?

      • Tiff (OP), 8 months ago

        I changed the default theme to the “Compact” version, which makes it widescreen. If you’ve set your own theme then nothing changes for you. If you open up reddthat.com in a private browser window you should see it.

  • Lad, 8 months ago

    I’ve noticed some issues since moving to reddthat. Glad to see a fix is being worked on, keep up the good work :)

    • Tiff (OP), 8 months ago

      How are your issues now? 🧐

      • Lad, 8 months ago

        Things seem okay now, no weird behaviour like random logouts and communities not loading 😁

  • @[email protected], 8 months ago

    Did I ever let you know that I pissed off a CIA asset that launders Russian oligarch monies using the FBI, Filipino organized crime, the Albanians, and other US based law enforcement via FedEx, UPS, USPS, and Joann’s Fabrics?

    Should I contribute more monthly to cover their probable sabotaging of reddthat?

    • Tiff (OP), 8 months ago

      Between you and me, you personally probably don’t need to donate more in the short term, but I’m not going to stop you! 😛

      We need about A$40-50/month extra to cover everything now. We have A$77.22 set up in recurring donations on OpenCollective, and our server bills alone are A$115 (converted from US$74.80), plus domain renewal (€1.50/m) and Wasabi storage (~US$8/m). This will be covered in an updated funding post. With the money on Ko-Fi and OpenCollective, plus the recurring donations on OpenCollective, we have at least 12 months of runway before we run out of money. So it isn’t critical at the moment.

      Thanks! 🤎

      Edit: Actual prices