I’ve been noticing Reddthat going down for short periods (a few minutes or so) more often than before - is everything okay?

https://lestat.org/status/lemmy shows our uptime at 88.82% right now - not horrible, but not great either. We of course don’t need to be up 99.9999% of the time, but still.

Is there anything we can help with besides donating? I like it here, so I want to make sure this place stays up for a long time 👍

  • TiffMA · 28 points · 11 months ago

    These were because of recent spam bots.

    I made some changes today. We now have 4 containers for the UI (we only had 1 before) and 4 for the backend (we only had 2 before).

    It seems that when you delete a user and tell Lemmy to also remove their content (the spam), it tells the database to mark all of that content as deleted.

    Kbin.social had about 30 users who posted 20-30 posts each, which I told Lemmy to delete.
    This only marks the content as deleted for Reddthat users until the mods mark the posts as deleted and that federates out.

    The problem

    The UPDATE in the database (marking the spam content as deleted) takes a while and the backend waits(?) for the database to finish.

    Even though the backend has 20 different connections to the database, it uses 1 connection for the UPDATE and then waits/gets stuck.
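
    Roughly, what the backend is asking the database to do looks something like the SQL below. This is only a sketch to show the shape of the problem - the actual table and column names in Lemmy’s schema may differ:

        -- Sketch only: table/column names are assumptions, not Lemmy's exact schema.
        -- Marking every post/comment by a banned user as removed touches a lot of rows
        -- in a single statement, and the connection running it is tied up until the
        -- whole UPDATE finishes.
        UPDATE comment SET removed = true WHERE creator_id = 12345;  -- hypothetical spammer id
        UPDATE post    SET removed = true WHERE creator_id = 12345;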

    This is what is causing the outages unfortunately, and it’s really pissing me off to be honest. I can’t remove content or action reports without someone seeing an error.

    I don’t see anything in the 0.18.3 release notes that would solve this.

    Temp Solution

    So to combat this a little, I’ve increased our backend processes from 2 to 4 and our frontend processes from 1 to 4.

    My idea is that if 1 of the backend processes gets “locked” up while performing tasks, the other 3 processes should take care of it.

    This unfortunately is an assumption, because if the “removal” performs an UPDATE on the database and the /other/ backend processes are aware of this and wait as well… that would count as “locking up” the database, and it won’t matter how many processes I scale out to: the applications will lock up and cause us downtime.
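
    For reference, the scaling change itself is just running more replicas of each service. Something like the command below does it with Docker Compose - the service names lemmy and lemmy-ui are the ones from the standard Lemmy compose file and are an assumption, so they may not match our exact setup:

        # Assumed service names; adjust to the actual compose file.
        docker compose up -d --scale lemmy=4 --scale lemmy-ui=4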

    Next Steps

    • Upgrade to 0.18.3 as it apparently has some database fixes.
    • Look at the Lemmy API and see if there is a way I can push certain API commands (user removal) off to their own container.
    • Figure out how to make the nginx proxy container know when a “backend container” is down, and try the other ones instead.

    Note: we are kinda doing point #3 already; it does a round-robin (tries each sequentially). But from what I’ve seen in the logs it can’t differentiate between a backend that is down and one that is up. (From the nginx documentation, that feature is a paid one.)
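
    For what it’s worth, the paid feature is the active health_check directive in NGINX Plus. Open-source nginx can still do passive failure detection with max_fails/fail_timeout on the upstream servers, plus proxy_next_upstream to retry a failed request on another backend, and the same location matching could in principle send the slow admin calls (point #2) to a dedicated container. A rough sketch with assumed container names, port and API path - not our actual config:

        # Sketch only: container names, port 8536 and the API path are assumptions.
        upstream lemmy_pool {
            # Passive failure detection: skip a backend for 30s after 3 failures.
            server lemmy-1:8536 max_fails=3 fail_timeout=30s;
            server lemmy-2:8536 max_fails=3 fail_timeout=30s;
            server lemmy-3:8536 max_fails=3 fail_timeout=30s;
            server lemmy-4:8536 max_fails=3 fail_timeout=30s;
        }

        upstream lemmy_admin {
            # Hypothetical dedicated container for slow admin actions (point #2).
            server lemmy-admin:8536;
        }

        server {
            listen 80;

            # Hypothetical route for user removal; the real API path needs checking.
            location /api/v3/user/ban {
                proxy_pass http://lemmy_admin;
            }

            location / {
                proxy_pass http://lemmy_pool;
                # On error/timeout, retry the request on the next backend (point #3).
                proxy_next_upstream error timeout http_502 http_503;
            }
        }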

    Cheers, Tiff

    • Tiff · 20 points · 11 months ago

      Updates hiding in the comments again!

      We are now using v0.18.3!

      There was extended downtime because docker wouldn’t cooperate AT ALL.

      The nginx proxy container would not resolve the DNS. So after rebuilding the containers twice and investigating the docker network settings, a “simple” reboot of the server fixed it!

      1. Our database on the filesystem went from 33GB to 5GB! They were not kidding about the 80% reduction!
      2. The compressed database backups went from 4GB to ~0.7GB! Even bigger space savings.
      3. The changes to the backend/frontend have resulted in less downtime when performing big queries on the database so far.
      4. The “proxy” container is nginx, and it uses the configuration blocks upstream lemmy-ui and upstream lemmy. These contain DNS names which are resolved once and cached for a period of time. So when a new container comes online the proxy doesn’t actually find it, because it has already cached all of the IPs that lemmy-ui resolves to. (In this example it would have been only 1, and when we add more containers the proxy would never find them.)
      4.1 You can read more here: http://forum.nginx.org/read.php?2,215830,215832#msg-215832
      5. The good news is that https://serverfault.com/a/593003 is the answer to the question. I’ll look at implementing this over the next day(s); there’s a rough sketch of the idea just below.
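
      The gist of that answer: instead of a static upstream block (which nginx resolves once and then caches), you point nginx at Docker’s embedded DNS with a resolver directive and use a variable in proxy_pass, which forces re-resolution once the cached entry expires. A rough sketch with an assumed service name and port rather than our exact config:

          # Sketch only: service name and port are assumptions.
          # 127.0.0.11 is Docker's embedded DNS server.
          resolver 127.0.0.11 valid=5s;

          server {
              location / {
                  # Using a variable makes nginx re-resolve "lemmy-ui" at request
                  # time (subject to valid=5s) instead of only at startup.
                  set $lemmy_ui_backend http://lemmy-ui:1234;
                  proxy_pass $lemmy_ui_backend;
              }
          }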

      I get notified whenever Reddthat goes down, and most of the time it coincided with me banning users and removing content. So I didn’t look into it much, but honestly the uptime isn’t great. (Red is <95% uptime, which means we were down for 1 hour!)

      Actually, it is terrible.

      With the changes we’ve made, I’ll be monitoring it over the next 48 hours to confirm that we no longer have any real issues. Then I’ll make a real announcement.

      Thanks all for joining our little adventure!
      Tiff

      • Stimmed · 1 point · 11 months ago

        For number 4, can you set up a cron job to constantly flush the DNS cache?

        • TiffMA · 3 points · 11 months ago

          It’s the internal nginx cache. It /shouldn’t/ be a problem once I update the configuration to handle it.

          We can add a resolver line with valid=5s so nginx re-checks every 5 seconds instead of relying on whatever the internal Docker DNS TTL is.

    • RagnarokOnline · 10 points · 11 months ago

      Wow, that limitation in the Lemmy design sucks. Thanks for working so hard to figure it out!

      • TiffMA · 11 points · 11 months ago

        Yeah, I don’t remember it happening in 0.17, but that was back when we had websockets! So everything was inherently more synchronous.

        0.18.3 has “database optimisations” which apparently also result in 80% space savings. (Like wtf, how could we save 80%!!!)

        Anyway I’ll be testing that on the dev server tonight and then I’ll plan a maintenance window.

        • dartos · 2 points · 11 months ago

          Wait they removed websocket support?!?! Why?

          How were things more synchronous with websockets?

          • TiffMA · 6 points · 11 months ago

            I think the websocket support was “clunky”, and it resulted in weird things happening.

            Because all the clients were on a websocket, everything was sent immediately to the clients from the server, and you didn’t need to “wait” for the long-running queries.

            But that only really affects admins/mods. The current system is a lot better! And since 0.18.3 the db optimisations have helped! Removing a user and all their content doesn’t take as long now, and the changes I made for extra horizontal scaling really helped.

    • doctortofu (OP) · 7 points · 11 months ago

      Got it, thanks for the detailed answer! Some more growing pains, it seems - hope the new updates fix it!

  • Schwim Dandy · 1 point · 11 months ago

    If you feel the uptime here is not that great, you should hang out on lemmy.world. Well, normally you can’t, since it’s down, but you get the point.