• @[email protected]
    10 months ago

    It’s likely there’s a common root cause, like a fiber cut or some other major infrastructure issue. But Down Detector doesn’t really put a scale on its graphs, so it could be that it’s a huge issue at Meta and a minor issue that’s just noticeable for everyone else. In that case, Meta could be the root cause.

    If everyone is mailing themselves their passwords, turning their phones off and on, restarting their browsers, etc. because Meta wasn’t working, it could have knock-on effects for everyone else. It could also be that, because Meta is part of the major ad duopoly, the issue affected their ad system, which affected everyone interacting with a Meta ad, which is basically everyone.

    • a lil bee 🐝
      10 months ago

      I’ve been an SRE for a few large corps, so I’ve definitely played this game. I’m with you that it was likely just the FB identity or ad provider causing most of these issues. So glad I’m out of that role now and back to DevOps, where I’m no longer on call.

      • @[email protected]
        10 months ago

        Yeah. And when the outage is due to something external, it’s not too stressful. As long as you don’t have absolutely insane bosses, they’ll understand that it’s out of your control. So, you wait around for the external system to be fixed, then check that your stuff came back up fine, and go about your day.

        I personally liked being on call when the on-call compensation was reasonable. Like, on call for two 12-hour shifts over the weekend? Two 8-hour days off. If you were good at maintaining your systems, you had quiet on-call shifts most of the time, and you’d quickly earn lots of days off.

        • a lil bee 🐝
          10 months ago

          Yeah, I’d be less worried about internal pressures (which should be minimal at a halfway decently run org) and more about the external ones. I don’t think you would actually end up dealing with anything, but I’d know those huge corps that rely on you are pissed.

          Man, your on-call situation sounds rad! I was salaried and just traded off on-call shifts with my team members, no extra time off. Luckily though, our systems were pretty quiet so it hardly ever amounted to much.

          • @[email protected]
            10 months ago

            I think you want people to want to be on call (or at least be willing to be on call). There’s no way I’d ever take a job where I was on-call and not compensated for being on-call. On-call is work. Even if nothing happens during your shift, you have to be ready to respond. You can’t get drunk or get high. You can’t go for a hike. You can’t take a flight. If you’re going to be so limited in what you’re allowed to do, you deserve to be compensated for your time.

            But, since you’re being compensated, it’s also reasonable that you expect to have to respond to something. If your shifts are always completely quiet, either you or the devs aren’t adding enough new features, or you’re not supervising enough services. You should have an error budget, and be using that error budget. Plus, if you don’t respond to pages often enough, you get rusty, so when there is an event you’re not as ready to handle it.

    • @guacupado
      10 months ago

      Second half is the closest answer in this thread.