I’ve been running my most recent server build for quite some time now. I think uptime was somewhere around 5 months. Absolutely flawless. A few days ago I started to have issues: hard locks, freezing…but absolutely zero log entries. Nothing. The server was built with “off the shelf” hardware and no ECC (even though the Ryzen CPU technically supports it; at the time, ECC 3200 MHz memory was still a lot more expensive than it is now) and is running ZFS. Risky business, but it’s “just” a home server. I would never build a server running mission-critical stuff like that (and I’ve been doing that for over 10 years now as my main job). Over the last few weeks, I’ve been trying some stuff and had a pretty high memory load.

In any case, I also like astrophysics and subscribe to some newsletters about auroras and so on. Auroras are extremely rare here in southern Germany. Yesterday we had one of the biggest and brightest I’ve ever seen.

But it got me thinking about my hard locks and crashes, and I remembered I had an account for ESA’s SSCC (SSA Space Weather Coordination Centre). They have something called “Post-Event Analysis”, where you can correlate certain timestamps with real-time data, for example from DSCOVR (“THE” space weather satellite).
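
If you want to do that kind of correlation yourself on exported data, here is a minimal sketch. It assumes you already have the measurements as a time-sorted list of `(timestamp, value)` pairs; the function name `nearest_sample`, the 5-minute tolerance and the sample values are all made up for illustration and have nothing to do with the SSCC tooling itself.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_sample(samples, when, tolerance=timedelta(minutes=5)):
    """Return the (time, value) pair closest to `when`.

    `samples` must be sorted by time. Returns None if the list is empty
    or the closest sample is further away than `tolerance`.
    """
    times = [t for t, _ in samples]
    i = bisect_left(times, when)
    # Only the neighbours just before and just after `when` can be closest.
    candidates = samples[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: abs(s[0] - when))
    return best if abs(best[0] - when) <= tolerance else None


# Placeholder values, NOT real measurements.
samples = [
    (datetime(2000, 1, 1, 0, 0), -2.0),
    (datetime(2000, 1, 1, 0, 1), -15.0),
    (datetime(2000, 1, 1, 0, 2), -18.5),
]
crash = datetime(2000, 1, 1, 0, 1, 20)  # hypothetical freeze time
print(nearest_sample(samples, crash))
```

With those placeholder values, the freeze at 00:01:20 is matched to the 00:01 sample. A match like this only shows that two things happened at the same time, not that one caused the other.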

For auroras to occur, the so-called “Bz value” is important. Basically, it is the north-south component of the interplanetary magnetic field carried by the solar wind. If it points north (positive Bz), it is aligned with Earth’s magnetic field where the two meet, and most of the charged particles the sun throws at us get deflected. If it points south (negative Bz), it opposes Earth’s field, the two fields connect, and the particles “come in” and produce auroras…because the charged particles excite other atoms. They generally excite oxygen, which results in green auroras. They can also do all sorts of other stuff (and that’s why spaceships, sats and other stuff floating around in space need shielding). The value is measured in nanotesla (nT).

There’s also the Kp index…which was 7-8, out of 9.
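
To put that number in context, here is a small sketch that maps a Kp value to NOAA’s geomagnetic storm scale (G1 to G5). The mapping is NOAA’s published scale, not something from the SSCC data above; the function name `kp_to_g_scale` is made up, and it assumes whole-number Kp values (the real index also has in-between steps, which this ignores).

```python
def kp_to_g_scale(kp):
    """Map an integer Kp value (0-9) to NOAA's G-scale label.

    Returns None for Kp below 5, which is below storm level.
    """
    if not isinstance(kp, int) or not 0 <= kp <= 9:
        raise ValueError("Kp must be an integer from 0 to 9")
    levels = {
        5: "G1 (minor)",
        6: "G2 (moderate)",
        7: "G3 (strong)",
        8: "G4 (severe)",
        9: "G5 (extreme)",
    }
    return levels.get(kp)


for kp in (7, 8):
    print(kp, kp_to_g_scale(kp))
```

So a Kp of 7-8 sits in the “strong” to “severe” range on that scale.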

So yeah - I’m pretty sure I experienced a single-event upset/bit flip. Amazing stuff!
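
For anyone wondering why a single flipped bit matters so much, here is a tiny illustration. It is only a sketch: the address is made up, and real memory corruption obviously doesn’t happen in Python integers, but the arithmetic is the same.

```python
def flip_bit(value, bit):
    """Return `value` with the given bit position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


address = 0x7F00_0000_1000          # made-up 64-bit pointer value
corrupted = flip_bit(address, 40)   # one bit flipped by a hypothetical upset

print(hex(address), "->", hex(corrupted))
print("bits that differ:", bin(address ^ corrupted).count("1"))
print("distance in bytes:", abs(address - corrupted))
```

Exactly one bit differs, yet the value now points 2^40 bytes (1 TiB) away from where it should. If that value is a pointer the kernel is about to use, anything can happen, including nothing being logged at all.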

Edit: Picture of the Aurora https://i.imgur.com/TIxketJ.jpg

  • @[email protected]
    8 months ago

    That is amazing! Now, I need to see about using weather satellites to explain the bugs in my code at work…

      • @nexusbandOP
        8 months ago

        It also wouldn’t cause hard locks and freezes without any errors.

        • @SheeEttin
          8 months ago

          It certainly could. A bit-flip in a core part of the kernel could easily cause it to lock up, if an address is corrupted and it starts writing garbage over its code, or execution jumps to somewhere unexpected, or an instruction is changed from something reasonable to a halt.

          Yes, most of those should trigger a blue screen or kernel panic, but that’s not guaranteed when you’re making completely random changes.

          • @nexusbandOP
            8 months ago

            Sure - I should have mentioned that the system itself doesn’t run on the ZFS pool but from its own SSD. So a “ZFS cache in memory bit flip” should (theoretically…) not cause a hard lock/freeze. It would probably trigger a complete garbage collection though.

            And yes - that’s what was so confusing to me, no kernel panic, no log entry…nothing, just a sudden, random freeze.

            • @SheeEttin
              8 months ago

              Right, a bit flip in ZFS cache shouldn’t cause that. But a bit flip in active memory could.

              • @nexusbandOP
                8 months ago

                Absolutely! And I think that’s actually what happened :)

    • @nexusbandOP
      8 months ago

      It probably did - but that’s not why the server crashed :)