You mean in the context of high availability?
tl;dr: It’s to test if the cluster fail-over configuration is working properly.
So this was before things like Kubernetes or Terraform were a thing, so it had to be done by the operating system itself. The simplest HA cluster is made of two nodes, one “active”, the other “passive”. The active node does all the work, and the passive node just keeps its data synchronised with the active node. I used to use DRBD for this, which is a system for copying writes on the active node over a network link to the passive node.
That only gives you a “second, up-to-date copy”, which is not that useful on its own - you also need a way to automatically switch over to the passive node if the active one “dies”. For that I used to use “heartbeat”, which simply passes packets back and forth between the two cluster members - ping-pong style. If the passive node notices that the active node hasn’t sent its scheduled packet for, say, 10 seconds, it cuts off the current active node (kills it) and promotes itself to the active role, thus preserving the service.
Killing the “other node” is necessary to stop data corruption, or user requests going to a node that can’t actually service them, and is called STONITH - Shoot The Other Node In The Head. STONITH can involve an electronically controlled switch which literally cuts off power to the “other” node, or it can isolate the node on the network by shutting down its ports on the switch, or, in a VM setup, it can send a notification to the hypervisor to kill the VM.
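For anyone who finds it easier to read as code, here is a rough sketch of the decision the passive node makes. It is a toy model, not how the real Heartbeat daemon is written: the `PassiveNode` class, the `fence_peer` hook and the `dead_time` name are all made up for illustration, and the rule that the node stays passive when fencing fails is my assumption about the safe behaviour rather than something taken from a particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PassiveNode:
    """Toy model of the passive node's heartbeat watchdog (illustrative only)."""

    dead_time: float                 # seconds of silence before the peer is declared dead
    fence_peer: Callable[[], bool]   # hypothetical STONITH hook; True means the peer is confirmed off
    last_seen: float = 0.0           # timestamp of the last heartbeat packet received
    role: str = "passive"
    log: List[str] = field(default_factory=list)

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat packet arrives from the active node.
        self.last_seen = now

    def tick(self, now: float) -> str:
        # Called periodically; decides whether it is time to take over.
        if self.role == "passive" and now - self.last_seen > self.dead_time:
            self.log.append("peer down")
            if self.fence_peer():
                # Only promote once the other node is known to be dead.
                self.log.append("peer fenced")
                self.role = "active"
                self.log.append("promoted")
            else:
                # Assumption: promoting without fencing risks two active nodes.
                self.log.append("fencing failed; staying passive")
        return self.role


# Example: 10 second dead time, with a fencing hook that always succeeds.
node = PassiveNode(dead_time=10.0, fence_peer=lambda: True)
node.on_heartbeat(now=100.0)
node.tick(now=105.0)   # 5 s of silence: still passive
node.tick(now=111.0)   # 11 s of silence: fence the peer, then promote
```

Time is passed in as a plain number so the logic can be stepped through by hand. A real setup would read a clock, and `fence_peer` would talk to a power switch, a network switch or a hypervisor.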
The reason you need to be able to kill the kernel on the active node is that when you manually shut down the active node, it automatically informs the passive node that it’s going down. That is known as an “orderly fail-over”, so you’re not actually testing whether the heartbeat fail-over works, you’re just testing an orderly fail-over. Killing the active node’s kernel tests that the passive node is properly configured to take over during a catastrophic failure of the active node. You can watch the heartbeat status go from “up” to “down”, and then see the passive node decide to take over, promote itself, bring up its services, and begin processing requests.
To make sure it’s all working, you need to test orderly fail-overs first, from both nodes, then test disorderly fail-overs both ways, by using the kernel gun on the active node.
Things moved on from Heartbeat-based HA clusters to multi-node clusters managed by Corosync and other software, enabling other strategies to be employed. These were eventually supplanted by “orchestration” systems like Kubernetes, and proprietary Virtual Cloud systems that move this functionality to the platform rather than the operating system.
I see! That’s fascinating stuff. I only do simple home hosting, so I never get into deployments like this, or how things used to be done, but love to hear the intricacies of it.
Yeah it was wild, but I suspect few orgs do things that way any more.