You just finished setting up all your services and it works fine - how do you now prepare for eventual drive failure?

Kaldo · 1 year ago

You just finished setting up all your services and it works fine - how do you now prepare for eventual drive failure?

@ikidd · edit-2 1 year ago

I run everything on a 2-node Proxmox cluster with ZFS mirror volumes and replication of the VMs and CTs between the nodes, run PBS with hourly snapshots, and sync that to multiple USB drives that I swap off-site.

The docker VM can be snapshotted with ZFS before major updates, so I can roll back.
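A minimal sketch of that snapshot-before-update routine, assuming the VM disk lives on a ZFS dataset. The dataset name `rpool/data/vm-100-disk-0`, the snapshot name and the `run` dry-run helper are all hypothetical and not from the post. Commands are only printed unless `APPLY=1` is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# take_snapshot DATASET NAME: snapshot the dataset before an update.
take_snapshot() {
    run zfs snapshot "$1@$2"
}

# roll_back DATASET NAME: return to that snapshot (shut the VM down first).
# Plain "zfs rollback" only accepts the most recent snapshot of the dataset.
roll_back() {
    run zfs rollback "$1@$2"
}

# Hypothetical dataset name; substitute the one backing the docker VM.
take_snapshot rpool/data/vm-100-disk-0 pre-update
```

If the update goes badly, `roll_back rpool/data/vm-100-disk-0 pre-update` would print (or, with `APPLY=1`, run) the matching rollback.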

@[email protected] · 1 year ago

You should get another node; otherwise, when node1 fails, node2 will reboot itself and then do nothing because it has no quorum.
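The arithmetic behind that warning can be sketched as follows, assuming the usual majority rule where quorum needs more than half of the expected votes. The function names are made up for illustration.

```shell
#!/bin/sh
# Votes needed for quorum: a strict majority of the expected votes.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum EXPECTED ALIVE: succeeds if the surviving votes reach the threshold.
has_quorum() {
    [ "$2" -ge "$(quorum_threshold "$1")" ]
}

# Two nodes, one down: 1 of 2 votes, threshold is 2.
if has_quorum 2 1; then echo "2 votes, 1 alive: quorate"; else echo "2 votes, 1 alive: not quorate"; fi

# With a third vote, one down: 2 of 3 votes, threshold is still 2.
if has_quorum 3 2; then echo "3 votes, 2 alive: quorate"; else echo "3 votes, 2 alive: not quorate"; fi
```

With only two votes, losing either node drops the survivor below the threshold, which is why a third vote helps.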

@ikidd · 1 year ago

pvecm expected 1

@[email protected] · 1 year ago

I know, but every time I had to do that it felt like a janky solution. If you have a Raspberry Pi or something like that, you can also set it up as a QDevice.

…and if you’re completely fine with how it is, you can also just leave it as it is.

@ikidd · 1 year ago

So I started to write a reply that said basically that I was OK doing that manually, but thought that “hell, I have a PBS box on the network that would do that fine”. So it took about 3 minutes to install the corosync-qdevice packages on all three and enable it. Good to go.

Thanks for the kick in the ass.
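For anyone following along, the QDevice setup described above might look roughly like this. It is a sketch based on the Proxmox VE documentation as best recalled: there the external host gets `corosync-qnetd` and the cluster nodes get `corosync-qdevice`, which differs slightly from the wording in the comment. The address `192.0.2.10` is a placeholder, and the function only prints the steps.

```shell
#!/bin/sh
# qdevice_plan QNETD_IP: print the setup steps; nothing is executed.
qdevice_plan() {
    cat <<EOF
# on the external host (here, the PBS box):
apt install corosync-qnetd
# on every cluster node:
apt install corosync-qdevice
# on one cluster node:
pvecm qdevice setup $1
# verify the extra vote:
pvecm status
EOF
}

# 192.0.2.10 is a placeholder address, not taken from the thread.
qdevice_plan 192.0.2.10
```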

@ikidd · 1 year ago

So since I now had a “quorate” cluster again, I thought I’d try out HA. I’d always been under the impression that unless you had a shared storage LUN, you couldn’t HA anything. But I thought I’d trigger a replication and then take down the 2nd node just as a test. And lo and behold, the first node brought up my OPNsense VM from the replicated image about 2 minutes after the second node lost contact, and the internet started working again.

I’m really excited about having that feature working now. This was a good night, thank you.
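A rough sketch of the replication-plus-HA test described above, assuming a VM with ID 100 and a second node called `pve2` (both hypothetical). The function only prints the `pvesr` and `ha-manager` commands so they can be reviewed before being run on a node.

```shell
#!/bin/sh
# ha_plan VMID TARGET_NODE: print replication and HA commands; nothing is executed.
ha_plan() {
    vmid=$1
    target=$2
    job="${vmid}-0"   # replication job IDs take the form <vmid>-<number>
    cat <<EOF
pvesr create-local-job ${job} ${target} --schedule '*/15'
pvesr schedule-now ${job}
ha-manager add vm:${vmid} --state started
ha-manager status
EOF
}

# VM ID and node name are placeholders, not taken from the thread.
ha_plan 100 pve2
```

Keep in mind that with replicated (not shared) storage, a failover starts the guest from the last replicated state, so anything written since the last sync is presumably lost.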

@[email protected] · 1 year ago

If you need another thing to do, you could try to make your OPNsense HA and never have your internet stop working while rebooting a node. It’s pretty simple to set up; you might finish it in 1-2 evenings. Happy clustering!

@ikidd · 1 year ago

I’ll look into that. I did see the option in OPNsense once upon a time but never investigated it.