Seeking advice about BTRFS RAID

@[email protected] · 1 year ago

poVoq · edit-2 1 year ago

How so, if the second drive in the raid1 retains a working copy and the checksum is correct? I have had USB drives drop out on me for longer periods before, and it was never a problem after reconnecting them and running a scrub.

But of course raid is not a backup, so that is only the first line of defense against data loss 😉

Atemu · 1 year ago

The problem is on the logic level. What happens when a drive drops out but the other does not? Well, it will continue to receive writes because a setup like this is tolerant to such a fault.

Now imagine both connections are flaky and the currently available drive drops out as well. Our setup isn’t that fault tolerant, so the FS goes read-only and throws IO errors on read.
But, as the sysadmin takes a look, the drive that dropped out first re-appears, so they mount the filesystem again from that drive and continue the workload.

Now we have a split brain. The drive that dropped out first missed the changes that happened to the other drive. When the other drive comes back, they’ll have diverged states. Btrfs can’t merge these.
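
To make the sequence concrete, here is a toy simulation of it. This is only a sketch and not how btrfs is implemented: the `Mirror` class, the `write_all` helper and the per-drive `generation` counter are all made up for illustration. It shows that after both drives have each taken writes alone, a simple counter can look identical on both sides while the contents differ.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    """Toy stand-in for one drive of a two-drive mirror (hypothetical, not btrfs code)."""
    name: str
    online: bool = True
    generation: int = 0  # bumped on every write this drive receives
    blocks: dict = field(default_factory=dict)

    def write(self, key, value):
        self.blocks[key] = value
        self.generation += 1


def write_all(mirrors, key, value):
    """Write to every online mirror; fail if none is reachable."""
    targets = [m for m in mirrors if m.online]
    if not targets:
        raise IOError("no mirror online, filesystem would go read-only")
    for m in targets:
        m.write(key, value)


a, b = Mirror("A"), Mirror("B")
write_all([a, b], "file", "v1")      # both drives in sync

a.online = False                     # A drops out
write_all([a, b], "file", "v2")      # only B receives this write

b.online = False                     # B drops out as well
a.online = True                      # A re-appears and is mounted alone
write_all([a, b], "file", "v2-alt")  # only A receives this write

b.online = True                      # B comes back
diverged = a.generation == b.generation and a.blocks != b.blocks
print(a.generation, b.generation, diverged)
```

In this toy model neither side is simply "behind" the other, so there is no obvious winner to copy from. That is the situation a scrub can’t fix.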

That’s just one possible way this can go wrong. A simpler one I alluded to is a lost write: a drive claims to have permanently written something, but if power is cut at that moment and the same sector is read after restart, it does not actually contain the new data. If that happens to all copies of a metadata chunk, goodbye btrfs.
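
The lost-write case can be sketched the same way. Again, this is a toy model under my own assumptions: `ToyDrive`, its `honest` flag and `surviving_copies` are hypothetical names, and real drive caches are more complicated. The point is only that an acknowledged flush says nothing if the drive did not actually persist the data.

```python
class ToyDrive:
    """Toy drive with a volatile write cache (hypothetical model)."""

    def __init__(self, honest):
        self.honest = honest   # False: acknowledges flushes without persisting
        self.persistent = {}   # survives a power cut
        self.cache = {}        # lost on a power cut

    def write(self, sector, data):
        self.cache[sector] = data

    def flush(self):
        if self.honest:
            self.persistent.update(self.cache)
            self.cache.clear()
        return True            # always claims success

    def power_cut(self):
        self.cache.clear()

    def read(self, sector):
        return self.cache.get(sector, self.persistent.get(sector))


def surviving_copies(drives, sector, expected):
    """Count how many drives still return the expected data."""
    return sum(1 for d in drives if d.read(sector) == expected)


def run(drives):
    for d in drives:
        d.persistent[0] = "old metadata"   # state before the update
        d.write(0, "new metadata")
        assert d.flush()                   # every drive reports success
        d.power_cut()
    return surviving_copies(drives, 0, "new metadata")


one_honest = run([ToyDrive(honest=True), ToyDrive(honest=False)])
none_honest = run([ToyDrive(honest=False), ToyDrive(honest=False)])
print(one_honest, none_honest)
```

With one honest drive a good copy survives and the mirror can repair the other. With none, every copy silently holds the old data even though all writes were acknowledged.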