Proxmox - Smartest ZFS Pool Replication Process Across Cluster?

@pr0927 · 2 months ago

Ah, I see - this is effectively the same as the first image I shared, but via shell instead of GUI, right?

For my NFS server CT, my config file is as follows currently, with bind-mounts:

arch: amd64
cores: 2
hostname: bridge
memory: 512
mp0: /spynet/NVR,mp=/mnt/NVR,replicate=0,shared=1
mp1: /holocron/Documents,mp=/mnt/Documents,replicate=0,shared=1
mp2: /holocron/Media,mp=/mnt/Media,replicate=0,shared=1
mp3: /holocron/Syncthing,mp=/mnt/Syncthing,replicate=0,shared=1
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.0.1,hwaddr=BC:24:11:62:C2:13,ip=192.168.0.82/24,type=veth
onboot: 1
ostype: debian
rootfs: ctdata:subvol-101-disk-0,size=8G
startup: order=2
swap: 512
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: a
lxc.cap.drop:

For full context, my list of ZFS pools (yes, I’m a Star Wars nerd):

NAME                                    USED  AVAIL  REFER  MOUNTPOINT
holocron                               13.1T  7.89T   163K  /holocron
holocron/Documents                     63.7G  7.89T  52.0G  /holocron/Documents
holocron/Media                         12.8T  7.89T  12.8T  /holocron/Media
holocron/Syncthing                      281G  7.89T   153G  /holocron/Syncthing
rpool                                  13.0G   202G   104K  /rpool
rpool/ROOT                             12.9G   202G    96K  /rpool/ROOT
rpool/ROOT/pve-1                       12.9G   202G  12.9G  /
rpool/data                               96K   202G    96K  /rpool/data
rpool/var-lib-vz                        104K   202G   104K  /var/lib/vz
spynet                                 1.46T  2.05T    96K  /spynet
spynet/NVR                             1.46T  2.05T  1.46T  /spynet/NVR
virtualizing                           1.20T   574G   112K  /virtualizing
virtualizing/ISOs                       620M   574G   620M  /virtualizing/ISOs
virtualizing/backup                     263G   574G   263G  /virtualizing/backup
virtualizing/ctdata                    1.71G   574G   104K  /virtualizing/ctdata
virtualizing/ctdata/subvol-100-disk-0  1.32G  6.68G  1.32G  /virtualizing/ctdata/subvol-100-disk-0
virtualizing/ctdata/subvol-101-disk-0   401M  7.61G   401M  /virtualizing/ctdata/subvol-101-disk-0
virtualizing/templates                  120M   574G   120M  /virtualizing/templates
virtualizing/vmdata                     958G   574G    96K  /virtualizing/vmdata
virtualizing/vmdata/vm-200-disk-0      3.09M   574G    88K  -
virtualizing/vmdata/vm-200-disk-1       462G   964G  72.5G  -
virtualizing/vmdata/vm-201-disk-0      3.11M   574G   108K  -
virtualizing/vmdata/vm-201-disk-1       407G   964G  17.2G  -
virtualizing/vmdata/vm-202-disk-0      3.07M   574G    76K  -
virtualizing/vmdata/vm-202-disk-1      49.2G   606G  16.7G  -
virtualizing/vmdata/vm-203-disk-0      3.11M   574G   116K  -
virtualizing/vmdata/vm-203-disk-1      39.6G   606G  7.11G  -

So you’re saying to list the relevant four ZFS datasets in there, not as bind-mounts but as virtual drives (as seen in the “rootfs” line)? Or rather, as “storage backed mount points” from here:

https://pve.proxmox.com/wiki/Linux_Container#_storage_backed_mount_points

Hopefully I’m on the right track!

@ikidd · 2 months ago

Add a fresh disk from one of your ZFS backed storages instead of a mountpoint from a directory. When I do that I get a mount like:

mp0: local-zfs:subvol-105-disk-1,mp=/mountpoint,replicate=0,size=8G

Can you cat /etc/pve/storage.cfg for me so I can see where your mp’s are coming from? I’d expect to see them show up as storagename:dataset like mine unless you’re mounting them as directory type.
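To make the difference between the two formats concrete, here is a rough Python sketch that classifies a container mount point value by the shape of its volume field. The function name `classify_mount_point` and the returned dict layout are made up for illustration; the rule it applies (a host path starting with `/` is a bind mount, `storagename:volume` is storage backed) is only a reading of the two example lines in this thread, not Proxmox's own parser.

```python
def classify_mount_point(value: str) -> dict:
    """Split an mpX value into volume + options and guess its kind.

    Hypothetical helper: the classification rule is inferred from the
    examples in this thread, not taken from Proxmox code.
    """
    volume, _, rest = value.partition(",")
    # Remaining fields look like key=value pairs (mp=..., size=..., etc.)
    options = dict(item.split("=", 1) for item in rest.split(",") if "=" in item)

    if volume.startswith("/dev/"):
        kind = "device"
    elif volume.startswith("/"):
        kind = "bind"            # host directory passed straight through
    elif ":" in volume:
        kind = "storage-backed"  # storagename:volume, managed by Proxmox
    else:
        kind = "unknown"
    return {"kind": kind, "volume": volume, "options": options}


if __name__ == "__main__":
    examples = [
        "/holocron/Media,mp=/mnt/Media,replicate=0,shared=1",
        "local-zfs:subvol-105-disk-1,mp=/mountpoint,replicate=0,size=8G",
    ]
    for line in examples:
        info = classify_mount_point(line)
        print(info["kind"], "->", info["options"].get("mp"))
```

Run against the NFS CT config above, all four mp lines would come back as bind, which matches the "showing up as directories" observation.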

@pr0927 · 2 months ago

So currently I haven’t re-added any of the data-storing ZFS pools to the Datacenter storage section (wanted to understand what I’m doing before trying anything). Right now my storage.cfg reads as follows (without having added anything):

zfspool: virtualizing
        pool virtualizing
        content images,rootdir
        mountpoint /virtualizing
        nodes chimaera,executor,lusankya
        sparse 0

zfspool: ctdata
        pool virtualizing/ctdata
        content rootdir
        mountpoint /virtualizing/ctdata
        sparse 0

zfspool: vmdata
        pool virtualizing/vmdata
        content images
        mountpoint /virtualizing/vmdata
        sparse 0

dir: ISOs
        path /virtualizing/ISOs
        content iso
        prune-backups keep-all=1
        shared 0

dir: templates
        path /virtualizing/templates
        content vztmpl
        prune-backups keep-all=1
        shared 0

dir: backup
        path /virtualizing/backup
        content backup
        prune-backups keep-all=1
        shared 0

dir: local
        path /var/lib/vz
        content snippets
        prune-backups keep-all=1
        shared 0

Under my ZFS pools (same on each node), I have the following:

The “holocron” pool is a RAIDZ1 combo of 4x8TB HDDs, “virtualizing” is RAID mirrored 2x2TB SSDs, and “spynet” is a single 4TB SSD (NVR storage).

When you say to “add a fresh disk” - you just mean to add a resource to a CT/VM, right? I trip on the terminology at times, haha. And would it be wise to add the root ZFS pool (such as “holocron”) or to add specific datasets under it (such as “Media” or “Documents”)?

I’m intending to create a test dataset under “holocron” to test this all out before I put my real data through any risk, of course.

@ikidd · 2 months ago

Yes, add a resource and specify it as the ZFS storage you want. I’m mystified why your disk resources in your NFS CT show up as directories and not zfs:dataset unless you were specifying them as the host mountpoint instead of as their ZFS dataset. I think if it’s as a directory, you need to take care of the underlying replication, and if it’s specified as a dataset, then PM will. You could probably just fix those to be ZFS specified mountpoints and you’d be fine.

I only specify one pool on a real zpool to Proxmox and work from there, I don’t make ZFS pools in PM out of the underlying datasets. I find that method confusing but I can see why you do it, and I can’t imagine it would mess anything up by doing it.

@pr0927 · 2 months ago

Oh I added the disk resources via shell (via nano) to that config for the NFS server CT, following some guide for bind-mounts. I guess that’s the wrong format and treated them like directories instead of ZFS pools?

I’ll follow the formatting you’ve used (and, I think, what results “naturally” from adding such a ZFS storage dataset via the GUI).

And yeah I don’t think replication works if it’s not ZFS, so I need to fix that.

Per your other comment - agreed regarding the snapshotting - it’s already saved me on a Home Assistant VM I have running, so I’d love to have that properly working for the actual data in the ZFS pools too.

Is it generally a best practice to only create the “root” ZFS pool and not these datasets within Proxmox (or any hypervisor)?

Thanks so much for your assistance BTW, this has all been reassuring that I’m not in some lost fool land, haha.

@ikidd · 2 months ago

IDK about “best practice”. I think what you have going works fine, it’ll just create new datasets under those datasets that you’re linking to instead of a level higher like I end up with. I just find it easier to find snapshots, etc without having to drill down. But if you have other properties you want to apply to just those datasets like compression, etc, that’s a perfectly cromulent thing to do.

That makes sense about the NFS mounts, then, because the format is what I would call a bind mount as well. I think resetting them to the actual zfs dataset will have a better result for you.

NP on assistance, it looks like you were having some issues and I’m happy to share what little I know to help. Feel free to ping me if you need more help with this or other issues.

@pr0927 · 2 months ago

It took me a while to find a moment to give it a go, but I created a test dataset under my main ZFS pool and added it to a CT - it did snapshots and replication fine.

The one question I have is - for the bind-mounts, I didn’t have any size set - and they accurately show remaining disk space for the pool they are on.

Here it seems I MUST give a size? Is that correct? I didn’t really want to allocate a smaller size for any given dataset, if possible. I saw something about storage-backed mount points, and adding them (via config file, versus GUI) and setting “size=0” - if this is of a ZFS dataset, would this “turn” it into a directory and prevent snapshotting or replicating to other nodes?
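A small worked example of why a sized mount point reports less free space than a bind mount. It assumes the size is enforced as a ZFS quota on the subvol (believed to be a refquota, but treat that as an assumption), so AVAIL is roughly the quota minus what the dataset references. The numbers come from the zfs list output earlier in the thread: subvol-101-disk-0 has size=8G in the CT config and REFER of 401M. The helper names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def avail_under_quota(quota_bytes: int, referenced_bytes: int) -> int:
    """Free space a dataset would report when capped by a quota.

    Assumes the pool itself has more free space than the quota leaves,
    which holds here (the pool shows 574G available).
    """
    return max(quota_bytes - referenced_bytes, 0)


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB, rounded the way zfs list roughly displays it."""
    return round(n_bytes / GIB, 2)


if __name__ == "__main__":
    # subvol-101-disk-0: size=8G in the CT config, REFER 401M in zfs list
    ct101 = avail_under_quota(8 * GIB, 401 * MIB)
    print(to_gib(ct101))  # 7.61, matching the AVAIL column for that subvol
```

A bind mount has no such cap, so it shows the free space of the dataset it points at, which is the behaviour described above.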

One last question - when I’m adding anything to the Datacenter’s storage section, do I want to check availability for all nodes? Does that matter?

Thanks again!

@ikidd · 2 months ago

I would try setting the size=0. AFAIK, you are expected to put a size in, but PM might be smart enough to derive the actual remaining size as correct. I’ve never come at this from behind like you are, I’ve always built my storage out directly into ZFS storage as I built the CTs.

As to the storage questions, I’ve always done it as all nodes, but I also set up and name my pools exactly the same across all nodes, they are hardware alike. I don’t know what would happen if there wasn’t a pool named that in the other node, I would guess it would just fail out and that would mess up HA and probably replication.
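For checking which storages are limited to particular nodes, a minimal Python sketch that parses the storage.cfg layout shown earlier in the thread. The function names are hypothetical and the parser only handles the simple "type: name" header plus indented "key value" lines seen above. In Proxmox, a storage entry without a nodes line is available on all nodes.

```python
def parse_storage_cfg(text: str) -> dict:
    """Parse storage.cfg text into {storage_name: {key: value}}.

    Simplified sketch: handles only the layout shown in this thread.
    """
    storages = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace():
            # Section header, e.g. "zfspool: virtualizing"
            stype, _, name = line.partition(":")
            current = {"type": stype.strip()}
            storages[name.strip()] = current
        elif current is not None:
            # Indented property, e.g. "nodes chimaera,executor,lusankya"
            key, _, value = line.strip().partition(" ")
            current[key] = value.strip()
    return storages


def allowed_nodes(storage: dict):
    """Return the set of nodes a storage is restricted to, or None for all."""
    nodes = storage.get("nodes")
    return set(nodes.split(",")) if nodes else None


SAMPLE = """\
zfspool: virtualizing
        pool virtualizing
        content images,rootdir
        nodes chimaera,executor,lusankya

zfspool: ctdata
        pool virtualizing/ctdata
        content rootdir
"""

if __name__ == "__main__":
    for name, cfg in parse_storage_cfg(SAMPLE).items():
        print(name, cfg["pool"], allowed_nodes(cfg) or "all nodes")
```

This only reports what the config claims. Whether a pool of that name actually exists on each node still has to be checked on the nodes themselves.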

@pr0927 · 2 months ago

Ah, makes sense. Will try to give this a go tomorrow when I have time/energy. Appreciate it, definitely will do!

@ikidd · 2 months ago

Another advantage to always specifying your resources as zfs storage backed is snapshotting. I have one CT docker host and one VM docker host, both ZFS backed, and when I make any major changes or upgrades, I snapshot the guest, and sometimes clone it if I want to test it before pushing it to prod. I have had to revert them because I didn’t like what I did and it’s super simple then. Never having used directory resources, I don’t know how snapshotting works but I can’t imagine it’s the same since PM isn’t triggering the snapshot mechanism in ZFS if it’s a directory mountpoint.

Backups via PBS are also snapshotted, so you don’t run the risk of having database-stored metadata out of sync with the storage.