Question - ZFS and rsync

@isles · edit-2 1 year ago

Question - ZFS and rsync

@loganb · 1 year ago

Just to make sure. Are you copying to your ZFS pool directory or a dataset? Check to male sure your paths are correct.

Push vs pull shouldn’t matter but I’ve always done push.

If your zpool is not accessible anymore after a transfer then there is a low-level problem here as it shouldn’t just disappear.

I would installe tmux on your ZFS system and have a window with htop running, dmesg, and zpool status running to check your system while you copy files. Something that severe should become self evedent pretty quickly.

@[email protected] · 1 year ago

The drives in the zpool, are they SMR drives? Slow write speed and disks dropping out are a symptom of I remember correctly

@[email protected] · 1 year ago

I don’t think they make SMR drives that big

@isles · 1 year ago

They’re Seagate Exos, https://www.seagate.com/products/cmr-smr-list/ and appear to be CMR

@[email protected] · 1 year ago

So next I’d be checking logs for sata errors, pcie errors and zfs kernel module errors. Anything that could shed light on what’s happening. If the system is locking up could it be some other part of the server with a hardware error, bad ram, out of memory, bad or full boot disk, etc.

Possibly linux · 1 year ago

Have you tried running it overnight to make sure its not just a performance thing?

@isles · 1 year ago

I did, great suggestion. It never recovered.

@s38b35M5 · 1 year ago

If you’re running TrueNAS, the replication feature was the smoothest and easiest way to move large amounts of data when I did it 18 months back. Once the destination location was accessible from the sending host, it was as simple as kicking off a snapshot, resulting in a fully usable replica on the receiving host. IIRC, IXsystems staff told me rsync can be problematic compared to the replication/snapshot system, as permissions and other metadata can be lost.

@[email protected] · 1 year ago

When things lock up, will a kill -9 kill rsync or not? If it doesn’t, and the zpool status lockup is suspicious, it means things are stuck inside a system call. I’ve seen all sorts of horrible things with usb timeouts. Check your syslog.

@isles · 1 year ago

kill -9

Just tested, thanks for the suggestion! It killed a few instances of rsync, but there are two apparently stuck open. I issued reboot and the system seemed to hang while waiting for rsync to be killed and failed to unmount the zpool.

Syslog errors:

Dec 31 16:53:34 halnas kernel: [54537.789982] #PF: error_code(0x0002) - not-present page
Jan  1 12:57:19 halnas systemd[1]: Condition check resulted in Process error reports when automatic reporting is enabled (file watch) being skipped.
Jan  1 12:57:19 halnas systemd[1]: Condition check resulted in Process error reports when automatic reporting is enabled (timer based) being skipped.
Jan  1 12:57:19 halnas kernel: [    1.119609] pcieport 0000:00:1b.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 12:57:19 halnas kernel: [    1.120020] pcieport 0000:00:1d.2: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 12:57:19 halnas kernel: [    1.120315] pcieport 0000:00:1d.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 22:59:08 halnas kernel: [    1.119415] pcieport 0000:00:1b.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 22:59:08 halnas kernel: [    1.119814] pcieport 0000:00:1d.2: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 22:59:08 halnas kernel: [    1.120112] pcieport 0000:00:1d.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Jan  1 22:59:08 halnas systemd[1]: Condition check resulted in Process error reports when automatic reporting is enabled (file watch) being skipped.
Jan  1 22:59:08 halnas systemd[1]: Condition check resulted in Process error reports when automatic reporting is enabled (timer based) being skipped.
Jan  2 02:23:18 halnas kernel: [12293.792282] gdbus[2809399]: segfault at 7ff71a8272e8 ip 00007ff7186f8045 sp 00007fffd5088de0 error 4 in libgio-2.0.so.0.7200.4[7ff718688000+111000]
Jan  2 02:23:22 halnas kernel: [12297.315463] unattended-upgr[2810494]: segfault at 7f4c1e8552e8 ip 00007f4c1c726045 sp 00007ffd1b866230 error 4 in libgio-2.0.so.0.7200.4[7f4c1c6b6000+111000]
Jan  2 03:46:29 halnas kernel: [17284.221594] #PF: error_code(0x0002) - not-present page
Jan  2 06:09:50 halnas kernel: [25885.115060] unattended-upgr[4109474]: segfault at 7faa356252e8 ip 00007faa334f6045 sp 00007ffefed011a0 error 4 in libgio-2.0.so.0.7200.4[7faa33486000+111000]
Jan  2 07:07:53 halnas kernel: [29368.241593] unattended-upgr[4109637]: segfault at 7f73f756c2e8 ip 00007f73f543d045 sp 00007ffc61f04ea0 error 4 in libgio-2.0.so.0.7200.4[7f73f53cd000+111000]
Jan  2 09:12:52 halnas kernel: [36867.632220] pool-fwupdmgr[4109819]: segfault at 7fcf244832e8 ip 00007fcf22354045 sp 00007fcf1dc00770 error 4 in libgio-2.0.so.0.7200.4[7fcf222e4000+111000]
Jan  2 12:37:50 halnas kernel: [49165.218100] #PF: error_code(0x0002) - not-present page
Jan  2 19:57:53 halnas kernel: [75568.443218] unattended-upgr[4110958]: segfault at 7fc4cab112e8 ip 00007fc4c89e2045 sp 00007fffb4ae2d90 error 4 in libgio-2.0.so.0.7200.4[7fc4c8972000+111000]
Jan  3 00:54:51 halnas snapd[1367]: stateengine.go:149: state ensure error: Post "https://api.snapcraft.io/v2/snaps/refresh": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

@LufyCZ · 1 year ago

One thing I haven’t seen mentioned here, zfs can be quite finicky with some sata cards, especially raid cards.

I suggest you connect the hard drives to the motherboard directly and test again.

@isles · 1 year ago

Thank you! I ended up connecting them directly to the main board and had the same result with rsync, eventually the zpool becomes inaccessible until reboot (ofc there may be other ways to recover it without reboot).

Lemmy Tagginator · 1 year ago

deleted by creator

SayCyberOnceMore · 1 year ago

I don’t have practical experience with ZFS, but my understanding is that it uses RAM a lot… if that’s new, it might be worth checking the RAM by booting up memtest (for example) and just ruling that out.

Maybe also worth watching the system with nmon or htop (running in another tmux / screen pane) at the beginning of the next session, then when you think it’s jammed up, see what looks different…

@isles · edit-2 1 year ago

Awesome, thanks for giving some clues. It’s a new build, but I didn’t focus hugely on RAM, I think it’s only 32GB. I’ll try this out.

Edit: I did some reading about L2ARC, so pending some of these tests, I’m planning to get up to 64gb ram and then extend with an l2arc SSD, assuming no other hardware errors.

@[email protected] · edit-2 1 year ago

Based on this thread it’s the deduplication that requires a lot of RAM.

See also: https://wiki.freebsd.org/ZFSTuningGuide

Edit: from my understand the pool shouldn’t become inaccessible tho and only get slow. So there might be another issue.

Edit2: here’s a guide to check whether your system is limited by zfs’ memory consumption: https://github.com/openzfs/zfs/issues/10251

SayCyberOnceMore · 1 year ago

Just another thought… Maybe just format the drives as a massive EXT4 JBOD (just for a temp test) and copy the data again - just to see if ZFS is the problem… maybe it’s something else altogether? Maybe - and I hope not - the USB source drive is failing after long reads?

@isles · 1 year ago

I believe there’s another issue. ZFS has been using nearly all RAM (which is fine, I only need RAM for system and ZFS anyway, there’s nothing else running on this box), but I was pretty convinced while I was looking that I don’t have dedup turned on. Thanks for your suggestions and links!