Looking for thoughts/opinions

I have a 5-disc raidz1 array. The drives are accumulating CKSUM errors, fairly evenly distributed across the discs. I’ve been lazy and let this progress to the point where there are permanent errors in files.

# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /you/do/not/need/this/level of detail.txt
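
For anyone who wants to check how evenly the errors are spread without eyeballing the table, here is a rough sketch that totals the CKSUM column from `zpool status` text. The function name `cksum_summary` is made up for illustration, and it assumes the leaf devices are named `ata-...` and that the counts are plain integers, as in the output above.

```shell
# Summarise per-device CKSUM counts from `zpool status` text on stdin.
# Assumes the NAME STATE READ WRITE CKSUM column layout shown above.
cksum_summary() {
    awk '
        # Leaf devices in this pool are named ata-...; adjust the pattern
        # for pools that use other device naming schemes.
        $1 ~ /^ata-/ && NF >= 5 {
            printf "%s %s\n", $1, $5
            total += $5
            n++
        }
        END { if (n) printf "total %d across %d devices\n", total, n }
    '
}
```

Usage would be something like `zpool status tank | cksum_summary`. For the table above, the last line would read `total 34 across 5 devices`.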

I’ve done some research and believe (hope) that the cause of these errors is the “domestic” onboard SATA controllers I’m using, and I have ordered an LSI SAS3008 9300-8i HBA as an upgrade.

I know I can fix the permanent error by deleting the affected file, restoring it, and then running a scrub. But I’m torn: should I scrub now and risk stressing the pool further on the crappy SATA controllers, or wait until the new HBA arrives (in a few weeks - free/cheap, slow shipping)?
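
For reference, a minimal sketch of the scrub-and-check sequence I have in mind, to be run after the damaged file has been deleted and restored. The helper names (`run`, `start_scrub`, `clear_counters`) are hypothetical, and the pool name `tank` is taken from the status output above. It defaults to a dry run that only prints the commands.

```shell
# Dry run by default: prints each command instead of executing it.
# Set DRY_RUN=0 to run for real (needs root and the ZFS utilities).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start a scrub and show the current state, including any damaged files.
# `zpool scrub` returns immediately; the scrub continues in the background.
start_scrub() {
    pool="$1"
    run zpool scrub "$pool"
    run zpool status -v "$pool"
}

# Reset the READ/WRITE/CKSUM counters. Only sensible once the scrub has
# finished and `zpool status -v` no longer reports problems.
clear_counters() {
    pool="$1"
    run zpool clear "$pool"
}
```

Calling `start_scrub tank` prints the two commands it would run. Keeping `clear_counters` separate is deliberate, so the counters are not wiped before the scrub results have been read.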

  • @spitfire
    11 months ago

    I’d shut it down before more data gets corrupted, replace the HBA when it arrives, and then run a scrub to see what the damage is.

    • Great Blue Heron (OP)
      11 months ago

      I know that’s the correct response. But, it’s been running like this for many months, maybe even years - as I said in the post, I’ve been lazy. There’s nothing on it that can’t easily be restored, or replaced, and shutting it down would be a PITA.

      • @spitfire
        11 months ago

        There’s always a chance your backups will get corrupted too if you let it continue like that.