Looking for thoughts/opinions

I have a 5-disc raidz1 array. The discs are accumulating CKSUM errors, fairly evenly distributed across all five. I’ve been lazy and let this progress to the point where there are permanent errors in files.

# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /you/do/not/need/this/level of detail.txt
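
For anyone who wants to track this over time, here is a rough sketch of how the per-disc CKSUM counts could be totalled from the status output. The function name `cksum_total` is made up for illustration, and it assumes the leaf devices are listed with names starting with `ata-`, as they are above.

```shell
# Hypothetical helper: sum the CKSUM column (5th field) for leaf devices.
# Assumes device names start with "ata-", as in the output above.
cksum_total() {
    awk '$1 ~ /^ata-/ { sum += $5 } END { print sum + 0 }'
}
```

It would be used as `zpool status tank | cksum_total`.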

I’ve done some research and believe (hope) that the cause of these errors is the “domestic” onboard SATA controllers I’m using, and I have ordered an LSI SAS3008 9300-8i HBA as an upgrade.

I know I can fix the permanent error by deleting the affected file, restoring it from backup, and then running a scrub. But I’m torn: should I scrub now and risk stressing the array further on the crappy SATA controllers, or wait until I get the new HBA (in a few weeks, thanks to free, cheap, slow shipping)?
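
For reference, the sequence I have in mind once the file is restored looks roughly like this. It is a sketch, not a tested procedure: the pool name comes from the output above, the function name `rescrub_after_restore` is made up, and the restore step itself depends on the backup tool, so it is left out.

```shell
# Sketch only: POOL defaults to the pool name shown in the status output.
POOL="${POOL:-tank}"

rescrub_after_restore() {
    # Assumes the damaged file has already been deleted and restored from backup.
    # Reset the per-device error counters so any new errors stand out.
    zpool clear "$POOL" || return 1
    # Start a scrub; it runs in the background.
    zpool scrub "$POOL" || return 1
    # Show progress and any files still flagged with permanent errors.
    zpool status -v "$POOL"
}
```

If the scrub looks like it is making things worse on the current controllers, `zpool scrub -p tank` pauses it and `zpool scrub -s tank` stops it.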

  • @spitfire
    11 months ago

    There’s always a chance your backups might get corrupted too if you let it continue like that.