EDIT: this is a full benchmark I run on my pool: https://gist.github.com/thegabriele97/9d82ddfbf0f4ec00dbcebc4d6cda29b3.

Hi! I ran into this issue since I started mu homelab adventure a couple of months ago, so I am still very noob, sorry for this.

I decided today to understand what happens and why it happens but I need your help to understand it better.

My homelab consists of a proxmox setup with three 1 TB HDD s in raidz1 (ZFS) (I know the downsides of this and I took my decisions) and 8 GB of RAM, of which 3.5 are assigned to a VM. The remaining parts are used by some LXC containers.

During high worloads (i.e. copying a file, downloading something via torrent/jdownloader) everything is very slow and other services start to be unresponsive due to the high IO delay.

I decided to test the three single devices with this command: fio --ioengine=libaio --filename=/dev/sda --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randread --bs=4k --numjobs=32

And more or less they (sda, sdb, sdc) give this results:

Jobs: 32 (f=32): [r(32)][100.0%][r=436KiB/s][r=109 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=3350293: Sat Jun 24 11:07:02 2023
  read: IOPS=119, BW=479KiB/s (490kB/s)(4968KiB/10378msec)
    slat (nsec): min=4410, max=40660, avg=12374.56, stdev=5066.56
    clat (msec): min=17, max=780, avg=260.78, stdev=132.27
     lat (msec): min=17, max=780, avg=260.79, stdev=132.27
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   50], 10.00th=[   80], 20.00th=[  140],
     | 30.00th=[  188], 40.00th=[  230], 50.00th=[  264], 60.00th=[  296],
     | 70.00th=[  326], 80.00th=[  372], 90.00th=[  430], 95.00th=[  477],
     | 99.00th=[  617], 99.50th=[  634], 99.90th=[  768], 99.95th=[  785],
     | 99.99th=[  785]
   bw (  KiB/s): min=  256, max=  904, per=100.00%, avg=484.71, stdev= 6.17, samples=639
   iops        : min=   64, max=  226, avg=121.14, stdev= 1.54, samples=639
  lat (msec)   : 20=0.32%, 50=4.91%, 100=8.13%, 250=32.85%, 500=49.68%
  lat (msec)   : 750=3.86%, 1000=0.24%
  cpu          : usr=0.01%, sys=0.00%, ctx=1246, majf=11, minf=562
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1242,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=479KiB/s (490kB/s), 479KiB/s-479KiB/s (490kB/s-490kB/s), io=4968KiB (5087kB), run=10378-10378msec

Disk stats (read/write):
  sda: ios=1470/89, merge=6/7, ticks=385624/14369, in_queue=405546, util=96.66%

Am I wrong or it is a very bad results? Why? The three identical HDs are this one: https://smarthdd.com/database/APPLE-HDD-HTS541010A9E662/JA0AB560/

I jope you can help me. Thank you!

  • @[email protected]
    link
    fedilink
    English
    211 months ago

    I don’t know about fio but I normally use iotop command to identify the process that is doing too much I/O operations

  • Terrasque
    link
    fedilink
    English
    111 months ago

    I have a zfs raid1 with 5 disks, and had some very bad performance. I used atop to figure out that one disk was the problem. I replaced that disk, resynced, and now performance is as expected.

  • terribleplan
    link
    fedilink
    English
    111 months ago

    At one point I was having intermittent performance issues with my pool, and the issue turned out to be scrubs being too aggressive (even though most all the documentation I read said scrubs should not adversely impact user I/O, they totally did)

  • Qazwsxedcrfv000
    link
    fedilink
    English
    1
    edit-2
    11 months ago

    What record size have you set for your dataset? If you are not doing a lot of small writes or you can tolerate the fragmentation, better set it to 1M.

    Also,

    …8 GB of RAM, of which 3.5 are assigned to a VM…

    Default ZFS installation reserves half of the total system memory for its ARC. In your case that means 4GB. And your VM is taking 3.5GB. Are you running anything else? Also is the assignment to VM dynamic? ZFS will release portion of the reserved RAM when the overall demand gets stringent. And that will have adverse impact on read performance.