Hello everyone,
I'm trying to narrow down some complaints from my colleagues regarding slow performance between our virtualized environment and our NetApp storage.
We're on vSphere 7.0.3 and have two AFF-A400 systems in two datacenters, running active-active. There are around 15 ESXi hosts per DC, connected via Brocade FC switches. I went through the NetApp best practices and applied the recommended parameters.
Our devs and SysOps are complaining about low performance on their Postgres VMs (running RHEL 7, 8 and 9). For example, one Postgres VM shows 25% iowait. One of them wrote a fio test which yields 600-700 IOPS.
Now I played with fio today and the results are quite different depending on the parameters (obviously).
The one the SysOps wrote (resulting in ~700 IOPS):
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randrw --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=1 --runtime=60 --group_reporting
The one I wrote (resulting in around 22000 IOPS):
fio --name=random-write --ioengine=libaio --rw=randwrite --bs=4k --size=2g --numjobs=1 --iodepth=64 --runtime=60 --time_based --end_fsync=1 --direct=1 --fsync=1
(I know they are very different - just showing the problem of lacking a proper benchmark)
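For reference, here is a rough back-of-the-envelope check of what the two results imply per I/O, using Little's law (average latency = outstanding I/Os / IOPS). It is only a sketch under assumptions: the function name is made up for illustration, the ~700 and ~22000 figures are taken as total IOPS, and the queue depth is taken at face value from --iodepth. Since --fsync=1 may keep the second job from ever holding 64 I/Os in flight, its number is best read as an upper bound.

```python
def per_io_latency_ms(iops: float, queue_depth: int) -> float:
    """Average time per I/O in milliseconds, from Little's law.

    Little's law: outstanding I/Os = IOPS * latency,
    so latency = outstanding I/Os / IOPS.
    """
    return queue_depth / iops * 1000.0


# SysOps job: --iodepth=1, roughly 700 IOPS (assumed to be the total)
sysops_ms = per_io_latency_ms(700, 1)

# My job: --iodepth=64, roughly 22000 IOPS (assumes the queue really stays full)
mine_ms = per_io_latency_ms(22000, 64)

print(f"iodepth=1,  ~700 IOPS   -> about {sysops_ms:.2f} ms per I/O")
print(f"iodepth=64, ~22000 IOPS -> at most about {mine_ms:.2f} ms per I/O")
```

Read this way, the 700 IOPS job is mostly a measurement of round-trip latency for a single synchronous 4k I/O plus its fsync, while the 22000 IOPS job measures how much the path can do with many I/Os in flight.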
My first question is whether someone can recommend fio settings that are reasonably realistic for typical scenarios.
Second question is where I can check for bottlenecks or issues. ONTAP shows no issues or high load. The ESXi hosts look fine too, and the switches don't report any trouble either.
Last question is what IOPS our NetApp should be able to reach on a regular basis, just so I know roughly what the achievable maximum is.
If you need more details just let me know. Any hint is highly appreciated!