#What is the expected I/O latency to FabricPool StorageGrid?


twin vessel
#

The VLAN for writing data to SG is based on a 4x40G LACP group. Because the data is non-critical and we don't want it to stay on the primary storage, we used "all" as the tiering policy.

We used the command qos statistics volume latency show -volume <volume_name> -vserver svm to measure the latency from "cloud" (StorageGrid), and the results show values as high as 227ms.

In addition, we also found the message below in the event log:

NOTICE wafl.ca.latency.threshold: Measured latency from cloud (50287 msec) more than threshold (5000 msec) (aggregate: "cluster_01_ssd_aggr1", aggregate uuid: "29c5faef-5afd-4af5-901a-208e757e8895", object store: "storagegrid_1") 2/13/2025 06:01:32 eos-08
DEBUG ems.engine.suppressed: Event wafl.ca.latency threshold suppressed 2377 times in last 60 seconds.

As I understand it, with tiering-policy = "all", data is written to SG directly without buffering in primary storage, so it is expected to be slow. But 5000 msec sounds too slow. What is the expected latency for tiering policy = "all"?
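For reading the event message quoted above: the 5000 msec figure is the alert threshold, while the measured value is 50287 msec. A minimal Python sketch that pulls both numbers out of the message text follows. The regular expression only matches the wording seen in this thread (an assumption; other ONTAP releases may phrase the event differently), and `parse_latency_event` is a hypothetical helper name.

```python
import re

# Matches the wafl.ca.latency.threshold wording quoted in this thread.
# Assumption: other releases may word the event differently.
LATENCY_EVENT = re.compile(
    r"Measured latency from cloud \((\d+) msec\) more than threshold \((\d+) msec\)"
)


def parse_latency_event(message: str):
    """Return (measured_ms, threshold_ms), or None if the text does not match."""
    match = LATENCY_EVENT.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "NOTICE wafl.ca.latency.threshold: Measured latency from cloud (50287 msec) "
    "more than threshold (5000 msec)"
)
measured_ms, threshold_ms = parse_latency_event(line)
print(
    f"measured {measured_ms / 1000:.1f} s, threshold {threshold_ms / 1000:.1f} s, "
    f"ratio {measured_ms / threshold_ms:.1f}x"
)
# prints: measured 50.3 s, threshold 5.0 s, ratio 10.1x
```

So the event reports roughly 50 seconds from the object store, about ten times the threshold.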

cunning plover
#

Hard to quantify but I certainly would expect less than the 5 seconds that triggers the ONTAP event. That usually indicates something is wrong.

There is a test you can run from ONTAP, in advanced mode, to test your FabricPool setup. It will write 1000 4k objects then read them back using different byte-range-read sizes as ONTAP would for FP data.

gemini::*> aggr object-store profiler start ?
[-node] <nodename> *Node on Which the Profiler Should Run
[-object-store-name] <text> *Object Store Configuration Name
[ -object-prefix <text> ] *Prefix Added to Each Object

Probably worth giving that a try.
Then if that also shows high latency you need to try and figure out why.
Networking?
Overloaded Grid node(s)?
Etc...

twin vessel
#

Forgot to mention that I did run the command. The following is the output on this node. I don't know how to read the data. Do you see any issues?

Object store config name: storagegrid_1
Node name: cluster-01
Status: Active. Waiting for outstanding PUTs to finish
Start time: 2/15/2025 08:16:08
Object name prefix: 04e4d807-eb9f-11ef-b640-00a098ae2b06
Object prefix:

Op   Size   Total  Failed  Latency(ms)          Throughput
                           min  max    avg

PUT  4MB    2476   0       52   71474  3687     133.2MB
GET  4KB    0      0       0    0      0        0B
GET  8KB    0      0       0    0      0        0B
GET  32KB   0      0       0    0      0        0B
GET  256KB  0      0       0    0      0        0B

indigo marsh
#

PUT latency is irrelevant since the PUTs happen in the background anyway

#

And depending on the grid load etc., latency can be rather large, but it won't impact client access. Check the 4KB GET latencies, which are probably the most important metric.

twin vessel
#

The updated output is below. The latency on 4k is 2 - 10017 ms, avg 38ms, which seems high, right?
Why would you want to check GET? I thought we should find out the latency on writing. Please don't forget the numbers I put in my original message.
Object store config name: storagegrid_1
Node name: cluster-01
Status: Done
Start time: 2/15/2025 08:16:08
Object name prefix: 04e4d807-eb9f-11ef-b640-00a098ae2b06
Object prefix:

Op   Size   Total   Failed  Latency(ms)           Throughput
                            min  max     avg

PUT  4MB    2500    0       52   100682  4596     95.51MB
GET  4KB    131932  6       2    10017   38       9.88MB
GET  8KB    97219   18      2    10023   61       14.47MB
GET  32KB   100453  18      3    10021   66       57.40MB
GET  256KB  25328   48      4    10023   260      120.7MB

#

What about the 5 sec latency in the event log, and in the output of qos statistics volume latency show -volume <volume_name> -vserver svm in my original message?

cunning plover
#

Your PUTs saw as high as 100 sec latency. GET 10 sec (are those the timeouts for the profiler?)

#

And you have errors, never a good sign
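To put the error counts and maximum latencies from the finished profiler run in perspective, here is a small Python sketch that computes the failure rate per operation and how far the worst case sits from the average. The rows are copied from the output above; the helper names `failure_rate_pct` and `skew` are made up for illustration.

```python
# Rows copied from the finished profiler run in this thread:
# (op, size, total, failed, min_ms, max_ms, avg_ms)
ROWS = [
    ("PUT", "4MB", 2500, 0, 52, 100682, 4596),
    ("GET", "4KB", 131932, 6, 2, 10017, 38),
    ("GET", "8KB", 97219, 18, 2, 10023, 61),
    ("GET", "32KB", 100453, 18, 3, 10021, 66),
    ("GET", "256KB", 25328, 48, 4, 10023, 260),
]


def failure_rate_pct(total: int, failed: int) -> float:
    """Failed operations as a percentage of all operations of that type."""
    return 100.0 * failed / total if total else 0.0


def skew(max_ms: int, avg_ms: int) -> float:
    """How many times larger the worst-case latency is than the average."""
    return max_ms / avg_ms if avg_ms else float("inf")


for op, size, total, failed, min_ms, max_ms, avg_ms in ROWS:
    print(
        f"{op} {size:>5}: {failure_rate_pct(total, failed):.4f}% failed, "
        f"max is {skew(max_ms, avg_ms):.0f}x the average"
    )
```

The failure rates are small in relative terms, but every GET maximum sits just above 10,000 ms. That pattern would be consistent with a timeout of about 10 seconds, although the output alone does not confirm it.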

twin vessel
#

I believe those are accumulated data; it can't be this high. If it is this high, then we are in a terrible situation.

cunning plover
#

The profiler only reports numbers from the last run of it

#

So nothing accumulated there

#

What does the grid consist of? How is load balancing done?

#

I have been running iperf3 on SG nodes and then "network test-link" from ONTAP to test network throughput. There might be a firewall trick on the SG node involved but can't recall

twin vessel
#

It consists of 8 nodes. Not sure about your question on load balancing, but the load balancer is a VM.

#

What is iperf3?

cunning plover
#

Comes bundled on the SG nodes

#

And implemented in ONTAP as "network test-link"

#

With that you can do network testing memory to memory so excluding storage

indigo marsh
#

Is the StorageGRID a SATA grid? What is the consistency level of your bucket?

arctic fossil
#

We had CPU ready issues cause latency for some vm load balancers. Something to look at though

indigo marsh
# twin vessel It is NL-SAS. Not sure of what consistency level is..

The consistency level should be "read-after-new-write". If it is set to anything different, this could explain the errors (and maybe even the high latency, but then again I'm no SG expert; I only know that anything other than read-after-new-write is bad for FabricPool 😄)

arctic fossil
# twin vessel This is the CPU ready graph on the Load Balancer in period of Last Week. It has ...

Seems very high; but this looks like a summation; you can break it down per core or convert to % to get a better picture.

But without knowing the environment, the remedy ranges from moving resource-contending workloads off the host, or moving the LB VM to a less busy host, to right-sizing VMs.

What CPU ready % is acceptable depends on workloads and environment, but at around 5-10% CPU ready you might start seeing issues.

https://knowledge.broadcom.com/external/article/306576/converting-between-cpu-summation-and-cpu.html
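The conversion described in the linked article can be sketched in Python as follows. The formula is summation (ms) divided by the chart's update interval (ms), times 100. The interval table reflects the default vSphere chart intervals as I recall them from that article, so verify them against your vCenter. Dividing by the vCPU count to get a per-vCPU average is an assumption about how the VM-level summation is aggregated, and the sample reading is hypothetical, not taken from the graph in this thread.

```python
# Default chart update intervals in seconds (assumed from the linked article;
# verify against your own vCenter settings).
CHART_INTERVAL_S = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def cpu_ready_percent(summation_ms: float, chart: str, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) to a percentage of the sample interval.

    Dividing by the vCPU count gives a per-vCPU average, on the assumption
    that the VM-level summation adds up ready time across all vCPUs.
    """
    interval_ms = CHART_INTERVAL_S[chart] * 1000
    return summation_ms / interval_ms / vcpus * 100


# Hypothetical reading: 360000 ms on a past-week chart for a 4-vCPU VM.
print(f"{cpu_ready_percent(360000, 'past_week', vcpus=4):.1f}%")
# prints: 5.0%
```

A value that looks alarming as a raw summation on a weekly chart can therefore be modest per vCPU, or the reverse, which is why the conversion is worth doing before drawing conclusions.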