#Can some experts please explain what this Prometheus graph is all about?

deep mirage
#

Which network are these packets being transmitted on: the client network or the grid network?
What issue could this indicate?
We have an SG that is 99% full, and half of the nodes in it have become READ-ONLY. Could this graph be a result of the fullness? How would that be explained?

Thank you for your input!

empty crystal
#

You should be able to see more if you just log in to the SG admin GUI. It will most likely tell you why the nodes are read-only, and you should also be able to see how full they are. The graph doesn't make sense to me. It looks like you are looking at some TCP traffic counters, but without any context it doesn't make sense.

deep mirage
#

This is the so-called "TCP Retransmission Rates" graph. For privacy, I removed the legends, which list all the SG internal storage nodes. I believe it shows retransmissions between nodes in SG for replication/resync/resend: because some nodes are RO, the others have to retry, and that causes the high retransmission rates. I just want to confirm with you whether what I am saying is true.

real marsh
#

TCP retransmissions do not occur because of some logic in the software (like "oh, this node is read only, try again with some other node")

they occur because packets are lost/dropped/corrupted on the network
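
To make the distinction concrete, here is a minimal sketch of where a "TCP retransmission" number typically comes from on a Linux host. It assumes the nodes expose the kernel's `/proc/net/snmp` counters (`OutSegs`, `RetransSegs`); whether the SG graph is built from exactly these counters is an assumption, not something confirmed in this thread. The sample text and the function names are made up for illustration.

```python
# Hypothetical samples in the /proc/net/snmp "Tcp:" format (header line + value line).
SNMP_HEADER = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors"
)
SAMPLE_BEFORE = SNMP_HEADER + "\nTcp: 1 200 120000 -1 100 50 2 3 10 10000 9000 45 0 5 0\n"
SAMPLE_AFTER = SNMP_HEADER + "\nTcp: 1 200 120000 -1 100 50 2 3 10 11200 10000 95 0 5 0\n"


def parse_tcp_counters(snmp_text):
    """Return the kernel TCP counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("expected a 'Tcp:' header line and a 'Tcp:' value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmission_ratio(before, after):
    """Retransmitted segments per segment sent between two counter samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


before = parse_tcp_counters(SAMPLE_BEFORE)
after = parse_tcp_counters(SAMPLE_AFTER)
ratio = retransmission_ratio(before, after)
```

These counters are maintained by the kernel's TCP stack for the whole host (network namespace). They only increase when the stack resends a segment it believes was lost, and on their own they do not say which interface or network the resent segments used.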

deep mirage
#

Is there any proof that there won't be any retrying, resync, etc. when multiple nodes are forced to RO?

Multiple nodes going RO is only seen when SG is 99% full.

There is a minimum number of writable nodes required, I believe 2. If the grid drops to that number, the trouble starts: users cannot write anymore, and at that point it slows down the entire ONTAP cluster, or at least the aggregates using FabricPool.

real marsh
#

if this is indeed "TCP retransmissions", then this is all part of the TCP protocol; the application layer has no visibility into those

#

I didn't say that the nodes will not retry at some level when they are full. I just said that TCP retransmissions are not that kind of retry
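
As a contrast, this is a rough sketch of what an application-level retry looks like. Everything in it is hypothetical: the node names, the `fake_send` callable, and the idea that a read-only node answers a write with an HTTP-style 5xx status are assumptions for illustration, not documented SG behaviour.

```python
def put_with_retry(nodes, send):
    """Try each node in turn; return (chosen_node, attempts).

    `send(node)` is a hypothetical callable returning an HTTP-like status code.
    """
    attempts = []
    for node in nodes:
        status = send(node)
        attempts.append((node, status))
        if status < 500:
            # The write was accepted, so stop retrying.
            return node, attempts
    # Every node rejected the write.
    return None, attempts


# Hypothetical grid: two read-only nodes reject writes, one accepts them.
READ_ONLY = {"node-a", "node-b"}


def fake_send(node):
    return 500 if node in READ_ONLY else 200


chosen, attempts = put_with_retry(["node-a", "node-b", "node-c"], fake_send)
```

Each attempt here is a new request that the application itself issues and can count. The kernel treats it as ordinary new traffic, so it does not add to the TCP retransmission counter unless segments are actually lost on the network.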

deep mirage
#

It's not clear under what circumstances the retransmissions would happen. But, to your point, why does it happen only when multiple nodes are forced to RO or SG is 99% full, and 500 error codes start to increase? If it were packet drops, it could happen at any time.

Ultimately, my question is: is this "TCP Retransmission Rates" graph under Support-Diagnostics only about the layer 2 network (the internal grid network inside SG), or is it about networks outside of, or interfacing with, the SG networks?

It doesn't explicitly say either way. It looks like a layer 2 network to me.

real canyon
#

TCP sits above layer 2 (it is layer 4, carried over IP at layer 3), irrespective of client or grid network.