#monitoring trident operation duration

tardy breach

Hi!
I'm trying to configure Datadog monitoring for the average duration of some Trident operations, such as volume_add and volume delete.
Found this in the docs:

Average duration in ms of operations performed by Astra Trident

sum by (operation) (trident_operation_duration_milliseconds_sum{success="true"}) / sum by (operation) (trident_operation_duration_milliseconds_count{success="true"})

So I tried to do something similar in Datadog, for example for the volume_add operation (dividing by 1000 to convert ms to s):

sum:trident.trident_operation_duration_milliseconds.sum{operation:volume_add,success:true} by {operation} / sum:trident.trident_operation_duration_milliseconds.count{operation:volume_add,success:true} by {operation}.as_count() / 1000

But I get results in the thousands, like 8k. That can't be right.
Would anyone know what I'm doing wrong?
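
For reference, the docs formula quoted above is just "total duration divided by number of operations". A minimal Python sketch of that arithmetic follows, assuming the `_sum` and `_count` series are cumulative totals, as is usual for Prometheus-style summaries and histograms. The sample values and function names are invented for illustration, and how Datadog actually ingests these two series (gauge vs. count) is not confirmed anywhere in this thread.

```python
# Hypothetical cumulative samples for one operation (values invented for illustration).
# Each tuple is (duration_ms_sum, operation_count) as scraped at two points in time.
earlier = (120_000.0, 40)
later = (150_000.0, 50)


def lifetime_avg_seconds(sample):
    """Average duration since the counters started: sum / count, converted ms -> s."""
    total_ms, count = sample
    return total_ms / count / 1000.0


def window_avg_seconds(start, end):
    """Average duration of only the operations that happened between two scrapes."""
    delta_ms = end[0] - start[0]
    delta_count = end[1] - start[1]
    if delta_count == 0:
        return None  # no operations completed in this window
    return delta_ms / delta_count / 1000.0


print(lifetime_avg_seconds(later))         # 3.0
print(window_avg_seconds(earlier, later))  # 3.0
```

If both series really are cumulative, the ratio only makes sense when numerator and denominator are treated the same way (both raw totals, or both deltas over the same window). Mixing the two is one possible source of inflated values.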

rich drum

I asked the dev team and one of them suggested:
avg:trident.trident_operation_duration_milliseconds.sum{operation:volume_add,success:true} by {operation} / 1000

tardy breach

Hey Scott! Thanks for replying. I tried that, but I still get values over 1k, which would mean volume_add operations take over 16 minutes.
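
A quick check of that arithmetic, plus a sketch of why the suggested query can still look too large: it divides the `.sum` series by 1000 but never by the operation count. The numbers below are invented, and the code assumes `.sum` holds a running total of milliseconds, which is not confirmed in this thread.

```python
def seconds_to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60.0


# 1k seconds expressed in minutes.
print(round(seconds_to_minutes(1000), 1))  # 16.7

# Hypothetical running totals (invented for illustration).
total_ms = 8_000_000.0  # value of the .sum series
count = 2_000           # value of the .count series

# Dividing only by 1000 gives total seconds across all operations.
total_seconds = total_ms / 1000.0
# Dividing by the count as well gives seconds per operation.
per_operation_seconds = total_ms / count / 1000.0

print(total_seconds)          # 8000.0
print(per_operation_seconds)  # 4.0
```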

rich drum

Can you look at the underlying data and see if there is an abnormally large value (millions of milliseconds) that would be throwing your average off?
I'm imagining a volume add operation that needed manual intervention, was left overnight before someone fixed it, and eventually succeeded.
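
To illustrate how much a single stuck operation can move the mean, here is a small sketch with invented numbers: 99 operations of 2 seconds each and one that took 12 hours before succeeding. None of these values come from real Trident data.

```python
from statistics import mean, median

# Hypothetical durations in milliseconds (invented for illustration).
typical_ms = [2_000.0] * 99            # 99 normal operations, 2 s each
stuck_ms = 12 * 60 * 60 * 1000.0       # one operation left for 12 hours
durations_ms = typical_ms + [stuck_ms]

avg_seconds = mean(durations_ms) / 1000.0
median_seconds = median(durations_ms) / 1000.0

print(avg_seconds)     # about 434 s, dominated by the single outlier
print(median_seconds)  # 2.0 s, unaffected by the outlier
```

In this made-up case one outlier lifts the average from 2 seconds to roughly 7 minutes, so comparing the average against a median or a maximum over the same period is a cheap way to test the outlier theory.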