#Same Volume is shown as multiple resources

crystal lynx
#

Similar to my last post, "Same Qtree is shown as multiple resources", we now see volumes shown as multiple resources. See screenshots.

Possibly important note:
We moved all volumes of our systems from one aggregate to another. The old aggregate was rebuilt with a new disk layout, and then we moved all volumes back.
Harvest Version: 24.05.0
Dashboard: ONTAP: Volume

agile swan
#

Hi @crystal lynx can you try this? Edit the Top n volumes by average latency panel and replace the query there with this one. Does this remove the duplicates?

label_replace(
  volume_avg_latency{
    datacenter=~"$Datacenter",
    cluster=~"$Cluster",
    svm=~"$SVM",
    volume=~"$Volume",
    style!="flexgroup_constituent"
  },
  "aggr", "", "aggr", ".*"
)
and
topk(
  $TopResources,
  avg_over_time(
    label_replace(
      volume_avg_latency{
        datacenter=~"$Datacenter",
        cluster=~"$Cluster",
        svm=~"$SVM",
        volume=~"$Volume",
        style!="flexgroup_constituent"
      },
      "aggr", "", "aggr", ".*"
    )[3h:] @ end()
  )
)
crystal lynx
#

Will try tomorrow

crystal lynx
agile swan
#

Thanks @crystal lynx. It looks like some of your volume moves also caused volumes to switch nodes, so the node label is now duplicated as well. Can you try this query, which removes both node and aggr from the label set?

label_replace(
  label_replace(
    volume_avg_latency{
      datacenter=~"$Datacenter",
      cluster=~"$Cluster",
      svm=~"$SVM",
      volume=~"$Volume",
      style!="flexgroup_constituent"
    },
    "aggr", "", "aggr", ".*"
  ),
  "node", "", "node", ".*"
) 
and 
topk(
  $TopResources,
  avg_over_time(
    label_replace(
      label_replace(
        volume_avg_latency{
          datacenter=~"$Datacenter",
          cluster=~"$Cluster",
          svm=~"$SVM",
          volume=~"$Volume",
          style!="flexgroup_constituent"
        },
        "aggr", "", "aggr", ".*"
      ),
      "node", "", "node", ".*"
    )[3h:] @ end()
  )
)
crystal lynx
#

@agile swan hey Chris. Tried and got the same error.
We had the aggregates rebuilt with a different disk layout but the same names (if it helps).

agile swan
#

Can you share the results of running this query in Prometheus?
count(volume_avg_latency) by (datacenter,cluster,svm,volume,style) > 1

crystal lynx
#

0 results.
If I take out the "> 1", I get plenty of results.

frail acorn
#

@crystal lynx could you run the query below? Click on Graph in Prometheus and change the time range to 180d, as in the screenshot.

crystal lynx
#

I have sent you both a private message with the result 🙂
Have a nice weekend

frail acorn
#

Thanks! The query result shows the issue. We have a duplicate record for vol0 with a missing svm label, which explains why the above query fails with an error. I have sent you a query via DM. Could you share the results of this?

#

Thank you for sharing the results. As we can see, the svm label has been missing for the root volume vol0 since sometime in March. Do you recall any changes made in ONTAP or in Harvest around that time that might have caused the svm label to go missing?

crystal lynx
#

Around March 20th we rebuilt all aggregates because our Advanced Disk Partitioning (ADP) aggregates had an unsupported disk layout.
So we built temporary aggregates on temporary shelves, moved all volumes, rebuilt all aggregates on the old shelves with the supported ADP configuration, and moved all volumes back.

Could this have caused the issue?

frail acorn
#

Yes that could be the issue.

As a workaround, you can try the query below for the latency panel, which excludes the vol0 records.

label_replace(
  volume_avg_latency{
    datacenter=~"$Datacenter",
    cluster=~"$Cluster",
    svm=~"$SVM",
    volume=~"$Volume",
    style!="flexgroup_constituent",
    volume!="vol0"
  },
  "aggr", "", "aggr", ".*"
)
and
topk(
  $TopResources,
  avg_over_time(
    label_replace(
      volume_avg_latency{
        datacenter=~"$Datacenter",
        cluster=~"$Cluster",
        svm=~"$SVM",
        volume=~"$Volume",
        style!="flexgroup_constituent",
        volume!="vol0"
      },
      "aggr", "", "aggr", ".*"
    )[3h:] @ end()
  )
)
crystal lynx
#

Hmm still same error

frail acorn
#

OK, if you exclude March from the time range, do you still get this error?

crystal lynx
#

It seems so

frail acorn
#

OK, how about if we look at just the month of May?

crystal lynx
#

Ok weird. Same error

frail acorn
#

Yes very strange.

crystal lynx
#

The difference is that I now have all clusters and all volumes selected.

#

If I choose only one random volume (the same one as in the original post), I get results.

frail acorn
#

OK, if we run the query below, how many records do we get?

count_over_time(volume_avg_latency{svm=""}[180d])
crystal lynx
#

Sent you results in PVT

#

Not every Cluster seems affected

#

Sent you a List in PVT

frail acorn
#

Thanks

crystal lynx
#

Hmm, I have seen that it only depends on how I set the time range. The further back I go, the more clusters are affected.

frail acorn
#

Yes, this is strange, as it implies that the same volume name can exist with different aggr values within one svm?

crystal lynx
#

As far as I know, that is not possible in ONTAP.

frail acorn
#

Yes

#

The error means that we have two instances of volume_avg_latency with the same labels once we remove the aggr label from them.
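
To make that explanation concrete, here is a minimal Python sketch of how removing a label can collapse two distinct series into one label set. It only models the idea and does not query Prometheus; the label values and the helper names label_replace_empty and duplicate_labelsets are hypothetical. It assumes, as in Prometheus, that a label set to an empty value is the same as an absent label.

```python
from collections import Counter


def label_replace_empty(labels, label):
    """Mimic label_replace(v, label, "", label, ".*") on one series' labels.

    Setting a label to the empty string is treated as removing it.
    """
    return {k: v for k, v in labels.items() if k != label}


def duplicate_labelsets(series_list, dropped_labels):
    """Return the label sets that occur more than once after dropping labels."""
    counts = Counter()
    for labels in series_list:
        reduced = dict(labels)
        for label in dropped_labels:
            reduced = label_replace_empty(reduced, label)
        # frozenset makes the label set hashable and order-independent
        counts[frozenset(reduced.items())] += 1
    return [dict(key) for key, n in counts.items() if n > 1]


# Hypothetical series: vol_a was seen on two aggregates over the time range.
series = [
    {"cluster": "c1", "svm": "svm1", "volume": "vol_a", "aggr": "aggr_old"},
    {"cluster": "c1", "svm": "svm1", "volume": "vol_a", "aggr": "aggr_new"},
    {"cluster": "c1", "svm": "svm1", "volume": "vol_b", "aggr": "aggr_new"},
]

dups = duplicate_labelsets(series, ["aggr"])
print(dups)
```

With aggr in place all three series are distinct; once it is dropped, the two vol_a entries share one label set, which is the situation the error message describes.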

frail acorn
#

@crystal lynx We have tried it ourselves, and it's a bug in Prometheus; it is present in 2.51 and 2.52 as well.

#

@safe lion For your reference in NABox

safe lion
#

Hopefully not present in VictoriaMetrics!

#

2.52 is current. Brilliant.

frail acorn
#

@glass sandal Could you share screenshot from your local here as well for reference.

glass sandal
frail acorn
crystal lynx
#

@safe lion can I install a different Prometheus version in NAbox that does not have this bug?
Or how should we deal with that?

safe lion
#

Any chance you can give NAbox 4 a shot? I'm curious to see whether VictoriaMetrics has the same issue. I don't think downgrading Prometheus is supported.

#

Looks like the fix has been merged so it should be available in next release

frail acorn
#

They have fixed it in 2.52.0 but we still see it in 2.52.0.

crystal lynx