Hi!
I'm currently working on configuring the LGTM stack for observability in our Kubernetes cluster. One important thing we need to monitor is the remaining capacity of all volumes in the cluster, so we can act before someone runs out of space. We use Trident with ONTAP NAS as the backend using Flex Volumes.
When inspecting the metrics, I immediately saw that something was off. kubelet_volume_stats_capacity_bytes correctly reports that my volume is 50 TiB, but kubelet_volume_stats_used_bytes is way off: it reports ~28 TiB, while the true size of my data is 86 GiB. I recognize this as the same discrepancy you get between df and du when checking disk utilization (df reports much more than the true number from du).
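To make the df/du comparison concrete, here is a minimal Python sketch of the two ways of measuring. It assumes the volume is mounted somewhere you can read it; the function names are made up for illustration. As far as I understand, the kubelet metrics come from a statfs-style call on the mount (the df view), but treat that as an assumption rather than a confirmed detail.

```python
import os


def df_stats(path):
    """Filesystem-level view, like df: asks the filesystem via statvfs."""
    st = os.statvfs(path)
    capacity = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return capacity, used


def du_bytes(path, apparent=False):
    """File-level view, like du: walks the tree and sums regular files.

    apparent=False sums allocated blocks (st_blocks is in 512-byte units),
    apparent=True sums logical file sizes. Hard links are counted once.
    Directory entries themselves are ignored to keep the sketch short.
    """
    total = 0
    seen = set()
    for root, _dirs, files in os.walk(path):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size if apparent else st.st_blocks * 512
    return total
```

Calling df_stats("/mnt/my-pvc") and du_bytes("/mnt/my-pvc") on the same mount (the path is a placeholder) should show the same gap as the two metrics above: the first number is whatever the backend reports as consumed, the second is only what the files themselves add up to.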
This is well-known (see https://docs.netapp.com/us-en/ontap/volumes/df-command-file-size-concept.html). But I'm not a storage expert and can't quite wrap my head around the details of why this happens, so I thought I'd ask here.
It's actually one of the most common questions/complaints from my users, and it has caused a lot of confusion. Ideally, I'm hoping someone has an easy trick where I change a parameter somewhere and finally get rid of this ugly bug for good (both in df and in the metrics reported to Grafana). But if that's not possible for whatever reason, surely there's some way to get the true metric from Trident so I can present a useful graph to my users?
Hoping for some input on this. 🙂