#Disk.yaml not doing what is expected.

hot nimbus
#

When I review disk_labels, I would expect any disk that has outage="xxxyyyzzz" (a non-empty value) to have its metric set to zero, based on the rule in the disk.yaml file, but it doesn't, as seen in the following:

disk_labels{container_type="broken",failed="true",model="X375_SMBPE04TA07",outage="failed",owner_node="XXYY07node5",serial_number="ZY1235F",shared="false",shelf="12345456789",shelf_bay="47",type="FSAS",aggregate="",firmware_revision="NA00",datacenter="MYDC",cluster="XXXYYY07",disk="9.13.47",index="XXYY07node5_9.13.47",node="XXYY07node5"} 1.0

the rule in the disk.yaml is as follows:

plugins:
  LabelAgent:
    value_to_num:
      - new_status outage - - 0
    join:
      - index _ node,disk

Am I missing something as to why the value of the metric isn't 'zero' when outage is set?

I am testing this on 24.02 but I also have verified it doesn't work on 23.05 (which is our prior version we used).

charred mountain
hot nimbus
#

Every device is marked as "1" with disk_new_status as well.

opal warren
#

@hot nimbus that seems consistent with what you shared earlier, right? In other words, all of your disks have outage info associated with them, which means that disk_new_status will have the value of 1

hot nimbus
#

correct. All the disk_labels are "1" as well, assuming that the outage label isn't changing it to 'zero' as indicated in the template

#

which doesn't sound like that is expected

opal warren
#

the disk_labels will always be 1 (it's hardcoded that way) and aren't related

hot nimbus
#

ok, so how does disk_new_status change?

opal warren
#

disk_new_status is the metric you want to use to track disk outage info

hot nimbus
#

I haven't looked at that before

#

ok, that said, how does it change? I have two failed disks that I can query through prom directly to find (using the failed or outage labels) but the metric for either of these never changes

opal warren
hot nimbus
#

outage = "" on all but two disks, but all the disks in disk_new_status are 1

opal warren
#

ah ok, that wasn't clear. can you share a screenshot of the results of each of these prom queries?
disk_labels{outage!=""} and disk_new_status==0

hot nimbus
#

I'll post the results once I've modified the data to obscure it. I can't post live data snapshots to Discord

opal warren
#

feel free to email if that's quicker

hot nimbus
#

here's the disk_labels{outage!=""} query result:

disk_labels{cluster="XXXYYY07", container_type="broken", datacenter="ABC", disk="9.13.47", failed="true", firmware_revision="NA00", index="XXXYYY07n05b_9.13.47", instance="harvestserver1:12996", job="harvest_scrape", model="X375_SMBPE04TA07", node="XXXYYY07n05b", outage="failed", owner_node="XXXYYY07n05b", serial_number="12345", shared="false", shelf="123456789001", shelf_bay="47", type="FSAS"}
1
disk_labels{cluster="YYYZZZ01", container_type="broken", datacenter="ABC", disk="17.10.17", failed="true", firmware_revision="NA00", index="17.10.17", instance="harvestserver1:12997", job="harvest_scrape", model="X477_SMBPE04TA07", outage="init failed", serial_number="12367", shared="false", shelf="1234567890", shelf_bay="17", type="FSAS"}
1

#

and here's the output of disk_new_status == 0

#

disk_new_status{cluster="XXXYYY07", datacenter="HPC", disk="9.13.47", index="XXXYYY07n05b_9.13.47", instance="harvestserver1:12996", job="harvest_scrape", node="XXXYYY07_05"}
0
disk_new_status{cluster="YYYZZZ01", datacenter="HPC", disk="17.10.17", index="17.10.17", instance="harvestserver1:12997", job="harvest_scrape"}
0

#

I guess it is changing it to zero, not sure why I wasn't able to query it before with it set to < 1

#

this is good though, it's nice to be able to see that data finally

opal warren
#

cool, yep those match. Looks like you're all good now?
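
For reference, the behaviour confirmed by the two queries above can be sketched in a few lines of Python. This is only an illustration, not Harvest's implementation. It assumes the `-` placeholders in the `value_to_num` rule stand for an empty `outage` value, that an empty value maps to 1, and that anything else falls back to the trailing default of 0. The function name and the healthy disk `1.0.1` are made up for the example.

```python
def value_to_num(label_value, ok_values=("",), default=0):
    """Return 1 when the label matches an 'ok' value, otherwise the default.

    Assumption: for the rule `new_status outage - - 0`, the only 'ok'
    value is the empty string and the default is 0.
    """
    return 1 if label_value in ok_values else default


# outage label per disk; "1.0.1" is a hypothetical healthy disk
outage_by_disk = {
    "9.13.47": "failed",
    "17.10.17": "init failed",
    "1.0.1": "",
}

disk_new_status = {
    disk: value_to_num(outage) for disk, outage in outage_by_disk.items()
}
print(disk_new_status)
```

Under these assumptions the two disks with a non-empty outage label get 0 and the healthy disk gets 1, which matches the `disk_new_status == 0` result above, while `disk_labels` stays at 1 for every disk.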

hot nimbus
#

yeah, I just need to be able to get disk_new_status results into a dashboard where I'm using disk_labels now

opal warren
#

sounds good. Depending on exactly what you need, you may be able to ignore disk_new_status in your dashboard and use disk_labels{outage!=""} instead

hot nimbus
#

oh, one thing I found weird: on SSD-based systems, it doesn't show an aggregate in the output of disk_new_status. When I query the data directly, it shows 3 aggregates (based on how NA set up the disks)

#

I'm guessing when there is more than one aggregate, it won't populate it

#

?

#

do I need to start a new discussion for that?

#

I understand for failed disks, it could be without

opal warren
opal warren
hot nimbus
#

is disk_new_status based on the Disk template?

opal warren
hot nimbus
#

zapi, we aren't at 9.12 yet

#

I guess I need to get a course on how to write dashboards better then

opal warren
#

nothing in that template about aggregates. Which key/value pair has aggregate info, using the example you pasted above?
disk_new_status{cluster="XXXYYY07", datacenter="HPC", disk="9.13.47", index="XXXYYY07n05b_9.13.47", instance="harvestserver1:12996", job="harvest_scrape", node="XXXYYY07_05"}

hot nimbus
#

if it's based on disk.yaml, then I have added it to a custom_disk.yaml file

#

the data is there in the API, but if it's also somewhere else I can easily index from disk_labels data, I would be willing to do that

#

at least 'index' exists on disk.yaml disks, is that used elsewhere?

opal warren
#

if you share your custom_disk.yaml we can help, not sure what edits you made there

hot nimbus
#

added 'aggregate' and 'firmware_revision'

#

export_options:
  instance_keys:
    - disk
    - index
    - node
    - aggregate
  instance_labels:
    - container_type
    - failed
    - model
    - outage
    - owner_node
    - serial_number
    - firmware_revision
    - shared
    - shelf
    - shelf_bay
    - type

opal warren
#

disk_labels are used in three dashboards: disk.json, datacenter.json, and health.json

hot nimbus
#

at the end of the day, what I am trying to do is capture the 'time' that the disk_new_status metric changed from 1 -> 0 and use that as a way to track the time of disk replacement (or how long the disk has gone without being replaced).

#

probably will do that over a 30 day window

#

and display that time in a dashboard view along with the disk information

opal warren
hot nimbus
#

that's a cool feature

#

I "would you like to know more" about this feature. The documentation shows how to add it, but it's not clear on how to use it

#

it adds another metric that tracks the changes?

opal warren
#

The Disk dashboard has this panel, and that would be a good place to start. The tricky part of what you want to do is related to tracking the time. The changelog plugin tracks those changes via the metric value instead of a label because tracking it in a label leads to cardinality problems

opal warren
# hot nimbus it adds another metric that tracks the changes?

Yes, those metrics are shown in the documentation I linked, e.g.
change_log{aggr="umeng_aff300_aggr2", cluster="umeng-aff300-01-02", datacenter="u2", index="2", instance="localhost:12993", job="prometheus", node="umeng-aff300-01", object="volume", op="delete", style="flexvol", svm="harvest", volume="harvest_demo"} 1698735708
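
As a rough sketch of how that sample value could feed a 'time of failure' column, the snippet below treats the change_log value as a Unix timestamp in seconds. That is an assumption based on the example value above; the helper names and the `now` value are made up for the illustration.

```python
from datetime import datetime, timezone


def change_time(sample_value):
    """Interpret a change_log sample value as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(sample_value, tz=timezone.utc)


def days_since(sample_value, now):
    """Days elapsed between the recorded change and `now`."""
    return (now - change_time(sample_value)).total_seconds() / 86400


changed_at = change_time(1698735708)
print(changed_at.isoformat())  # 2023-10-31T07:01:48+00:00

# hypothetical "current" time, exactly 30 days after the change
now = datetime(2023, 11, 30, 7, 1, 48, tzinfo=timezone.utc)
print(days_since(1698735708, now))  # 30.0
```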

hot nimbus
#

that's pretty much the panel I built (and didn't know it existed)

#

what I want to do is add a 'time of failure' timestamp to the end of the panel

#

and change_log time would be a good place to get the time from

#

I have to run off to an appointment but likely will have other questions about this

#

I could put enough info into what to track that finding it based on the label should be pretty straightforward

#

index seems like a very good candidate

hot nimbus
#

is the change_log in 23.11 and 24.02? It looks that way from the documentation

charred mountain
#

@hot nimbus Yes, the changelog plugin has been available in Harvest since 23.11. Also, if you don't want to track the change in value of the outage label, an alternative could be to simply plot the disk_new_status on a timeseries panel in Grafana. Whenever it dips to 0, that would indicate the time when the disk became faulty.

hot nimbus
#

except that doing so takes a long time when I want to look at a longer period. Specifically, a 30-day window basically times out even with a 'disk_new_status == 0' query

hot nimbus
#

ok, new issue. Why when we use ADP do we not see aggregates assigned to 'disks'?

https://docs.netapp.com/us-en/ontap/concepts/root-data-partitioning-concept.html

The aggregate appears in aggr_labels but doesn't show anything in this panel.

netapp-detail-disk-and-cache-layers

I know that in the case where we use root-data-data, there are 3 aggregates assigned to each disk, is that causing an issue with collection of data?

I know 'spare' disks don't have aggregates assigned to them, but regular disks that are 'shared' probably should have an aggregate.

#

I guess I am confused on ADP and why it's not capturing all the aggregates for those 'shared' disks

hasty briar
#

@hot nimbus I enabled the aggregate field in the disk template and I can get aggr details. For some disk types, the aggregate details are not coming from ONTAP itself.
Harvest reports exactly what the ONTAP Zapi/Rest response gives. We don't know why it behaves this way, but it's a good question for ONTAP.

To check the ONTAP disk Zapi response for that cluster, could you run the Harvest CLI command below and share the disk_data.json file with ng-harvest-files@netapp.com?
Replace POLLER_NAME with your cluster name.
./bin/harvest zapi -p {POLLER_NAME} show data --api storage-disk-get-iter > disk_data.json

hot nimbus
#

I know for 'spare' disks, there are NO aggregates, I get that. For shared disks, it shows multiple aggregates but the metrics when adding aggregates doesn't show any aggregates for those disks.

#

I would expect shared disks to have not one metric, but 3, one for each aggregate that is within the disk

#

but this all depends on adding 'aggregate' to the disk template

#

I don't see any easy way outside of that of associating disks to an aggregate

hot nimbus
#

I am sending an example of the data. The records I want the data from is 1440 records. I will send the dump of the poller data from <storage-disk-info>......</storage-disk-info> for one of the disks that isn't reporting any 'aggregate' even though the output shows 3 aggregates

#

email sent

hasty briar
#

@hot nimbus Thanks, we have received the disk response in mail.

We do support array handling in the Zapi collector, and this is exactly that case.
I added the aggregate field to the template and parsed your disk response, and I am able to get the list of aggrs as a comma-separated value in disk_labels.

I have sent the updated template in the same thread. You can try it and let us know your feedback.

hot nimbus
#

I have sent an email asking a question. I am trying to export all aggregates to this, not just 'shared' ones. The email has more details

hot nimbus
#

I have added tracking to disk changes. When I go to query the harvest server via HTTP, does the change_log show up on each individual exporter or does it add it to the local 'unix' one that we still use? How long does it stay, until another change occurs?

#

as well as enabling it for volume,svm,node for my test server

charred mountain
#

The change log is specific to each poller and has no relationship with the Unix poller. It publishes entries only when it detects changes. It will not stay if there are no changes detected in subsequent polls.

Consider the following example:

  • During the first data collection, the disk outage label changes from A to B. At this time, a change_log metric will be published to showcase this change.
  • In the second data collection, the value remains B. No change_log metric will be published.
  • In the third data collection, the value changes from B to C. At this time, a change_log metric will be published again.
  • In the fourth data collection, the value remains C. No change_log metric will be published.
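
The four polls above can be mimicked with a small sketch. It is not the plugin's code, only the compare-with-previous-poll idea described in the bullets; the function name and the A/B/C values are placeholders.

```python
def published_polls(previous, observed):
    """Return the 1-based poll numbers at which a change would be published.

    A change is published only when the value seen in a poll differs
    from the value seen in the poll before it.
    """
    published = []
    for poll_number, value in enumerate(observed, start=1):
        if value != previous:
            published.append(poll_number)
        previous = value
    return published


# outage label was "A" before the first poll, then B, B, C, C
polls = published_polls("A", ["B", "B", "C", "C"])
print(polls)  # [1, 3]
```
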
hot nimbus
#

I still don't understand how long it will have change_log metric published, is it time based or ???. Prometheus is scraping this and if it happens to NOT get the data collected when it "Appears" how do I know that any data was created or not? So if I am reading this correctly, if we have a 3 minute poll time and between time A and time B, 3 minutes pass, it will create a change_log at time B and then the next time the poller queries the hardware, time C (3 minutes later), the change_log is gone?

#

so basically, if that is the case, then it's likely that prometheus won't possibly get the change_log metric

#

I guess I don't understand why the metric isn't published as long as the change matches what it already has stored

charred mountain
#

@hot nimbus change_log tracks changes that occur between two polls. If the polling schedule is set to every 3 minutes, then any particular change will be published and available for Prometheus to scrape for 3 minutes. If your Prometheus scrape interval is longer than 3 minutes, it's possible that it may miss some changes.

Are you suggesting that the change_log plugin is not tracking some changes? If that is the case, could you provide an example? Additionally, please share any changes you have made to the relevant template, apart from enabling the change_log plugin.

hot nimbus
#

I am unable to verify whether it is or isn't tracking at this point. The issue I'm referring to is that when Prometheus scrapes Harvest on one cadence and Harvest polls on its own cadence, there is a pretty good possibility that the change_log metric could be missed if the value simply changes between two poll times.

#

example: prom cadence -> 5 minutes, harvest cadence -> 3 minutes. In this case, the harvest poll could record an event 2 minutes after prom scraped the metrics, and before the next scrape from prom, the harvest metric for change_log would be gone, effectively leaving no change_log entry in prometheus
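
The cadence concern can be checked with a small simulation. It is a simplified model, not a description of Harvest or Prometheus internals: it assumes polls and scrapes both start at t=0 and tick at fixed intervals, that a change is published at the first poll at or after it happens, and that the change_log entry stays visible for exactly one poll interval. All names are hypothetical.

```python
def first_tick_at_or_after(t, interval):
    """First multiple of `interval` that is >= t (integer seconds)."""
    return -(-t // interval) * interval


def change_is_scraped(change_time, poll_interval, scrape_interval):
    """True if a scrape lands while the change_log entry is still visible."""
    publish = first_tick_at_or_after(change_time, poll_interval)
    expire = publish + poll_interval  # entry disappears at the next poll
    next_scrape = first_tick_at_or_after(publish, scrape_interval)
    return next_scrape < expire


POLL, SCRAPE = 180, 300  # 3-minute poll, 5-minute scrape

print(change_is_scraped(100, POLL, SCRAPE))  # True: published at 180, scraped at 300
print(change_is_scraped(200, POLL, SCRAPE))  # False: visible 360-540, scrapes at 300 and 600

# publish times within one 15-minute cycle whose window contains no scrape
missed = [p for p in range(0, 900, POLL) if not change_is_scraped(p, POLL, SCRAPE)]
print(missed)  # [360, 720]
```

In this model two of the five poll windows in every 15-minute cycle are never scraped, while a scrape interval no longer than the poll interval (for example 60 seconds against a 180-second poll) catches every window.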