#Permission issues, workload monitoring and Harvest Admin with NAbox


tawny chasm
#

Hi, we have a new deployment of NAbox, as suggested by official support, to help figure out why we are experiencing slowness for some customers after upgrading to version 9.13.1P4. The setup is 2x newly acquired 500f nodes with 15 aggregates of all-flash and NVMe, and 2x fas9000 (some years older, but refreshed with new disks and support) with 15+ all-flash aggregates, plus a couple of SATA shelves (currently holding no data).

Anyway, to explain our problems:
Our main and most concerning problem with NAbox is that we cannot add one of our nearstore clusters via the web UI. The configuration validator checks out all fields with a green V, but nothing gets added, committed or saved.

We are also getting a lot of permission denials on certain operations from the pollers/collectors for the clusters we did manage to add. We followed the documentation when creating the roles, we also went through the related posts we could find on here, and applied whatever manual role creation/permissions were suggested, but we still have problems.

Also, we cannot seem to find how to actually activate certain commented-out features like Cifssessions under rest.

Ticking advanced data collection doesn't react or apply either.
Grafana is incorrectly grouping nodes, or failing to display datacenters as configured in harvest.yml via the web UI.

tawny chasm
#

I'm a bit frustrated and feel let down, and I'm sorry that shows so easily in my rhetoric. You could say that in a megacorp I am one of the few fighting to keep some sort of vendor-based solution (purchased or leased, for that matter) in-house and on-premises across our 7 DCs in the Nordics, but this is being used as ammunition to shoot down the solution in next year's budgets

#

in favour of another vendor, whereas I personally put a lot more trust in, and rely a lot more on, NetApp, which we have had a good experience with for the most part, up until just recently.

#

I apologize for that, and would naturally be very thankful for any help and insights so we can progress further.

tawny chasm
#

So, to update: I have seen there have been some updates to the permissions in the docs just now. I've added those, but now have a new error in the log from the nabox-harvest2 instance:
2024-01-26T02:23:44Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=DC1-***FAS0xx-mc-A admin=:8887

#

Entering standby mode error="failed to fetch data: error making request StatusCode: 403, Error: Permission denied, Message: not authorized for that command, API: /api/support/ems/messages?fields=name&return_records=true" Poller=Netapp-FAS8060-XX-RXX-1 collector=Ems:Ems task=instance

#

nabox-harvest2 | 2024-01-26T02:27:31Z ERR collector/collector.go:424 > error="Permission denied => Insufficient privileges: user 'harvest2' does not have read access to this resource errNum="13003" statusCode="0"" Poller=DC2-***FASxx-mc-B collector=ZapiPerf:FCVI plugin=FCVI

#

nabox-harvest2 | 2024-01-26T02:28:40Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=_unix admin=:8887
nabox-harvest2 | 2024-01-26T02:28:45Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=Netapp-FAS8060-XX-XX3-1 admin=:8887
nabox-harvest2 | 2024-01-26T02:28:45Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=Netapp-FAS8060-XX-XX1-2 admin=:8887
nabox-harvest2 | 2024-01-26T02:28:45Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=Netapp-FAS8060-XX-XX2-1 admin=:8887
nabox-harvest2 | 2024-01-26T02:28:45Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=Netapp-FAS8060-XX-XX2-2 admin=:8887
nabox-harvest2 | 2024-01-26T02:28:46Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=DC1-xxxFA1-mc-A admin=:8887
nabox-harvest2 | 2024-01-26T02:28:46Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=DC1-xxxFA2-mc-A admin=:8887
nabox-harvest2 | 2024-01-26T02:28:46Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=DC2-xxxFAS1-mc-B admin=:8887
nabox-harvest2 | 2024-01-26T02:28:46Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=DC2-xxxFAS2-mc-B admin=:8887
nabox-harvest2 | 2024-01-26T02:29:25Z WRN poller/poller.go:1043 > Failed connecting to admin node error="dial tcp 127.0.0.1:8887: connect: connection refused" Poller=_unix admin=:8887

These are new errors that I didn't see before either. I've tried a docker restart, with the same errors repeating. The 8060 name in fact refers to clusters of 2x fas9000 and 2x 500f (newer controllers, since a few months ago); we just haven't gotten around to changing the SVM name of the admin vserver/filer.

tawny chasm
#

By manually running harvest admin start it works, but I don't understand why that is needed

#

Yet I cannot get detailed stats, can't find dashboards for CIFS, and I am unable to add a nearstore cluster

tawny chasm
serene swan
#

Hello.
There is indeed a delay between Harvest documentation updates and NAbox documentation updates when the required permissions change. Permissions had stayed the same for a very long time at the point the NAbox doc was created. It appears there are now regular modifications on the Harvest side, which probably requires that I remove those from the NAbox doc and point to the Harvest documentation instead; I'm working on it. For now I just updated the Harvest doc yesterday.

There is also the issue with advanced data collection that's known and fixed in 3.4.1b, you can manually copy custom_volume_details.disabled to custom_volume_details.yaml in /opt/harvest2-conf/conf/zapiperf and restart harvest.

Customizing Harvest for things like cifssessions is a bit of a challenge. I tried to document Harvest customization here https://nabox.org/faq/#customize-harvest-2 but it's not obvious what needs to be done for things like cifssessions that already have disabled yaml files.

You can create a file custom_cifssessions.yaml in /opt/harvest2-conf/conf/rest that contains :

objects:
  CIFSSession:
    - cifs_session.yaml

and restart harvest (dc restart nabox-harvest2)

I don't know about Grafana incorrectly grouping nodes or failing to display datacenter, it could be related to permissions issues you had

Don't worry about the failure to connect to admin node 8887, it's not something that's fully implemented in NAbox yet, but not critical to ontap monitoring

#

What you call the nearstore cluster is the fas9000 clusters, I suppose? If you can't add them, it's probably a network or permission issue. You can look at dc logs -f nabox-harvest2 while adding it

tawny chasm
#

Hi, Yann
No, we have some FAS8020s as a nearstore target

#

that is in another DC, which is the backup target for

#

the 500f and 9000 in the cluster (Netapp8060). Sorry for the easy confusion there; I will redo naming standards and such once I have more critical things under control.

#

So we have the 2x 500f and the 2x 9000 in a standard cluster setup in one DC, then we have a nearstore 8020 as backup target in another DC. I can't seem to add that one, and indeed in the logs I can see a permission denied, which I traced to a possible block on the HTTPS port.

#

Thank you for the instructions on CIFS. I will try to get those and the NFS clients up and running straight away; those are where we need to pull the most data from

tawny chasm
#

So isn't there anything I can add somewhere in the docker files

#

to have it run harvest admin start, though?

#

I added the custom config and restarted nabox-harvest2, but
I don't see any changes here; I would expect at least the warning to go away

#

or is it dependent on actual data, and will it take a little while to accumulate?

#

Also, I am unsure about a couple of role accesses that I still can't get to work, under /private and /support I think

serene swan
#

I will look into the harvest admin so it's implemented natively but that requires planning. Can you check the content of /opt/harvest2-conf/conf/rest/custom.yaml ?

tawny chasm
#

yes

#

cat /opt/harvest2-conf/conf/rest/custom.yaml

Objects:
CIFSSession:
- cifs_session.yaml
objects: []

#

Sorry for the delay, back-to-back meetings now, but I will try to keep looking. If you @highlight me when you write, I can give quicker feedback

serene swan
#

That doesn't seem to be right

#

Can you show me exact /opt/harvest2-conf/conf/rest/custom_cifssessions.yaml

tawny chasm
#

yes

#

messed formatting, hence the screenshot
cat /opt/harvest2-conf/conf/rest/custom_cifssessions.yaml
Objects:
CIFSSession:
- cifs_session.yaml

serene swan
#

This is what you should have :

nabox-api:/opt/harvest2-conf/conf/rest# cat custom_cifssessions.yaml
objects:
  CIFSSession:
    - cifs_session.yaml
nabox-api:/opt/harvest2-conf/conf/rest# cat custom.yaml
objects:
  CIFSSession: cifs_session.yaml
#

From what you pasted earlier, /opt/harvest2-conf/conf/rest/custom.yaml looks incorrect

tawny chasm
#

ok let me cat both in a screenshot

#

moment

#

so you do this from inside the context of the nabox container

#

or does it not matter really, shouldnt i guess

serene swan
#

All the paths are in the context of the VA, not the container

tawny chasm
#

yeah , exactly

serene swan
#

you don't need to jump into the container for this

tawny chasm
serene swan
#

it's "objects" (all lowercase)
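The capitalization and indentation problems above can be caught before restarting. A minimal Python sketch, assuming a custom template holds a single top-level objects: key as in the examples in this thread; check_custom_template is a hypothetical helper, not part of Harvest or NAbox, and it does a plain line scan rather than real YAML parsing:

```python
def check_custom_template(text: str) -> list:
    """Return a list of problems found in the text of a custom*.yaml file."""
    problems = []
    # Keep only meaningful lines: skip blank lines and YAML comments.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        return ["file is empty"]

    first = lines[0]
    if first.strip() != "objects:":
        problems.append(
            f"first key should be 'objects:' (all lowercase), found {first.strip()!r}"
        )
    if first != first.lstrip():
        problems.append("top-level key must not be indented")

    # Everything below the top-level key has to be indented to nest under it.
    for ln in lines[1:]:
        if ln == ln.lstrip():
            problems.append(f"line is not indented under the top-level key: {ln!r}")
    return problems


if __name__ == "__main__":
    good = "objects:\n  CIFSSession:\n    - cifs_session.yaml\n"
    bad = "Objects:\nCIFSSession:\n- cifs_session.yaml\n"
    print(check_custom_template(good))
    for problem in check_custom_template(bad):
        print(problem)
```

An empty result only means the two mistakes seen in this thread are absent; it does not prove the file is valid for Harvest.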

tawny chasm
#

Ok

#

Bingo!

#

what was the other CIFS module, one sec

#

CifsShare, do i add that under same objects: in same formatting?

#

or make a new custom yaml probably, to separate em is better

serene swan
#

You can do either indeed. If you do a single file, it's probably more logical to rename it custom_cifs.yaml

tawny chasm
#

yeah, natural:) thanks

#

what role is missing for /ems and /private

#

those are permission denies i still get on the pollers

#

doesn't seem to have worked, or maybe I should give it a little time

#

that made custom.yaml empty, so how do i need to format it to bundle then?

#

ill try splitting em then to see

serene swan
#

You're missing a ":"

#

after NetPort

tawny chasm
#

ah yeah

#

well i did

#

so gonna try restarting now

tawny chasm
#

Seems to be working great. I think it's only the permissions for ems and private then

#

that i am not sure about which role to give it

#

permission denied on ems, and then on NetConnections i got an error ;
.

#

and support

serene swan
#

Great news. We might need @subtle plover here.

tawny chasm
#

Ok

#

and also about the Changelog plugin

#

how do I successfully load that?

#

and thanks guys

#

this is giving me ammunition we direly need for the next follow-up in 5 hours

#

by ammunition i mean , some matter of defending the solution

#

instead of just having to swallow network dept, and customer claims

plush sphinx
tawny chasm
#

yeah ill try

serene swan
tawny chasm
#

There was only 1 permission in the first one you pasted;

#

the least-privilege-approach one:
A::> security login role create -role harvest2-role -access readonly -cmddirname "metrocluster configuration-settings mediator add"
The rest were duplicate entries. I will try the one from the #configure-role link to the documentation

#

So the idea is that it should cover /ems, /api/private, /support and /licensing now?

#

i am inserting roles now

#

all are duplicate entries

#

so nothing new,

#

gonna look at the changelog now.

#

So, enabling the ChangeLog plugin: does it follow the same structure as enabling other modules like CIFSSession?

#

plugins:
  - ChangeLog:
    - track:
      - node
      - volume
      - svm
      - style
      - type
      - aggr
      - state
      - status
      - junction_path

like in which file would this go

plush sphinx
#

The updated list of permissions will cover everything but /support/autosupport. That endpoint has a bug in ONTAP RBAC. We'll have a workaround next week.

The ChangeLog plugin is located in the volume, svm, and node templates. Here's a link that shows the location of each of the templates that contain the ChangeLog plugin. https://github.com/search?q=repo%3ANetApp%2Fharvest+changelog+language%3AYAML&type=code&l=YAML By default you don't need to add new things to track unless the defaults don't fit your needs

tawny chasm
#

so i uncomment the lines in the respective templates

plush sphinx
#

yes, that's right

tawny chasm
#

Ok, I will try that now. Thank you so far; I just had a meeting with customers and could show some latency graphs that were most valuable

#

The ChangeLog module: can it be used for any kind of 'auditing' as well? Is that part of what it covers?

plush sphinx
#

Glad the graphs were helpful @tawny chasm! I'm not aware of anyone using the ChangeLog plugin for auditing purposes, but that should work with some caveats. The ChangeLog plugin can detect object creation, modification, or deletion. Keep in mind that change detection only works while Harvest is running, and only for label values, not metric values. The gist of how it works is that Harvest keeps an in-memory cache of the objects you enabled the ChangeLog on. Subsequent polls compare the current labels with the previously cached values, and when there are changes to the tracked labels, Harvest publishes a new change_log metric. That means that after Harvest is restarted, two subsequent polls are needed to detect changes.
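The cache-and-compare behaviour described above can be sketched in a few lines of Python. This is an illustration only, not Harvest's implementation: ChangeLogSketch, diff_labels, and the op names "create" and "update" are assumptions made for the example, and objects are reduced to plain dicts of labels.

```python
def diff_labels(previous: dict, current: dict, tracked: list) -> list:
    """Compare two polls; return (op, object_key, label) tuples for tracked labels."""
    changes = []
    for key, labels in current.items():
        if key not in previous:
            changes.append(("create", key, None))
            continue
        for label in tracked:
            if previous[key].get(label) != labels.get(label):
                changes.append(("update", key, label))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


class ChangeLogSketch:
    """Keeps the last poll in memory, so nothing is known right after a restart."""

    def __init__(self, tracked: list):
        self.tracked = tracked
        self.cache = None  # no previous poll yet

    def poll(self, current: dict) -> list:
        if self.cache is None:
            changes = []  # the first poll only fills the cache
        else:
            changes = diff_labels(self.cache, current, self.tracked)
        # Copy the label dicts so later edits by the caller do not alter the cache.
        self.cache = {key: dict(labels) for key, labels in current.items()}
        return changes
```

The first call to poll always returns an empty list, which mirrors the point above that two polls are needed after a restart. Labels that are not in the tracked list never produce a change.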

tawny chasm
#

yeah ok, I was hoping we could have it as some safety fallback if other things fail

#

and possibly even feed it out to a DB, or modify the NAbox with syslog-ng and specific rewrites and rules to parse it into Graylog

#

It would make sense to have the auditing come from there as well, for our internal concerns and security measures, and in the central logging.
But meh, I'm just thinking out loud a bit. Thanks for explaining how it works. I gather it can be used, and I would just have to accept a timer and statically trigger 2x polls/events, and from there on rely on it tracking the tracked labels. What would possibly be a better solution?

#

If I want to have that viewable in Grafana anyway, it makes sense to build as much as is sensible in the module, if I'm going to combine it into a greater homebrew stew

#

if you know what I mean. I'm probably not explaining very well, I haven't slept for 36 hours working on this

#

will be good night very soon 🙂

#

modifying the templates for plugin now anyway, 1min and ill have it restarted also

plush sphinx
#

yes, some folks have told us they're using Telegraf to scrape their Prometheus instance and send the data somewhere else

the polling I described happens automatically by Harvest, no need to trigger anything statically. You can change the frequency at a per-template level with the schedule parameter. https://netapp.github.io/harvest/nightly/configure-zapi/#collector-configuration-file and examples here https://github.com/search?q=repo%3ANetApp%2Fharvest+schedule+language%3AYAML&type=code

Oh dear, sleep sounds like a good idea 🙂
In terms of visualizing, I'd start with the existing Changelog Monitor dashboard

tawny chasm
#

/opt/packages/harvest2/container/prometheus/alert_rules.yml ?

#

just verifying,

plush sphinx
#

alert_rules are examples of how to setup Prometheus Alertmanager rules for Harvest provided metrics. In terms of enabling changelog, you can ignore that file

tawny chasm
#

ok, but would that be where I'd add some catching for audits, evidently

#

if I were to?

plush sphinx
#

yes

tawny chasm
#

ok, yes makes sense

#

Ok, so you'd start with the existing Changelog dashboard

plush sphinx
#

i would

tawny chasm
#

could you give one practical example of how you could add something to that or build from that

gilded lance
#

I'm glad to see some good progress was made here. Go get some sleep! 🙂

plush sphinx
#

There are three examples in alert_rules.yml of setting up a rule for changelog metrics. Are you familiar with Prometheus rules? Not something that Harvest builds, but a feature of Prometheus that we use.

The expr is generally where to start. For example, expr: change_log{op="delete",object="volume"} > 0 is a rule that will fire when that Prometheus query evaluates to true. change_log is the metric the ChangeLog plugin publishes, with the labels op and object. op is the operation, and in this case we care about the change_log metrics that represent deletes of the object volume.
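The same expression can also be run ad hoc against Prometheus, for example to forward matches to central logging. A minimal Python sketch using the Prometheus HTTP API instant-query endpoint (/api/v1/query); the base URL is a placeholder and the helper names are hypothetical, so the address of the Prometheus instance on a given NAbox has to be filled in:

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def extract_metrics(response: dict) -> list:
    """Return the label sets of all series in a successful instant-query response."""
    if response.get("status") != "success":
        return []
    return [sample["metric"] for sample in response["data"]["result"]]


def fetch_changes(base_url: str, expr: str) -> list:
    """Run the query against a live Prometheus and return the matching label sets."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=10) as resp:
        return extract_metrics(json.load(resp))


if __name__ == "__main__":
    # Placeholder address; replace with the real Prometheus URL before calling fetch_changes.
    expr = 'change_log{op="delete",object="volume"} > 0'
    print(build_query_url("http://prometheus.example:9090", expr))
```

Because the expression ends in > 0, Prometheus only returns series that satisfy the comparison, so an empty list means no matching change_log series at query time.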

tawny chasm
#

ok, updated all fields of the ChangeLog plugin now, I think

#

gonna restart nabox-harvest2

#

hmm

#

must have missed some files in regards to the ChangeLog plugin

#

thats odd

#

i thought i updated all 8 files, saved em

#

restarted, and they all seem to have been back with commenting

#

oO

#

SMZOS2NMG001:/opt/harvest2-conf/conf# cat */*/* |grep Change

  - ChangeLog
  - ChangeLog
  - ChangeLog
  - ChangeLog

no comments now

#

/opt/harvest2-conf/conf# cat */*/*/* |grep Change

  - ChangeLog
  - ChangeLog
  - ChangeLog
#

restarting it again then

#

i must be editing the wrong files

#

they are commented out again after restarting nabox-harvest2

plush sphinx
#

You need to make the changes the same way you made the earlier ones. That approach is documented here as steps 1 and 2: https://github.com/NetApp/harvest/discussions/2556
In step 2, instead of updating the file with the schedule contents, you would update the file to contain

tawny chasm
#

oh, ok, i thought uncommenting it as is was enough

#

mybad. ok fixing

#

meanwhile i did update the roles as instructed

#

I'm still seeing some
nabox-harvest2 | 2024-01-26T18:59:46Z ERR fcvi/fcvi.go:45 > Failed to fetch data error="error making request StatusCode: 404, Message: API not found, Code: 3, API: /api/private/cli/metrocluster/interconnect/adapter?fields=node%2Cadapter%2Cport_name&return_records=true" Poller=DC1-99XFAS003-mc-A href=api/private/cli/metrocluster/interconnect/adapter?return_records=true&fields=node,adapter,port_name object=FCVI plugin=RestPerf:FCVI
nabox-harvest2 | 2024-01-26T18:59:46Z ERR collector/collector.go:424 > error="error making request StatusCode: 404, Message: API not found, Code: 3, API: /api/private/cli/metrocluster/interconnect/adapter?fields=node%2Cadapter%2Cport_name&return_records=true" Poller=DC1-99XFAS003-mc-A collector=RestPerf:FCVI plugin=FCVI

#

errors of that sort*

#

wait hang on

#

that is what is there. If I remove the comment as I did, isn't that correct?

plush sphinx
#

what version of ONTAP is that poller monitoring?

tawny chasm
#

9.13.1p4 i think

#

let me double-check

#

9.13.1P4. Metrocluster, normal cluster and nearstore backup cluster at various locations are what it's pointing at atm

#

all on same version

#

brb. 2minutes

plush sphinx
#

the log message you pasted is not a permission issue, but an API not found which typically means the cluster you're monitoring does not have that object, in this case FCVI
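The distinction drawn here (403 is a permission problem, 404 means the API or object is absent on that cluster) can be applied mechanically when skimming dc logs output. A minimal Python sketch, assuming log lines shaped like the ones pasted in this thread; classify_harvest_error and its category names are hypothetical, not part of Harvest:

```python
import re

# REST errors in the pasted logs carry "StatusCode: <n>".
STATUS_RE = re.compile(r"StatusCode: (\d+)")


def classify_harvest_error(line: str) -> str:
    """Sort a Harvest log line into a rough category for triage."""
    if "connection refused" in line:
        return "connection refused"
    match = STATUS_RE.search(line)
    if match:
        code = int(match.group(1))
        if code in (401, 403):
            return "permission"
        if code == 404:
            return "api not found"
        return f"http {code}"
    # ZAPI errors in the pasted logs have no HTTP status, only a message.
    if "Permission denied" in line or "Insufficient privileges" in line:
        return "permission"
    return "other"


if __name__ == "__main__":
    samples = [
        'error="dial tcp 127.0.0.1:8887: connect: connection refused" admin=:8887',
        "error making request StatusCode: 403, Error: Permission denied",
        "error making request StatusCode: 404, Message: API not found, Code: 3",
        "Permission denied => Insufficient privileges: user 'harvest2' does not have read access",
    ]
    for sample in samples:
        print(classify_harvest_error(sample), "<-", sample[:60])
```

Only the "permission" category calls for role changes on the cluster; "api not found" usually points at an object the cluster does not have, as explained above.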

#

what's the path of the screenshot of the node template you pasted above?

tawny chasm
#

SMZOS2NMG001:/opt/harvest2-conf/conf# nano rest/9.12.0/node.yaml
SMZOS2NMG001:/opt/harvest2-conf/conf# echo $PWD
/opt/harvest2-conf/conf
SMZOS2NMG001:/opt/harvest2-conf/conf#

#

so; /opt/harvest2-conf/conf/rest/9.12.0/node.yaml

plush sphinx
#

thanks. Whenever nabox is restarted it overwrites those files. That's why you need to follow the steps 1 and 2 here https://github.com/NetApp/harvest/discussions/2556. In Step 2 instead of updating the file with the schedule contents you would update the file to contain

#

That's what you did here (message #1200214236702441622). These changes are similar: you create a custom template and make the changes in that custom template

tawny chasm
#

uhm, so i would make a custom_plugins.yaml , insert plugins: - Changelog and_something_more.yaml?

#

because I asked above if it was just to uncomment those lines, maybe we spoke past each other

plush sphinx
#

since you already have a custom.yaml I would add to it the same way you did earlier

tawny chasm
#

ok, but I don't have to do anything about the uncommented code that you referred to on GitHub?

#

i mean commented*

#

those were 8 files of plugins: <<--->> Changelog also,
so instead i make a custom_plugins.yaml

#

plugins:
  - ChangeLog
#

gha, dunno how to format codeblock on here, or so

#

but with the same formatting as you pasted in the screenshot, and then it's loaded. Because from the documentation it looks like you load it on svm, node and volume, for example (it can do other modules as well), but how does that correlate to which custom config?

plush sphinx
#

for each of the zapi(perf) templates that you want to change, you create a new template with a new name (like you did earlier), for example custom_volume.yaml and the contents of custom_volume.yaml in its entirety would be what I pasted earlier. At runtime, the original template and your new template are read and merged. That is how you uncomment ChangeLog without modifying the original.

At the moment, that way of extending only works for Zapi and ZapiPerf. Rest and RestPerf templates can not be extended, they can only be replaced. More details here https://netapp.github.io/harvest/nightly/configure-templates/#extend-an-existing-object-template
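To make the "read and merged" idea concrete, here is a minimal Python sketch of one way an original template and a custom template could be combined. It is not Harvest's merge code and its exact rules are assumptions for illustration: dictionaries are merged key by key, lists are extended with entries the original lacks, and scalars from the custom template win. merge_templates and ExistingPlugin are hypothetical names.

```python
def merge_templates(base, custom):
    """Return a new structure combining base with custom; neither input is modified."""
    if isinstance(base, dict) and isinstance(custom, dict):
        merged = dict(base)
        for key, value in custom.items():
            if key in base:
                merged[key] = merge_templates(base[key], value)
            else:
                merged[key] = value
        return merged
    if isinstance(base, list) and isinstance(custom, list):
        # Keep the original entries and append only what is new.
        return base + [item for item in custom if item not in base]
    # Scalars (or mismatched types): the custom value replaces the original.
    return custom


if __name__ == "__main__":
    original = {"name": "Volume", "plugins": ["ExistingPlugin"]}
    custom = {"plugins": ["ChangeLog"]}
    print(merge_templates(original, custom))
```

The original template stays untouched on disk and in memory, which is the property that matters here: a restart can overwrite the shipped templates without losing the custom addition. Per the message above, this style of extension applies to Zapi and ZapiPerf templates only.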

#

FYI I've got to run to a meeting 3m

tawny chasm
#

ok, get going, and thanks a lot for the help so far

plush sphinx
#

you bet!

tawny chasm
#

i was just wondering how it would know

#

to overwrite "volume.yaml"

#

or

#

svm.yaml

#

from the "custom_volume.yaml" naming

#

or is it what comes after custom_ that matters then, or must match/be matchable?

#

Like, there are 8 toggles in various places, but 1 custom file untoggles them all? If so, then ok: that means uncommenting it in the override file, which is merged on compose, will activate all 8 places of it?

plush sphinx
#

that filename needs to match what you put in the custom.yaml file
you will see something like this in the logs

tawny chasm
#

and sorry for being stupid and not seeing it apparent

#

ok

#

thank you, i will work on what i got for now

plush sphinx
tawny chasm
#

thank you, will look into that

#

well im an opensource bsd-nix guy i thought so dunno, it could be that it started off with some unlucky outdated documents that threw me quite off

#

anyway, i will continue try to get it going

#

c u, and thanks again

subtle plover
odd marten
#

Hello @tawny chasm . Hopefully you are up and running now!

tawny chasm
#

I had a couple days resting time

#

I will try to hook it up now, as I struggled a bit but never took a new shot at the custom ones for ChangeLog

#

doesn't look like the EMS permissions were added

#

i also see;

#

a couple services

#

wondering about that

#

I followed the instructions on the ChangeLog as per the description

#

something is off

tawny chasm
#

for the ems

#

think I discovered a typo for ChangeLog, trying now

#

nope

#

checking again for typos, but doesnt seem that solved it for changelog

#

i discovered this aswell>

#

no ping stats

tawny chasm
#

can't get that to work :/

tawny chasm
#

managed to break it upon trying to change to prometheus-r as exporter also

#

fcu

#

ok i think i got it back up

#

and, it looks like ems also responds now

#

guess im just gonna keep shooting questions;

#

It looks like that is newly reported stats only, according to the chart on the top right? But I think all our aggregates are SSD, with the exception of 1 empty one

#

but no data on latency, activity and such?

#

all flash

subtle plover
tawny chasm
#

Thank you

#

Ok, uhm, do you have any idea about the ChangeLog

#

I did the modifications according to the documentation update that was made

#

(a lot cleaner process as well, + on that)

#

but I don't really get anything up.

subtle plover
#

Have you made any changes to volume, svm, node in ONTAP?

tawny chasm
#

what do you mean like if i have remodeled the command or anything like that

#

or if we are following standard commands syntax in ontap ?

subtle plover
#

Yes, basically the Changelog dashboard detects changes made to these objects, for example a volume rename

tawny chasm
#

oh

#

hang on then

#

1 sec.

subtle plover
tawny chasm
#

I think I added more than just volume, svm and node

#

I'm gonna double-check to verify that, one moment

#

otherwise ill just do a svm lif update to see if that comes up

#

should be valid test

#

_

#

?

subtle plover
#

Okay. If they were added after Harvest started and the ChangeLog templates were fixed, we should detect that

#

We track below changes for svm
name, state, type, anti_ransomware_state

#

so if you make changes for these then it should be detected

tawny chasm
#

anti ransomware state is nice

subtle plover
#

Below are the default tracked labels

svm: svm, state, type, anti_ransomware_state
node: node, location, healthy
volume: node, volume, svm, style, type, aggr, state, status
tawny chasm
#

and type, is that create, update, delete etc.?

subtle plover
tawny chasm
#

ok

#

and, this should collect on all clusters

subtle plover
#

Yes

tawny chasm
#

that is added, w/o exception right?

#

ok, I'm gonna be ballsy and enable anti-ransomware on a metrocluster on the fly here

#

It's relatively small, but we didn't really notice any perf issues (except we might have hit the high-latency issue with some of our more sensitive ones),
not on this cluster though

#

but that should show then

#

Haven't had time to read up on that latency issue, just saw that 9.14.1 should fix it. I hope there are workarounds as well

subtle plover
#

An example of node name change as below

tawny chasm
#

excellent

#

but cant we track who did it also

subtle plover
tawny chasm
subtle plover
tawny chasm
#

then our 3 more critical customers

#

started reporting crazy latency when browsing cifs shares etc

#

which is why i started going down the nabox route after not really getting value from the support

subtle plover
#

Sure, Harvest can help in tracking various ONTAP metrics.

tawny chasm
#

or,

#

send it to some central logging

#

and have it pushed back to grafana on another input

#

?

subtle plover
tawny chasm
#

ok, is it the see also link

#

that is referred to where there has been done work

#

or is it not a started feature yet

#

nvm 2179 found it

subtle plover
#

Feature is not started yet

tawny chasm
#

no but i can checkout the rev

#

and implement it myself

#

i suppose

subtle plover
#

nice! Feel free to contribute it back to our repo.

tawny chasm
#

yeah, I would, and will if I get to finish it,

#

i have some other python scripts ive made, that i was gonna upload to i-built this

#

(dynamic LB of volume re-arrangement on multiple arrays/clusters) etc..

#

just - not enough time for anything :/
and thanks for the answers about making a change and checking here

#

also i noticed one weird issue; but its maybe just me that remembers wrong-

#

when you add a cluster to the nabox,

subtle plover
#

sure. np

tawny chasm
#

autostart: '1'
collectors:
- Rest
- RestPerf
- Ems
datacenter: SF24_Prod

#

from harvest.yml in harvest-2/conf/conf

#

i suppose that is the "merged" one,

#

but I'm being snowblind and can't find where I set the default for the collectors, or whether it was done at the per-cluster level

#

cause 1 is showing up oddly as REstPerf

#

whilst the rest are RestPerf

#

oO

subtle plover
#

Can you share output of command

dc exec -w /conf nabox-harvest2 /netapp-harvest/bin/harvest doctor --print
tawny chasm
#

uhm

#

ueaj

subtle plover
#

Below is my configuration

tawny chasm
#

hang on, I see something, but it's not totally clear. How do I format a code block

#

here

subtle plover
tawny chasm
#

thanks

#
    httpsd:
        listen: :8887
Defaults:
    collectors:
        - Rest
        - RestPerf
        - Ems
        - Zapiperf
        - Zapi
    exporters:
        - prometheus-r
Exporters:
    nabox-prometheus:
        addr: -REDACTED-
        exporter: Prometheus
        master: true
    prometheus-r:
        exporter: Prometheus
        port_range: 13000-13999
Pollers:
    DC1-99XFAS003-mc-A:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Rest
            - RestPerf
            - Ems
            - Zapi
            - ZapiPerf
        datacenter: Kredinor
        password: "-REDACTED-"
        prometheus_port: 12991
        use_insecure_tls: true
        username: -REDACTED-
    DC2-99XFAS003-mc-B:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Rest
            - RestPerf
            - Ems
            - Zapi
            - ZapiPerf
        datacenter: Kredinor
        password: "-REDACTED-"
        prometheus_port: 12992
        use_insecure_tls: true
        username: -REDACTED-
    Netapp-FAS8060-D6-R41-1:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Rest
            - RestPerf
            - Ems
            - Zapi
            - ZapiPerf
        datacenter: SF24_Prod
        password: "-REDACTED-"
        prometheus_port: 12990
        use_insecure_tls: true
        username: -REDACTED-
    Netapp_F8020_Nearstore_1:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Zapi
            - ZapiPerf
            - Rest
            - RestPerf
            - Ems
        datacenter: GM_Backup
        password: "-REDACTED-"
        prometheus_port: 12993
        use_insecure_tls: true
        username: -REDACTED-
    _unix:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Unix
        datacenter: NAbox
        prometheus_port: 12800
Tools:
    grafana_api_token: -REDACTED-
default:
    send_autosupport_stats: '1'

Error: Problems found in custom.yaml files
  conf/zapi/custom.yaml references template file "custom_volume.yaml,custom_volume_blacklist.yaml" which does not exist in conf/zapi/**
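The doctor error above quotes the two file names as a single string, which suggests the comma-joined value was taken as one template name. A minimal Python sketch of that kind of existence check, under stated assumptions: find_missing_templates is a hypothetical helper (not the harvest doctor code), the object name Volume is illustrative, and whether listing the files as separate entries is the right fix for Harvest is not confirmed in this thread.

```python
def find_missing_templates(references: dict, existing_files: set) -> list:
    """Return (object, template) pairs whose template name is not an existing file.

    references maps an object name to a template file name or a list of names.
    """
    missing = []
    for obj, files in references.items():
        if isinstance(files, str):
            files = [files]  # a plain string is one template name, commas included
        for name in files:
            if name not in existing_files:
                missing.append((obj, name))
    return missing


if __name__ == "__main__":
    on_disk = {"custom_volume.yaml", "custom_volume_blacklist.yaml"}
    joined = {"Volume": "custom_volume.yaml,custom_volume_blacklist.yaml"}
    listed = {"Volume": ["custom_volume.yaml", "custom_volume_blacklist.yaml"]}
    print(find_missing_templates(joined, on_disk))
    print(find_missing_templates(listed, on_disk))
```

With the joined value the whole string is reported as missing even though both files exist; with a list, nothing is reported.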
#

i think i manually corrected the RE

#

so id have to dc restart

#

or rebuild it perhaps?

subtle plover
#

Have you manually added Collectors?

tawny chasm
#

I'm not sure?

subtle plover
#

What are cluster versions?

tawny chasm
#

I've only done what's been discussed in this chat, and then the standard deployment

#

9.13P4 I think

#

let me check

#

9.13.1P3

#

P4

subtle plover
#

Okay, I see why ChangeLog is not working: the steps shared are for Zapi templates

tawny chasm
#

I did add back Zapi and ZapiPerf

#

aswell

subtle plover
#

but most of your clusters have Rest precedence

tawny chasm
#

thats only, when i started

#

getting the ssd info, in the chart down on the right

#

but that was a manual hack, so I'm not sure where and how, as it relates to the same thing as with the collectors. Or does it select based on what the cluster is giving out? Can't really be, because I only noticed it because RestPerf stats/sessions dropped to almost 50%

#

when i added ZapiPerf

#

and that went up to almost 50% ~ , ~

#

(i.e. I just changed the running config, added it, and ran killall -HUP on some process)

subtle plover
#

Okay. Let's keep the order as below for collectors

Zapi
ZapiPerf
Rest
RestPerf
Ems
tawny chasm
#

in which file on the docker host

#

would that be then, which is the correct one to edit?

subtle plover
#

We want to edit file /opt/harvest2-conf/harvest.yml

tawny chasm
#

ok

subtle plover
tawny chasm
#

Ok, I will read up on that. I did update the config. I have one quick question that I think might be related.

#

each of my nodes, or "apis" i connect to ie netapp solutions.

subtle plover
#

let's do dc restart

tawny chasm
#

i have set datacenter

#

but,

#

I don't understand the NAbox datacenter, and why it shows as 1 when I selected all of them. I only wanted to mention it in case that's part of the same file.

#

doing the dc restart

subtle plover
#

Let me check

subtle plover
# tawny chasm doing the dc restart

Can you run the command below again after restarting and share the output?

dc exec -w /conf nabox-harvest2 /netapp-harvest/bin/harvest doctor --print
tawny chasm
#

Ok, excellent

#

idd

#
    httpsd:
        listen: :8887
Defaults:
    collectors:
        - Zapi
        - ZapiPerf
        - Rest
        - RestPerf
        - Ems
    exporters:
        - prometheus-r
Exporters:
    nabox-prometheus:
        addr: -REDACTED-
        exporter: Prometheus
        master: true
    prometheus-r:
        exporter: Prometheus
        port_range: 13000-13999
Pollers:
    DC1-99XFAS003-mc-A:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Zapi
            - ZapiPerf
            - Rest
            - RestPerf
            - Ems
        datacenter: Kredinor
        password: "-REDACTED-"
        prometheus_port: 12991
        use_insecure_tls: true
        username: -REDACTED-
    DC2-99XFAS003-mc-B:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Zapi
            - ZapiPerf
            - Rest
            - RestPerf
            - Ems
        datacenter: Kredinor
        password: "-REDACTED-"
        prometheus_port: 12992
        use_insecure_tls: true
        username: -REDACTED-
    Netapp-FAS8060-D6-R41-1:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Zapi
            - ZapiPerf
            - Rest
            - RestPerf
            - Ems
        datacenter: SF24_Prod
        password: "-REDACTED-"
        prometheus_port: 12990
        use_insecure_tls: true
        username: -REDACTED-
    Netapp_F8020_Nearstore_1:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Zapi
            - ZapiPerf
            - Rest
            - RestPerf
            - Ems
        datacenter: GM_Backup
        password: "-REDACTED-"
        prometheus_port: 12993
        use_insecure_tls: true
        username: -REDACTED-
    _unix:
        addr: -REDACTED-
        autostart: '1'
        collectors:
            - Unix
        datacenter: NAbox
        prometheus_port: 12800
Tools:
    grafana_api_token: -REDACTED-
default:
    send_autosupport_stats: '1'

Error: Problems found in custom.yaml files
  conf/zapi/custom.yaml references template file "custom_volume.yaml,custom_volume_blacklist.yaml" which does not exist in conf/zapi/**
subtle plover
#

looks good!

tawny chasm
#

unless more were added in the last 36 hours

#

yeah no, all duplicates, so they all have the permissions set

subtle plover
#

great

#

We can test ChangeLog plugin now

#

Maybe try creating a volume?

tawny chasm
#

ok

#

how much time should we give NAbox to gather data?

subtle plover
#

polling cycle is 3 minutes

tawny chasm
#

yeah I've seen it, but I also saw something saying it should be adjusted to the time it takes to run everything, and so on

#

never got to that part yet

#

still no data in changelog

tawny chasm
subtle plover
tawny chasm
#

6.10 am here

#

the nabox logs

subtle plover
#

yes

tawny chasm
#

nabox-api and harvest2?

subtle plover
#

harvest2

tawny chasm
#

ok well I just copied the command, so I got the api as well

subtle plover
#

That should be ok

tawny chasm
subtle plover
#

Thanks

#

received logs

#

I see data collected for changelog

#

Can you refresh ChangeLog dashboard?

tawny chasm
#

yeah

subtle plover
#

Do you see data?

tawny chasm
#

nope

#

f5, ctrlf5, and ctrl-r in chrome

#

all 0, no data :/

subtle plover
#

What if you expand the Volume Changes row?

tawny chasm
#

yeah

subtle plover
#

Top right of your dashboard: what time range does it show?

tawny chasm
#

there we go

subtle plover
#

nice!

#

time range was the issue?

tawny chasm
#

yeah super nice

#

yes in the end now

#

when you posted 3 hours, I remembered I had 5 min, so

subtle plover
#

great! makes sense

tawny chasm
#

this one is erroring tho

#

as well as this:

subtle plover
#

I see from logs there are several Rest permission issues

tawny chasm
#

yeah

#

is wrong, it needs -vserver ADMIN_VSERVER_NAME

#

at the end

#

think I did this already

#

also

#

ok some missing there tho

subtle plover
#

Okay. Can you share the output of the command below?

cat /opt/harvest2-conf/conf/rest/default.yaml
tawny chasm
#

collector:          Rest

schedule:
  - data: 3m

# See https://github.com/NetApp/harvest/blob/main/docs/architecture/rest-strategy.md
# for details on how Harvest handles the ONTAP transition from ZAPI to REST.

objects:
  Aggregate:                   aggr.yaml
# The CIFSSession template may slow down data collection due to a high number of metrics.
#  CIFSSession:                 cifs_session.yaml
#  CIFSShare:                    cifs_share.yaml
  CloudTarget:                 cloud_target.yaml
  ClusterPeer:                 clusterpeer.yaml
  Disk:                        disk.yaml
  EmsDestination:              ems_destination.yaml
#  ExportRule:                  exports.yaml
  FCP:                         fcp.yaml
  LIF:                         lif.yaml
  Health:                      health.yaml
  Lun:                         lun.yaml
  Namespace:                   namespace.yaml
#  NetConnections:              netconnections.yaml
#  NetPort:                     netport.yaml
  NetRoute:                    netroute.yaml
#  NFSClients:                  nfs_clients.yaml
  Node:                        node.yaml
  NtpServer:                   ntpserver.yaml
  OntapS3:                     ontap_s3.yaml
  OntapS3Policy:               ontap_s3_policy.yaml
  QosPolicyAdaptive:           qos_policy_adaptive.yaml
  QosPolicyFixed:              qos_policy_fixed.yaml
  QosWorkload:                 qos_workload.yaml
  Qtree:                       qtree.yaml
  Security:                    security.yaml
  SecurityAccount:             security_account.yaml
  SecurityAuditDestination:    security_audit_dest.yaml
  SecurityCert:                security_certificate.yaml
  SecurityLogin:               security_login.yaml
  SecuritySsh:                 security_ssh.yaml
  Sensor:                      sensor.yaml
  Shelf:                       shelf.yaml
  SnapMirror:                  snapmirror.yaml
  SnapshotPolicy:              snapshotpolicy.yaml
  Status:                      status.yaml
  Subsystem:                   subsystem.yaml
  Support:                     support.yaml
  SupportAutoUpdate:           support_auto_update.yaml
  SVM:                         svm.yaml
  Volume:                      volume.yaml
  VolumeAnalytics:             volume_analytics.yaml
#

then;

subtle plover
tawny chasm
#

that is probably sec.

#
conf/rest/custom.yaml              conf/rest/custom_netconnections.yaml  conf/zapi/custom_changelog.yaml             conf/zapiperf/custom_volume_details.disabled
conf/rest/custom_cifssession.yaml  conf/rest/custom_netport.yaml         conf/zapi/custom_volume_blacklist.yaml      conf/zapiperf/custom_volume_details.yaml
conf/rest/custom_cifsshare.yaml    conf/rest/custom_nfsclients.yaml      conf/zapiperf/custom.yaml
conf/rest/custom_exports.yaml      conf/zapi/custom.yaml                 conf/zapiperf/custom_volume_blacklist.yaml

As discussed above, I wanted to enable those for the info on our shares.
subtle plover
#

ah got it. Makes sense

#

Usually these objects can be spammy. I think we have had an issue with netconnections in the past. Let me check that.

tawny chasm
#

Ok

subtle plover
#

Ok this was the reason

subtle plover
tawny chasm
#

(Y) and according to the dashboard

#

our total is 0.31% cpu commit

#

so I thought I could check out the load of metrics vs. the view we get

#

I can skip that, I've got other topics as well lol

subtle plover
# tawny chasm

This status is fine when there are no instances to report.

tawny chasm
#

I'm wondering about SecurityLogin and SecurityAuditDestination (I guess that's logdisk-volume)

#

well,

subtle plover
tawny chasm
#

and we def. do have NFS mounts

#

let me just verify dc, and host for nfs

#

ok that is ok, actually nvm

#

then in reality

#

its like this;

#

the ones with troubles

#

i can obv. edit out NetConnections custom yaml for now, but would be nice to see

#

i consider EMS more important tho

subtle plover
#

I think it's a permission issue

#

let me check logs you sent

tawny chasm
#

yeah but I have given the permissions your docs state

#

ok thanks

subtle plover
#

Let's try the URL below to check whether it is a permission issue.
Replace user, pass, and $ip with values in harvest config.

curl --user user:pass -snk 'https://$ip/api/private/cli/network/connections/active'
tawny chasm
#

ok

#

i mean, it will probably say it is

#

but - it does have permissions ?

subtle plover
#

What about below?

curl --user user:pass -snk 'https://$ip/api/support/ems/messages'
tawny chasm
#

actually:
'https://10.66.32.114/api/private/cli/network/connections/active'
{ "error": { "message": "API not found", "code": "3" } }

subtle plover
#

Okay, we see the same in the logs.

#

what about ems endpoint?

tawny chasm
#

you got a link

#

to try against

tawny chasm
#

ok thanks

#

didnt see that

#
{
  "error": {
    "message": "not authorized for that command",
    "code": "6"
  }
}
subtle plover
#

okay so we have permission issues
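
The two curl results above carry different error codes, so it can help to sort responses mechanically when repeating the test against several clusters. The following is a minimal sketch that assumes only the error body shape pasted in this thread; the function name `classify_ontap_error` and the returned labels are made up for illustration. Note that in this thread both calls started working once the REST role was assigned to the user, so `api_not_found` should not be read as proof that the endpoint is missing.

```python
import json


def classify_ontap_error(body):
    """Classify a REST response body using the error codes seen in this thread."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # For example, a shell prompt pasted directly after the JSON.
        return "unparseable"
    error = doc.get("error") if isinstance(doc, dict) else None
    if not error:
        return "ok"
    code = str(error.get("code", ""))
    if code == "6":
        # "not authorized for that command"
        return "not_authorized"
    if code == "3":
        # "API not found"
        return "api_not_found"
    return "other_error"
```

Feeding it the saved output of each curl call gives one label per cluster and endpoint, which makes it easier to see where the role still needs fixing.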

#

Let me look for commands to check

tawny chasm
#

ok

subtle plover
#

What is the output of the command below?

security login show -user-or-group-name harvest2
tawny chasm
#

yeah i did that

#

ok other one

#

sec

subtle plover
#

Ok, I don't see harvest2-rest-role, which you just created?

tawny chasm
#

I added it from the doc just before, so that's strange as well

#

hhmm

#

ok ill give that role to the user

subtle plover
tawny chasm
#

yeah moment

#

hmm

subtle plover
#

okay and what about security login role show -role harvest2-rest-role?

tawny chasm
#

should be good now then

subtle plover
#

yes, let's try the curl command again to verify

tawny chasm
#

ok

#

both work

#

{
"node": "Netapp-FAS9000-D6-R33-3B",
"cid": 1238271224,
"vserver": "z102os2cfi001"
}
],

subtle plover
#

great!

tawny chasm
#

],
"num_records": 7629,
"_links": {
"self": {
"href": "/api/support/ems/messages"
}
}

#

yah that is awesome

subtle plover
#

You want to do the permissions fix for all clusters

#

once that is done.. let's do dc restart in nabox

tawny chasm
#

ok

subtle plover
#

and we should be good

tawny chasm
#

moment

#

ok restarting now

#

bit strange

#

it lists 2 versions now, one with failed creds, the other working

subtle plover
#

That is strange. If you run the query metadata_component_status in Prometheus, what does it show?

#

URL for prometheus will be : nabox_ip/prometheus

tawny chasm
#

one sec

subtle plover
#

you are missing an s at the end

#

metadata_component_status

#

Try running below query
count(metadata_component_status) by (target,poller,name) > 1
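
For anyone unsure what that query returns: it groups the series by the three labels and keeps only groups with more than one member, i.e. duplicates. The same grouping can be sketched offline on a list of label sets. This is an illustration only; the function name `duplicate_components` and the sample label values are hypothetical, and missing labels are treated as empty strings.

```python
from collections import Counter


def duplicate_components(series, keys=("target", "poller", "name")):
    """Mimic `count(...) by (target,poller,name) > 1` on a list of label dicts.

    Returns a dict mapping each (target, poller, name) tuple that occurs more
    than once to its count. An empty dict means no duplicates.
    """
    counts = Counter(tuple(labels.get(k, "") for k in keys) for labels in series)
    return {group: n for group, n in counts.items() if n > 1}
```

An empty result, as seen in this thread, means no (target, poller, name) combination is reported twice.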

tawny chasm
#

ph

#

oh

#

ok sorry

#

empty too

#

lol

#

but it must have data? :)

subtle plover
#

let's refresh the metadata dashboard and check if you see duplicates

subtle plover
subtle plover
tawny chasm
#

how can I enable those ems_alert_rules

#

for example

subtle plover
#

From your screenshot, they already seem enabled in Prometheus?

tawny chasm
#

ok uhm, maybe I didn't understand something here

#

but ok, let me check and get back to you on it in a little bit

tawny chasm
#

Ok, so presumably we could use our monitoring software to poll those URLs

#

and if there is data, raise an alarm, and if not, the coast is clear

#

presumably
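
That idea can be sketched roughly as follows, as a hedged illustration rather than a NAbox feature. It relies on the standard Prometheus HTTP API (`/api/v1/query`) and the built-in `ALERTS` series; that this API is reachable under the `nabox_ip/prometheus` prefix mentioned earlier is an assumption, and the helper names and sample host are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def firing_alert_names(response_body):
    """Return the sorted alert names found in an instant-query response.

    Intended for the result of querying ALERTS{alertstate="firing"}.
    An empty list means nothing is firing.
    """
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise ValueError("query failed: %s" % doc.get("error", "unknown error"))
    results = doc["data"]["result"]
    return sorted({r["metric"].get("alertname", "") for r in results})
```

A monitoring check would fetch the URL from `build_query_url("https://nabox.example/prometheus", 'ALERTS{alertstate="firing"}')` with whatever HTTP client it already uses, then raise an alarm when `firing_alert_names` returns a non-empty list.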

subtle plover
tawny chasm
#

this is lovely, excellent, what I was hoping for

#

is there any thought from your end about HA for NAbox?

#

or is that something you need to sort of design yourself if you ever want something like that, say if you intend to use parts of it for a critical monitoring system?

serene swan
#

Note that there is an ongoing effort to provide a preconfigured alertmanager in NAbox that would handle the stock alerts in harvest, but still WIP.
There is also email configuration in NAbox 3.4.1b that provides alerting capacity to Grafana. That said I didn't check if current grafana integration with harvest published alerts bridges the gap with grafana email config.
To put it simply: can Grafana trigger alerts based on alerts published by Harvest in Prometheus?

tawny chasm
#

yeah that would be what I'm aiming for

#

then furthermore, i am also looking at "embedding" the grafana

#

within-> feeding into a graylog instance (on elasticsearch)

#

i suppose grafana is somewhat the same itself, but you got all the dashboards n stuff for it, and its awesome

plush sphinx
#

Yann can speak to any HA capabilities of NAbox. Harvest is not prescriptive in HA solutions. I know some customers are achieving that with K8 or Terraform.

tawny chasm
#

indeed yeah you could LB it on pods

#

or dockers..

#

Ok, i am going to read that straight away

#

and I also wanted to say, thank you for proving my somewhat... irrational entrance into your beloved community wrong

#

I am now very grateful for the help and support you have given me so far, and I felt I needed to say so.

plush sphinx
#

you bet!

tawny chasm
#

yes, this is lovely and you have put it very clean to begin with here.

#

i like that!

plush sphinx
#

thanks! let us know if something needs clarification or is missing

tawny chasm
#

yeah of course, I will be cross-referencing it back and forth over the upcoming days 😉

#

but I mean, I'm just skimming over the beginning now and it is essentially what puzzles it all together, if you ask me

#

❤️

umbral zealot
#

@subtle plover @plush sphinx @serene swan Could you gents encapsulate what this issue was about so we can update the title accordingly for future search purposes/indexing? These are the kinds of threads we love to see and don't want it to get lost!

Great to see this kind of live problem-solving taking place! Thanks for everyone's patience and contributions!

#

//cc @nova pendant

serene swan
#

Permission issues, workload monitoring and Harvest Admin with NAbox

tawny chasm
#

just wondering where I can buy some swag!

gilded lance
#

DM me the names of your account team. No promises, but I’ll see what I can do.

subtle plover
#

@tawny chasm, glad to hear that you're all set up with Harvest. We've also addressed the issue (https://github.com/NetApp/harvest/issues/2621) that we identified during our conversation, as well as the permission issue related to admin vserver addition in our documentation. You can download our nightly build (https://github.com/NetApp/harvest/releases/tag/nightly) to access these fixes. If you encounter the issue detailed in https://github.com/NetApp/harvest/issues/2544, the fix is available in the nightly build as well.

tawny chasm
#

I will be doing a nightly build upgrade before the weekend, I presume. I just need to handle some other things as well, then I'll get time to sit down with it.

tawny chasm
#

Hey guys, I have a couple of operational questions. I think I have kind of gathered what you can do, but I am also unsure about some metrics here.

#
  1. As you can change the db engine to another service, you might as well put that on a remote db host, right?
#
  1. Since it's a dockerized setup, you could essentially split any of the containers onto individual hosts?
#

I am looking to move the db over to a separate host from Grafana itself, so that I can have it somewhat hidden in the background

#

and I would like to have the pollers somewhere closer to the db, in a more central position in our infra

#

and then place the web parts of it, i.e. the viewer, behind an haproxy or nginx LB config maybe, reverse proxied with an active IP and a heartbeat failover mechanism, I suppose

#

Do you guys have any input or suggestions? I would kind of look to get as close to FIPS 140-2 as possible on containment of the data, as in encryption of the db with sufficient ciphers, and so on

#

and, in the same process, try to secure it against disloyal employees, or accidental mishandling by employees

subtle plover
tawny chasm
#

yeah ive seen that, and i do run it in a docker compose

#

what I am asking for is merely some tips or suggestions, as I suppose you have already fooled around with those things a bit beforehand

#

or are there any clear don'ts regarding my idea?

subtle plover
tawny chasm
#

thank you :))

#

Also, probably the final one for today: I had a question on the EMS alerts

subtle plover
#

Sure

tawny chasm
#

i got 3 alarms today,

#

or well, 3 types

#

is this one a % change in total data, on a per-volume basis

#

within the last 4 weeks?

#

i.e. what triggers it, what does it measure, and how can you identify false positives among those alerts?

#

this is the other

#

that being the 3rd

#

i.e. how can I know what is actually being put out, and the reasons? Can we get the actual "trigger" out somewhere?

subtle plover
tawny chasm
#

Ok, I understand, and it's obviously based on the measurements of ONTAP

#

but are you able, to get the

#

actual trigger out in Harvest

#

or how would I best let the SOC assess the alert and be able to verify it?

#

starting with the fact that those alerts don't say anything to them, except that something on a volume is triggering an alarm

#

sorry if I'm asking stupid questions, I'm evaluating a broader picture for a 3-datacenter redundant setup

subtle plover
tawny chasm
#

Ok I see, I think I understand: you can, but you need to define which fields you want to pluck and what data to present back from that polling. Ok.
For long-term retention of Harvest data:

#

do you have any reference points that one should not go past?

#

or is it just that, as long as you have enough resources, you can keep consuming? Is there nothing in Harvest itself that can't handle "x amount of rows"?

#

(it's about a 3DC stretched cloud environment with VMware as the main vendor, then SAN storage, and NAS storage with both hardware-replicated snaps, and then S3/archive tiering at the bottom layer)

fully active active and redundant

subtle plover
#

Could you explain your question better? Harvest doesn't have any limit on number of rows it can handle.

tawny chasm
#

and I need to set up some redundant solution for NAbox/Harvest that we can base some of the monitoring on, which we will use to help keep this environment 99.96% up, or whatever we need to be at

#

for some similar solutions you could say that good advice would be not to keep more than 90 days of history in the active db

#

and put the rest in an "archived" state or something. I don't know if I'm able to phrase the question well either, sorry

tawny chasm
subtle plover
#

I think you mean data retention limit in database?

tawny chasm
#

yes

#

like, at what point would it become counterproductive and slow things down, or is that not a factor?

subtle plover
tawny chasm
#

and/or dependent on the db engine

#

thanks