#ACP elements reports error?

fallen valley
#

We have a FAS2720 that reports errors on the ACP elements. From "node run -node xxx -command environment status" we can see that one of the shelves (the one that has the controllers installed) is reporting errors on both ACP elements: "ACP installed element list: 1, 2; with error: 1, 2". We have run a few sasadmin commands and could see a few "Invalid DWord Count" entries, which we have reset as we do not think they are related (there have not been any new counts since). All SAS lanes are linked at 12G. We then tried rebooting the service processors and disabling and re-enabling the "shelf acp", but the error persists. The health subsystem shows OK with no alerts, yet our monitoring system picks up the errors on the modules. Any good ideas? Or is this another case for the support guys?
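For anyone who wants to watch for this from a script rather than by eye, here is a minimal Python sketch that pulls the element numbers out of a line shaped like the one quoted above. It assumes the line always follows that exact "ACP installed element list: ...; with error: ..." layout, which may not hold on every ONTAP release; `parse_acp_line` is a made-up helper name, not part of any NetApp tooling.

```python
import re

# Assumption: the status line looks like the one quoted in the post above.
# Real "environment status" output may be formatted differently.
ACP_LINE = re.compile(
    r"ACP installed element list:\s*(?P<installed>[\d,\s]*?)\s*;"
    r"\s*with error:\s*(?P<errors>[\d,\s]*)"
)


def parse_acp_line(line):
    """Return (installed, with_error) as lists of ints, or None if no match."""
    match = ACP_LINE.search(line)
    if match is None:
        return None

    def to_ints(text):
        # Split on commas and skip empty pieces such as a trailing blank.
        return [int(part) for part in text.split(",") if part.strip()]

    return to_ints(match.group("installed")), to_ints(match.group("errors"))


if __name__ == "__main__":
    sample = "ACP installed element list: 1, 2; with error: 1, 2"
    installed, with_error = parse_acp_line(sample)
    print("installed:", installed, "with error:", with_error)
```

A monitoring check could then alert whenever the second list is non-empty, which is roughly what the monitoring system in this case appears to be doing.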

serene fox
#

you can try node run ... acpadmin ibacp_list_all and check if there's a shelf module that's missing, or that has missing data. e.g. the ACP modules should all be in the same state (0xd) and with the same status (0x20)
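As a rough illustration of that check, the Python sketch below flags any module whose state or status differs from the 0xd / 0x20 values mentioned above. It works on values that have already been read off the `ibacp_list_all` output by hand; the output format itself is not shown in this thread, so no parsing is attempted, and the module names and the `find_odd_modules` helper are purely hypothetical.

```python
# Expected values are taken from the message above.
EXPECTED_STATE = 0xD
EXPECTED_STATUS = 0x20


def find_odd_modules(modules):
    """Return the names of modules whose (state, status) pair is unexpected.

    `modules` maps a module name to a (state, status) tuple. The names are
    whatever labels you choose; they are not parsed from real output here.
    """
    return sorted(
        name
        for name, (state, status) in modules.items()
        if state != EXPECTED_STATE or status != EXPECTED_STATUS
    )


if __name__ == "__main__":
    # Hypothetical values copied by hand from the command output.
    seen = {"0b.0.A": (0xD, 0x20), "0b.0.B": (0xD, 0x1F)}
    print(find_odd_modules(seen))
```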

#

if required, you can reboot a single acpp processor with a command like node run ... acpadmin acpp_cli 0a.12.B reboot (change the hba, shelf ID and module number to match the one you want to reboot, 0a.12.B is just an example)

fallen valley
#

All are status 0x20... are you sure I can just reboot the acp on the controller node? Because it seems to be on the shelf 0 (which is the shelf where the two controllers are located)...

serene fox
#

I have done it successfully and non-disruptively with IOM6-based shelves in the past. I never tried it with an integrated shelf though...

#

after quickly checking it on a FAS26xx it seems like it doesn't work on internal shelves 😦

#

have you tried a takeover/giveback yet?

fallen valley
#

nope. but I guess I might as well, before opening a support case..

serene fox
#

that's probably the first thing they ask you to do anyway

fallen valley
#

yeah... rebooted both nodes but no difference...

serene fox
#

there's another command to reset the ACP processor, that also works nondisruptively on internal shelves that you might want to try

#

in node run, use this: sasadmin expander_cli 1a.5.A "acpmgr_tp 3 0" (for module A) or sasadmin expander_cli 1a.5.B "acpmgr_tp 3 1" for module B (note the last number must match the module, i.e. 0 for A and 1 for B)

#
Node11*> sasadmin expander_cli 0b.00.A "acpmgr_tp 3 0"
 Successfully sent reset cmd to ACP.

reset takes about 40 seconds
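Since the trailing number has to match the module letter, a small helper can make the pairing harder to get wrong. The Python sketch below only builds the command string in the form shown above (A maps to 0, B maps to 1); it does not run anything against a system, and `acp_reset_command` is a hypothetical name used for illustration.

```python
# Mapping stated above: the last number is 0 for module A and 1 for module B.
MODULE_INDEX = {"A": 0, "B": 1}


def acp_reset_command(adapter, shelf, module):
    """Build the nodeshell reset command string for one shelf module.

    Only string formatting happens here; the result still has to be pasted
    into "node run" by hand.
    """
    module = module.upper()
    if module not in MODULE_INDEX:
        raise ValueError(f"module must be A or B, got {module!r}")
    index = MODULE_INDEX[module]
    return f'sasadmin expander_cli {adapter}.{shelf}.{module} "acpmgr_tp 3 {index}"'


if __name__ == "__main__":
    for letter in ("A", "B"):
        print(acp_reset_command("0b", "0", letter))
```

Building both commands this way gives one line ending in `.A "acpmgr_tp 3 0"` and one ending in `.B "acpmgr_tp 3 1"`, so the module letter and the index cannot drift apart.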

#

not sure if a regular reboot also resets the ACPP (it might not), so that's something you can try too

fallen valley
#

OK how do I see which IOM module to reset?

#

I have tried these two: fs07-dkaar1-02*> sasadmin expander_cli 0b.0.B "acpmgr_tp 3 0" Successfully sent reset cmd to ACP.

#

fs07-dkaar1-02*> sasadmin expander_cli 0b.0.B "acpmgr_tp 3 1" Successfully sent reset cmd to ACP.

#

But it does not make any difference to the environmental failure...

#

I think I need to bite the bullet and try to open a case.

serene fox
#

hm, yeah, probably best to open a case then...

smoky barn
#

Hello there

#

Got a similar issue on a FAS2650, interesting (failing ACP on IOM12E, on both IOMs)

#

Support made us replace one controller, no dice, and going to send us another one.

#

But I think it will change nothing, I'm betting more on a complete shelf power cycle, but obviously... not a fan of the idea of shutting down the service.

#

I'm not sure resetting the ACP is enough, at least in my case, as we already tried reseating both nodes (which AFAIK should have reset the ACPP, obviously)

#

Since this is a DS224C shelf, I'm betting more on i2c shenanigans or similar, been there...

sly cloud
#

@smoky barn If it is on the IOM12E, that is the controller itself. The 'E' represents 'embedded.'

fallen valley
# smoky barn Hello there

OK, that's what I was thinking after trying almost everything else 🙂 I will get support on the horn once Easter has passed. Thankfully the system works even with this error...

smoky barn
#

@sly cloud Yep, that I know, I was maybe a bit imprecise. From experience, I'm quite confident both controllers (and thus the IOMs) are fine, and the problem is the shelf itself. We'll know soon enough, the second controller replacement is on the way 🙂

ocean cape
#

The ACP functionality for the embedded controller expander (IOM12E) is actually handled by the BMCs in the controllers. I'd look at rebooting the BMCs and upgrading them to the latest version as a general hygiene thing.

sly cloud
#

@fallen valley That is a strong possibility since SAS-3 has ACP in-band.

smoky barn
#

@ocean cape Already tried unseating both nodes, and they were already up to date (we try to stay that way when possible), so the problem is not a simple software issue 🙂

#

But I did not know ACP was handled by the BMC on embedded IOMs, that's good to know!

smoky barn
#

Well, the first node was replaced, no dice, but replacing the second one seems to have fixed the issue. My best guess is that there was a hardware issue (especially since ACP was effed on both nodes during the problem, and since both nodes had been reseated / BMC-updated). To our local NetApp HW experts: do you think one IOM12E BMC could be responsive / functional for everything except ibACP in some cases, and make ACP malfunction on both IOM12Es?