#Shelf fault


nimble lichen
#

Amber light on the outside, but functionality does not seem to be impacted.

What would be the correct course of action here?

#

the box is being decommissioned soon. i am trying to understand whether this condition is something our interim-but-low-competence hw support can handle by swapping out a part; i would fully expect to hold their hand during the operation. But if we can expect nothing bad to happen for another month or so, then we will probably leave the cluster as it is. we already ordered Keystone from NetApp... and then we fired 1000 people to finance it. (bad joke)

rugged sphinx
#

You need to determine what failed; otherwise it's impossible to give you suggestions.
What's the output of this?

rows 0
run -node * "environment status"

nimble lichen
#

sehanna04st-01> environment status chassis all
Sensor Name     State         Current   Critical  Warning  Warning  Critical
                              Reading   Low       Low      High     High

PSU2            GOOD
PSU1            GOOD
Fan3            GOOD
Fan2            GOOD
Fan1            GOOD
SP Status       IPMI_HB_OK
mSATA Status    OK
mSATA Pres      PRESENT
sehanna04st-01>

#

full output coming up:

#

...am i interpreting this right: the circuit required to monitor the voltage is broken, but the power supply is still doing its job?

prime cedar
#

Might be. But it also might be that the PSU is slowly failing. I'd try reseating it to see if it recovers. Wait at least 30 s between pulling it out and plugging it back in.

nimble lichen
#

yeah, i was considering that, but i'd rather have a "slow fail" than a provoked instant fail, if you understand what i mean.

#

we just need the time to evacuate data from there

#

cluster going to trashbin soon

#

maybe i let my boss make the call 🙂

rugged sphinx
#

Well, currently you have a single point of failure since PSU1 is not working. Trying the reseat can only improve things.

nimble lichen
#

oh.. ok

#

in that case we will do that

#

thanks for your input gents

rugged sphinx
#

Do you see the same output regarding this shelf from the other node? Or is this a single-node system?

nimble lichen
#

it's a node pair; the output was captured in the clustershell and should show both?

#

one sec

#

$ cat env.txt | egrep 'Environmental failure on shelves on this channel?'
Environmental failure on shelves on this channel? yes
Environmental failure on shelves on this channel? no
Environmental failure on shelves on this channel? yes
Environmental failure on shelves on this channel? no
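
A side note on the filter above: in an extended regex the unescaped `?` makes the preceding `l` optional instead of matching a literal question mark, so the pattern matches every channel line, failing or not. A minimal sketch for counting only the failing channels, assuming `env.txt` holds the saved output and that each line ends in `yes` or `no` as shown; `count_env_failures` is a hypothetical helper name, not a NetApp command:

```shell
#!/bin/sh
# Hypothetical helper: count channels that report an environmental shelf
# failure in a saved capture of the "environment status" output.
count_env_failures() {
    # -E enables extended regex, -c prints the number of matching lines.
    # "\?" matches a literal question mark; " +" tolerates extra spaces.
    grep -Ec 'Environmental failure on shelves on this channel\? +yes' "$1"
}
```

Called as `count_env_failures env.txt` on the four lines above, this prints 2, matching the two `yes` lines.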

#

so, yeah - we get it twice

#

see also first screenshots from syslog

#

does swapping out the PSU require takeover? i hope not, right?

rugged sphinx
#

no

nimble lichen
#

then our interim support can handle it

rugged sphinx
#

Simply set the PSU switch to 0, pull the PSU out about 5 cm, wait 30 s, and push it back in.

nimble lichen
#

passed the instructions on to our data-center team

#

i am like one flight away from that place 🙂

#

you guys in raleigh?

rugged sphinx
#

Germany 😉

nimble lichen
#

Since when are there support people there?

#

Very nice 🙂

rugged sphinx
#

😄

nimble lichen
#

Back when I was still young and beautiful, I worked as a TSE in Schiphol 🙂

rugged sphinx
#

I don't work at NetApp, but at a partner.

nimble lichen
#

Oh, great...

rugged sphinx
#

The golden NetApp logo is for partners.

nimble lichen
#

learned something new again 🙂

#

(by the way, I'm in Finland, the cluster is in Sweden)

rugged sphinx
#

ah nice

#

I've never been there, but Amsterdam is worth a trip.

nimble lichen
#

it was quite nice back then

nimble lichen
#

reseating didn't work; we are now sourcing a spare somewhere