#AFF A800 BMC not responding


chilly trellis
#

I've got a naughty BMC in an A800. The firmware update failed with: sp.servprocd.upd.unexpt.evts: reason="Unable to transfer SP firmware image using network interface"
I tried to ssh into the BMC and get the password prompt, but then it hangs. SSH debug shows (slightly sanitized):
admin@xxx-xxx-sp's password:
debug3: send packet: type 50
debug2: we sent a password packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (password).
Authenticated to xxx-xxx-sp ([1.2.3.4]:22).

but I never get an SP prompt.
I've tried rebooting the SP with no change in behavior.

It's not dead according to the sp_configuration.txt file:
version

Booted primary firmware version 10.7
Primary firmware version 10.7
Backup firmware version 10.6
uptime

10:04:55 up 463 days, 10:43, 0 users, load average: 1.56, 1.38, 1.40
date

Fri Oct 13 10:04:55 UTC 2023

sp_system_event_log.txt does not show any recent heartbeats - they stopped on 9/23
Any suggestions on what to try?
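
For anyone reading the uptime and date output above, here is a minimal Python sketch that works out when the SP last booted. It assumes the two lines are in exactly the format pasted here (a "Day Mon DD HH:MM:SS UTC YYYY" date and an uptime of "N days, HH:MM"); the function name boot_time is made up for illustration and is not part of any NetApp tooling.

```python
import re
from datetime import datetime, timedelta


def boot_time(date_line: str, uptime_line: str) -> datetime:
    """Estimate the boot time from pasted `date` and `uptime` output.

    Assumption: the uptime line contains "up N days, HH:MM". Shorter
    uptimes (e.g. minutes only) are not handled by this sketch.
    """
    now = datetime.strptime(date_line.strip(), "%a %b %d %H:%M:%S UTC %Y")
    match = re.search(r"up (\d+) days?,\s+(\d+):(\d+)", uptime_line)
    if match is None:
        raise ValueError("unrecognized uptime format")
    days, hours, minutes = map(int, match.groups())
    return now - timedelta(days=days, hours=hours, minutes=minutes)


if __name__ == "__main__":
    booted = boot_time(
        "Fri Oct 13 10:04:55 UTC 2023",
        "10:04:55 up 463 days, 10:43, 0 users, load average: 1.56, 1.38, 1.40",
    )
    print(booted.isoformat(sep=" "))
```

With the values pasted above, this puts the last boot in early July 2022.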

vagrant lotus
#

Try:

system service-processor reboot-sp -node [nodename] -image backup

If that succeeds, attempt the upgrade again.

chilly trellis
#

That let me ssh into the BMC but the update is failing from there:
Downloading package...wget: can't open '/mnt/sapps/RLM_FW.pkg': No space left on device

vagrant lotus
#

I'm not somewhere where I can search the NetApp KB right now, but I bet if you search for "BMC update no space left on device", you'll find a solution for that.

chilly trellis
#

I see that 10.6 and 10.7 were on there. I deleted 10.7 since I'm on the 10.6 backup image. Still no space.
I finally beat it into submission by using ftp instead of http for the download. Thanks for the help!

tame temple
#

How did you get the sp firmware over? Sounds like possibly vol0 might be full?

chilly trellis
#

I ended up doing it by signing on to the BMC directly and then bmc update ftp://...

wicked verge
#

I see it did work for you. FTP, a classic solution 😄 +1 for double-checking whether the rootvol is full; that can cause all kinds of issues.

chilly trellis
#

I got the SP firmware updated but still have an alert on the Service-Processor subsystem. All SPs in the cluster are at 10.8.

tame temple
#

What’s the alert?

#

Is it coming from “system health alert show” or a system event?

chilly trellis
#

system health subsystem show

#

system health alert show also reports it but it's more historical - it's reporting the firmware update failure whereas I was able to successfully update the firmware

warm pond
#

system health subsystem show does the same. It simply shows "degraded" for a subsystem if there is currently an alert against it. If you remove the alert, system health subsystem show will go back to "OK".
Sometimes alerts are obsolete but don't get removed automatically (many times it works but not always).

#

With system health alert show -fields monitor you can see which monitor was responsible for the alert, and then delete it with system health alert delete -node [nodename] -monitor [monitor]. If the issue causing the alert is still relevant, the alert will reappear after some time. If it doesn't, it was obsolete.
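
The delete-and-wait check described above can be sketched in Python. This is only an illustration of the reasoning, not a tool that talks to ONTAP: the function names, the node name "node-01" and the monitor names are all hypothetical, and the command string is just the one quoted in this thread with the placeholders filled in.

```python
def delete_command(node: str, monitor: str) -> str:
    """Build the delete command quoted in the thread for one node/monitor pair."""
    return f"system health alert delete -node {node} -monitor {monitor}"


def classify_after_delete(deleted: set[str], seen_later: set[str]) -> dict[str, str]:
    """Label each deleted monitor by whether its alert came back.

    deleted:    monitors whose alerts were deleted by hand
    seen_later: monitors that show an alert again after waiting a while
    """
    return {
        monitor: "still relevant" if monitor in seen_later else "obsolete"
        for monitor in sorted(deleted)
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(delete_command("node-01", "example-monitor"))
    print(classify_after_delete({"monitor-a", "monitor-b"}, {"monitor-b"}))
```

How long to wait before deciding an alert is obsolete is not stated in the thread, so the sketch leaves that to the operator.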

chilly trellis
#

Weird fix of the day (it's my month so far for weird alerts)
before:
primary 10.8
backup 10.7

sp reboot-sp -node xxxx -image backup

After reboot:
primary 10.8
backup 10.7

alert cleared.
sp.servprocd.upd.evts: reason="PostUpdate Check : SP firmware post-update check has PASSED "

The reboot into the backup image didn't even take effect (the versions are unchanged), but it cleared the alert, with the correct and current version still in place.
Halloween is coming...