ZHA quietly crashing | Home Assistant | Page 1

short stirrup Aug 3, 2022, 4:38 PM

#

Just got a few [bellows.zigbee.application] Watchdog heartbeat timeout: but it hasn't ccrashed completely

smoky tinsel Aug 3, 2022, 4:38 PM

#

That definitely shouldn't happen

short stirrup Aug 3, 2022, 4:39 PM

#

Do I really need homeassistant.core: debug It makes the logs noisy AF

uneven urchin Aug 3, 2022, 4:39 PM

#

It will help us tell if other integrations are choking the loop

#

If you have a misbehaving integration that blocks the event loop zha will choke out

#

EZSP is very latency sensitive

short stirrup Aug 3, 2022, 4:41 PM

#

alright, when it crashes and if the currently enabled logs don't help, I'll turn that on.

#

2022-08-03 09:35:07 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:37 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:52 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:52 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback ._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue```

smoky tinsel Aug 3, 2022, 4:42 PM

#

The whole log file is most helpful. These are just symptoms of EZSP crashing and a bug with pyserial during reconnect, which should not be happening to begin with.

short stirrup Aug 3, 2022, 4:42 PM

#

it didn't crash though but some interesting logs. We're still connected to the coordinator

#

When the true crash happens, I get a log entry on the coordinator [D][streamserver:055]: Client 10.1.10.2 disconnected

#

enabling homeassistant.core: debug and restarting

#

The wall of text, it burns!

uneven urchin Aug 3, 2022, 4:50 PM

#

Have you shared a diagnostic file for zha?

#

need to start comparing setups and things like that across these cases

short stirrup Aug 3, 2022, 4:51 PM

#

I have not, how do I do that?

#

the download diagnostics for the integration?

uneven urchin Aug 3, 2022, 4:52 PM

#

Yes

short stirrup Aug 3, 2022, 4:52 PM

#

what's the preferred file sharing method?

uneven urchin Aug 3, 2022, 4:52 PM

#

You can post it here

short stirrup Aug 3, 2022, 4:54 PM

#

hmm, can't attach a file. Want a dump here?

smoky tinsel Aug 3, 2022, 4:54 PM

#

Sending it over PM is easier, it's way too big to fit in any reasonable number of comments

uneven urchin Aug 3, 2022, 4:54 PM

#

You can dm to me or just post it to the GH issue

#

Either way is fine

short stirrup Aug 3, 2022, 4:56 PM

#

DM'd to dmulcahey

uneven urchin Aug 3, 2022, 4:57 PM

#

What is your avg CPU usage

#

in system health

#

is there any chance at all frigate is getting busy when this happens?

#

I ask because it’s pretty resource intense and I actually won’t run it on my prod box. Had it set up with 2 corals and 4 cameras and it was making the box run hot

short stirrup Aug 3, 2022, 5:00 PM

#

No doubt the CPU is busy, I don' thave a avg CPU usage in system health (in Home Assistant) top shows load average: 6.38, 5.45, 5.02

uneven urchin Aug 3, 2022, 5:00 PM

#

what does the cpu sit at?

short stirrup Aug 3, 2022, 5:01 PM

#

Frigate has always been busy but ZHA dropping is new in 2022.7

uneven urchin Aug 3, 2022, 5:01 PM

#

yeah the full debug log will be helpful when it happens

short stirrup Aug 3, 2022, 5:01 PM

#

cpu is between 60-70% utilized, all day long

uneven urchin Aug 3, 2022, 5:02 PM

#

Wow

#

I’m sitting at 1.9

short stirrup Aug 3, 2022, 5:03 PM

#

This machine gets pushed hard, all the time, but it hasn't caused issues with ZHA until now.

uneven urchin Aug 3, 2022, 5:03 PM

#

with nearly 25 integrations and probably north of 500 total devices

#

on a HA Blue

short stirrup Aug 3, 2022, 5:07 PM

#

Ya, this is a full Linux server running 37 docker containers. Quad core Xeon processor, 16GB of RAM and about 60TB of storage

smoky tinsel Aug 3, 2022, 6:00 PM

#

@autumn knot Let's migrate this discussion here

#

Is the load average significantly different with the new HA OS? Or are the figures you posted for the release that works?

short stirrup Aug 3, 2022, 6:04 PM

#

Is that to me?

smoky tinsel Aug 3, 2022, 6:05 PM

#

Both 😄

short stirrup Aug 3, 2022, 6:05 PM

#

no significant change in proc usage in 2022.7

autumn knot Aug 3, 2022, 6:05 PM

#

I don't know the historical load average and to be honest I never really found it a useful metric of anything, I would usually look at CPU %

#

But I am on the latest HA OS with a downgraded HA core version and everything is working fine

#

For me it was crashing about once every 1-3 days, for e.g. I can see in my log from yesterday 7 AM the NCP entered failed state message, but since I am on 2022.6.7 it recovered without any noticeable issue.

smoky tinsel Aug 3, 2022, 6:06 PM

#

So the latest OS but downgraded Core still works. Interesting.

autumn knot Aug 3, 2022, 6:07 PM

#

Yea, I'm not even sure how you would downgrade the OS... maybe I should start taking snapshots

short stirrup Aug 3, 2022, 6:33 PM

#

No crashes yet. Not sure if something changed or that all of the debugging has changed the environment in such a way that the error won't occur.

subtle basalt Aug 3, 2022, 6:34 PM

#

debug logging actually ups IO - writing a lot more data to log

autumn knot Aug 3, 2022, 6:40 PM

#

I can simulate a connection drop and the logs appear nothing like the crash issue

short stirrup Aug 3, 2022, 6:43 PM

#

off topic, should Last Seen update regularly? I'm looking at the Zigbee Coordinator an it only shows the last restart of Home Assistant.

autumn knot Aug 3, 2022, 6:44 PM

#

I just simulated a connection drop and my last seen appears to be when the connection was reestablished

uneven urchin Aug 3, 2022, 6:46 PM

#

short stirrup off topic, should **Last Seen** update regularly? I'm looking at the Zigbee Coor...

the coordinator wont change often

#

devices should update each time we have any communication with them at all

short stirrup Aug 3, 2022, 6:47 PM

#

off to lunch for a bit. I'll update with any changes. Thanks for y'alls engagement and assistance.

short stirrup Aug 3, 2022, 8:50 PM

#

I had a crash but it restarted and re-established the connection to the coordinator. Logs have been shared with @uneven urchin

uneven urchin Aug 3, 2022, 8:50 PM

#

I sent them to puddly as well

short stirrup Aug 3, 2022, 8:51 PM

#

cool

#

lots of PII in those if you wouldn't mind keeping the audience small

smoky tinsel Aug 3, 2022, 8:52 PM

#

Of course

short stirrup Aug 3, 2022, 8:53 PM

#

welp, there's my Diun 🔔 notification to upgrade Home Assistant to 2022.8 :). I'll hold off while we troubleshoot unless you think I should just go for it

smoky tinsel Aug 3, 2022, 8:58 PM

#

So regarding your log: I see five consecutive watchdog failures (coordinator doesn't respond to a simple ping), the connection is assumed to be dead and closed, and then everything reconnects

#

What exactly is broken here? Is this disconnect not happening with an older HA OS release?

subtle basalt Aug 3, 2022, 9:01 PM

#

any CRC errors?

smoky tinsel Aug 3, 2022, 9:02 PM

#

None

subtle basalt Aug 3, 2022, 9:02 PM

#

(have seen in few cases when esp32 not working well)

uneven urchin Aug 3, 2022, 9:05 PM

#

yeah everything starts right back up and is in working order....

#

before the crash is looks like automations were reloading several times

#

what's up w/ that?

short stirrup Aug 3, 2022, 9:06 PM

#

I was editing an automation in the frontend editor

uneven urchin Aug 3, 2022, 9:07 PM

#

and it "crashed" right after the last edit?

short stirrup Aug 3, 2022, 9:07 PM

#

yes

uneven urchin Aug 3, 2022, 9:08 PM

#

back to the original thought though... what is actually wrong

#

after the watchdog quits

#

ZHA spins right back up...

short stirrup Aug 3, 2022, 9:09 PM

#

yep, this is a new behavior today, or since enabling the debug. Previous behavior was for it to go away and not restart. That's the same error as reported here - https://github.com/home-assistant/core/issues/75292

#

What brought me to that issue was Logger: zigpy.application Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:85 First occurred: 2:31:52 PM (1 occurrences) Last logged: 2:31:52 PM Couldn't start application

smoky tinsel Aug 3, 2022, 9:15 PM

#

It's unfortunate that debug logging fixes the problem 😅

short stirrup Aug 3, 2022, 9:16 PM

#

I know, I was really hoping if would crash while under debug. Just for sanity, I'm going to disable debug logging and see if it crashed out without a restart

subtle basalt Aug 3, 2022, 9:18 PM

#

I fired up dev instance to see if I can replicate.

smoky tinsel Aug 3, 2022, 9:27 PM

#

short stirrup What brought me to that issue was `Logger: zigpy.application Source: /usr/local/...

Was there any more info in the log? Any tracebacks?

short stirrup Aug 3, 2022, 10:09 PM

#

smoky tinsel Was there any more info in the log? Any tracebacks?

There was but I don't have the full log, nor was I debugging yesterday

smoky tinsel Aug 3, 2022, 10:10 PM

#

If you could post that it would be helpful

#

The traceback for that specific error message in the log

short stirrup Aug 3, 2022, 10:25 PM

#

I'm digging around, here is one, but it could be from startup https://pastebin.com/5u9Krh7T

#

I've disabled debugging and am upgrading to 2022.8. I'm sorry I couldn't reproduce the error. If I can get it to happen again (and not automatically restart), I'll ping the thread. Thanks again for all of the help and guidance (and an amazing integration!)

smoky tinsel Aug 3, 2022, 10:39 PM

#

No worries, thanks for checking it out

uneven urchin Aug 3, 2022, 10:53 PM

#

Yeah, thanks for helping us poke at this

short stirrup Aug 4, 2022, 5:12 AM

#

Welp it failed hard again

#

https://pastebin.com/HYQmkQ89

smoky tinsel Aug 4, 2022, 5:34 AM

#

Any way you can get it to fail again with debug logging enabled 😆
I have a few theories but would like to confirm before blindly making changes to various packages and hoping this problem gets fixed

short stirrup Aug 4, 2022, 2:48 PM

#

Yes. I've enabled debug logging this morning and will leave it be. Now I know it will eventually go again. I happened again around 12:30am so it's at least a bit consistent

short stirrup Aug 4, 2022, 3:10 PM

#

I did notice there is zero indication of the failure. All of my Zigbee devices show the last state they were in. Battery readings are still there from the last poll. The only indication was that my FP1 didn't "see" me walk into my room. I reset it and checked the automation before going to logs to see the dreaded [zigpy.application] Couldn't start application

short stirrup Aug 4, 2022, 6:14 PM

#

So far, I'm seeing disconnects with subsequent reconnects. That's what happens when I have debugging on. Still watching [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2

short stirrup Aug 5, 2022, 12:20 AM

#

Alright, captured a full crash and sent to @uneven urchin

#

with debug on!

#

[36m2022-08-04 16:46:06.833 DEBUG (MainThread) [bellows.zigbee.application] Cancelling watchdog[0m
[36m2022-08-04 16:46:06.834 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: [0m
[31m2022-08-04 16:46:06.836 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue[0m```

smoky tinsel Aug 5, 2022, 1:00 AM

#

short stirrup ```[31m2022-08-04 16:46:06.761 ERROR (MainThread) [zigpy.application] Couldn't ...

Great! Could you pm me the log?

short stirrup Aug 5, 2022, 1:02 AM

#

done

short stirrup Aug 5, 2022, 2:58 AM

#

Is there any benefit to leaving it in a crashed state?

#

Or am I okay to turn off debugging and restart.

smoky tinsel Aug 5, 2022, 2:59 AM

#

I think you're fine to restart normally, I'll take a look at the log in a bit

short stirrup Aug 5, 2022, 2:59 AM

#

Thanks

short stirrup Aug 5, 2022, 3:20 PM

#

Good morning. Did the logs provide anything useful?

smoky tinsel Aug 5, 2022, 3:24 PM

#

Indeed, a potential fix should be in the next HA release

short stirrup Aug 5, 2022, 3:27 PM

#

fantastic! Is this something I could install as a custom component now to test?

#

restarting HA 2-3 times a day is getting old 😉

smoky tinsel Aug 5, 2022, 3:30 PM

#

Sure. I don't remember exactly what the best way is nowadays to override dependencies manually but here are the new package versions: https://github.com/home-assistant/core/pull/76275/files

short stirrup Aug 5, 2022, 3:35 PM

#

Alright, giving this a shot. 1) clone the full repo 2) in config/custom_components, linked your version of zha

uneven urchin Aug 5, 2022, 3:40 PM

#

the real release should be out either today or tomorrow I think... the . releases usually flow fairly quickly after a release

short stirrup Aug 5, 2022, 3:46 PM

#

gotcha. Was hoping to test that fix a bit earlier.

uneven urchin Aug 5, 2022, 3:52 PM

#

i get it... just putting it out there.. I'd try it early too if I was crashing like you are!

short stirrup Aug 5, 2022, 5:17 PM

#

well, I get 08:44:21.196 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration zha which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant but it doesn't show the bumped version of bellows or zigpy so I missed something

smoky tinsel Aug 5, 2022, 5:42 PM

#

short stirrup well, I get `08:44:21.196 WARNING (SyncWorker_0) [homeassistant.loader] We found...

Hmm. Maybe try the custom deps addon? https://github.com/zigpy/zigpy-xbee#testing-new-releases

short stirrup Aug 5, 2022, 6:34 PM

#

no dice for a docker version. I manually upgraded the python package in both regular and virt envs and download diagnostics still shows them on their older versions.

#

no big, I can wait for the new release

smoky tinsel Aug 5, 2022, 6:34 PM

#

HA auto installs the exact version in the manifest, unfortunately

#

So it'll downgrade. If you can control how HA starts, you can run hass --skip-pip.

subtle basalt Aug 5, 2022, 6:36 PM

#

can you edit the mainfest.json in the installed component dir?

smoky tinsel Aug 5, 2022, 6:36 PM

#

Yeah, if you are in the Docker container itself you can find the manifest.json of the actual ZHA integration and edit the versions in there.

short stirrup Aug 5, 2022, 8:08 PM

#

did that, zha won't load now Unable to install package pyserial==3.4: ERROR: Cannot install pyserial==3.4 because these package versions have conflicting dependencies. ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

#

again, not worth the fuss, I can wait 😄

short stirrup Aug 8, 2022, 4:59 AM

#

It's still dying but we're not dropping the connection.

#

Source: /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py:655
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM

ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 652, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 670, in _reset_controller
    await self.startup()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
    (status, network_key) = await ezsp.getKey(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

#

Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:96
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM

Couldn't start application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
    (status, network_key) = await ezsp.getKey(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

smoky tinsel Aug 8, 2022, 5:01 AM

#

Does it function after the reset failures? It should re-establish the connection again, no matter how many attempts it takes.

short stirrup Aug 8, 2022, 5:01 AM

#

Nope, fp1s go unavailable and connection is never reestablished

#

Back to capturing debug logs?

smoky tinsel Aug 8, 2022, 5:02 AM

#

Hmmm, that's not what I was expecting.

#

Unfortunately. Does it keep retrying or is this the last you see of the error during reconnection?

short stirrup Aug 8, 2022, 5:04 AM

#

That's the last. No more attempts

short stirrup Aug 8, 2022, 2:42 PM

#

Well here's a new one ```Logger: bellows.ezsp.protocol
Source: runner.py:119
First occurred: August 7, 2022 at 10:18:46 PM (6 occurrences)
Last logged: 1:05:37 AM

Unknown application frame 0x0208 received: b'306a' (b'0080000802306a'). This is a bug!```

subtle basalt Aug 8, 2022, 3:05 PM

#

that's from this PR - apparently not everything is documented in the ezsp docs. https://github.com/zigpy/bellows/pull/478

short stirrup Aug 8, 2022, 4:28 PM

#

Hey @subtle basalt is there any other testing I can do from the EFR32 side? I pinged for about 30-min to make sure there is no packet loss (none).

subtle basalt Aug 8, 2022, 4:29 PM

#

you mean ESP32/ethernet side?

short stirrup Aug 8, 2022, 4:30 PM

#

yep

subtle basalt Aug 8, 2022, 4:30 PM

#

I guess you could run a ping for longer to look for disconnects.

short stirrup Aug 8, 2022, 5:21 PM

#

Running a continuous ping with timestamps to see if the drops are network related. I did notice we lost connection at 3:08am which coincides with my server maintenance window, tons of disk IO for backups, snapraid calculations, etc. The server does lose connection with a few integrations at that time; sabnzbd, frigate, isy994 and chromecast all throw a few disconnect errors but re-establish successfully. So far, ZHA seems to be the only integration that doesn't gracefully recover.

#

on a positive note, I do see heartbeats fail at around 9am but connectivity is reestablished to the coordinator

smoky tinsel Aug 8, 2022, 5:27 PM

#

https://github.com/zigpy/bellows/pull/481 should help, I think I was able to reproduce your problem with a USB stick last night.

short stirrup Aug 8, 2022, 5:30 PM

#

Awesome, thank you @smoky tinsel !

short stirrup Aug 8, 2022, 6:31 PM

#

That ethernet connection looking pretty stable - 6295/6295 packets, 0% loss, min/avg/ewma/max = 0.192/0.304/0.288/3.994 ms

subtle basalt Aug 8, 2022, 6:32 PM

#

that's good. I think if you were having esp32 issues you'd be seeing CRC errors in the log as well

short stirrup Aug 9, 2022, 2:51 PM

#

3 more unrecovered quiet crashes in the last 24-hours (16:30, 20:01 and 6:42), not coinciding with any system instability or packet loss. 77647/77647 packets, 0% loss, min/avg/ewma/max = 0.183/0.318/0.307/8.027 ms. You can see my worst latency to the coordinator was 8ms. Fingers crossed that @smoky tinsel next patch does the trick. Will additional logs be of any help?

#

FYI I did try to downgrade back to 2022.6 and none of my Zigbee devices work anymore. I assume some sort of 1-way upgrade was made This zigpy release uses database schema v8 but the database is v10. Downgrading zigpy is *not* recommended and may result in data loss. Use at your own risk.

smoky tinsel Aug 9, 2022, 3:58 PM

#

short stirrup FYI I did try to downgrade back to 2022.6 and none of my Zigbee devices work any...

You will have seen this in the logs since 2022.8.0, it's nothing new. If you want to try out the bellows patch before I make a new release, it would be quite helpful. I'm not too sure how you'd do it with a Docker install, maybe you can directly apply it within the container?

short stirrup Aug 10, 2022, 9:18 PM

#

I guess it's a good thing that my Zigbee network is only 4 devices or I'd be a sad panda. It will be great to get the FP1s up and working. When ZHA is up. they are magic.

uneven urchin Aug 10, 2022, 9:24 PM

#

Yeah they are

short stirrup Aug 11, 2022, 4:50 PM

#

smoky tinsel You will have seen this in the logs since 2022.8.0, it's nothing new. If you wan...

Thinking about how to test this. I can remap specific directories in the docker container. If I remap /usr/local/lib/python3.10/site-packages/bellows to a directory containing your PR, will that work? I also noticed there is a compiled version of bellows at /usr/local/bin/bellows. I assume that is a utility not used by HA.

short stirrup Aug 11, 2022, 6:51 PM

#

hmm, that might have actually worked

      "bellows": "0.33.0.dev0",```

short stirrup Aug 12, 2022, 4:33 AM

#

Still dying out with both the patched bellows and the longer backup cycle recommended in the bug.

#

Far, far less often but not stable yet

smoky tinsel Aug 12, 2022, 4:36 AM

#

Hmmm Can you get a debug log of it not reconnecting? I've looked at about every corner case I can think of and can't see how it can still fail.

autumn knot Aug 12, 2022, 4:59 AM

#

My zigbee crashed again with 2022.8.2

#

I have debug log enabled

#

NCP entered failed state. Requesting APP controller restart followed by Couldn't start application

smoky tinsel Aug 12, 2022, 5:02 AM

#

2022.8.3 was released a few days ago with potential fixes, a debug log from 2022.8.2 likely won't contain anything useful

autumn knot Aug 12, 2022, 5:05 AM

#

Uhhh I went to 2022.8.2 because I thought there was a fix

smoky tinsel Aug 12, 2022, 5:07 AM

#

There was. And there's one in 2022.8.3 as well, and in 2022.8.4 whenever it will get released.

#

Ah, never mind, nothing is in 2022.8.3. If you want to post/PM that debug log l can take a look. Did you also add the config from https://github.com/home-assistant/core/issues/75292#issuecomment-1210982416 as well?

autumn knot Aug 12, 2022, 5:33 AM

#

I just updated, would the log still be there and what exactly am I looking for?

#

And no I haven't added anything to my config but I updated on Sunday right after you said to give it a try so crashing after 4 days doesn't seem like extra activity is causing issues

smoky tinsel Aug 12, 2022, 5:35 AM

#

If it breaks again and doesn't recover, that'd be it. I don't think it should with the correct network backup interval config from the linked comment.

autumn knot Aug 12, 2022, 5:36 AM

#

How, unless I disable backup? For me the issue is not that it crashes, lost connection, or w/e. For me the issue is that this happens and even hours later is not recovered.

smoky tinsel Aug 12, 2022, 5:40 AM

#

That's what I'm trying to address. If it doesn't re-connect, that's a bug. I believe it's only partially addressed in 2022.8.2, what I expect to be the final one will be in 2022.8.4 (or earlier if you want to try manually pinning bellows==0.32.0).

autumn knot Aug 12, 2022, 5:44 AM

#

I have a log if you want it

smoky tinsel Aug 12, 2022, 5:45 AM

#

Sure, PM it

short stirrup Aug 12, 2022, 6:46 PM

#

I've enabled debug logging. Other than the diagnostic download that showed I'm running bellows 0.33.0.dev0, is there a way to make sure?

smoky tinsel Aug 12, 2022, 6:47 PM

#

How did you install the dev version?

short stirrup Aug 12, 2022, 6:49 PM

#

Using docker-compose, I mapped /usr/local/lib/python3.10/site-packages/bellows to a host directory containing your PR

#

volumes:
  - "/home/jshank/bellows/bellows/:/usr/local/lib/python3.10/site-packages/bellows"```

smoky tinsel Aug 12, 2022, 6:50 PM

#

2022.8.4 was just released about a minute ago, that will reference the correct packages if you want to be 100% sure

#

But I believe mapping it should also work

short stirrup Aug 12, 2022, 6:51 PM

#

cool. I'll take out the remap AND comment out the backup_period: 86400

#

once the update hits

uneven urchin Aug 12, 2022, 7:58 PM

#

It’s out

short stirrup Aug 13, 2022, 2:55 PM

#

It crashed overnight and I caught it with full logging

#

I sent the logs to @smoky tinsel and @uneven urchin

#

what is interesting about this is that I still see updates to [homeassistant.components.zha.core.device] [0x1B55](lumi.sensor_wleak.aq1): Update device availability - device available: True - new availability: True - changed: False in the logs. It just loses both FP1s after the crash and never recovers.

#

Ok, maybe this is something else. It looks like ZHA is actually staying up. I'm just losing both FP1s. 2022-08-13 06:45:26.861 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L]: entity_id=sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi, old_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=255; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-12T15:55:45.503465-07:00>, new_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=unavailable; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-13T06:45:26.861854-07:00>> 2022-08-13 06:46:08.301 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x08C5](lumi.motion.ac01): last_seen is 7610.337707281113 seconds ago and ping attempts have been exhausted, marking the device unavailable

smoky tinsel Aug 13, 2022, 6:57 PM

#

This may be a lower level problem than I originally thought. Home Assistant switched from Python 3.9 to 3.10 in 2022.7.0, if I remember correctly, and I'm getting strange errors during testing with a makeshift Ethernet coordinator.

Can you try disabling the bellows UART thread? https://github.com/zigpy/bellows/blob/cb11d6cf9175b3a924a0c89ce57c93e4a54dd63c/bellows/uart.py#L357 . Change the True to a False and see if this changes anything.

ZHA won't be super stable if you disable the separate thread and if your HA Core has a lot of running integrations (hence the separate thread for bellows) but I'm unable to get bellows to misbehave if I disable the thread.

short stirrup Aug 13, 2022, 6:58 PM

#

sure

#

so go back to pointing bellows to the external version and then change line 357 to =False?

smoky tinsel Aug 13, 2022, 7:00 PM

#

Sure. If you can, point it to the Git repo head, not my feature branch.

short stirrup Aug 13, 2022, 7:00 PM

#

k

#

ya, looks like I'm pointing at dev which is the same as HEAD

uneven urchin Aug 13, 2022, 7:04 PM

#

@short stirrup have you checked to make sure your coordinator is using hashed link keys?

#

I’d assume you are if the FP1’s stay connected for any amount of time

short stirrup Aug 13, 2022, 7:05 PM

#

not sure how to check that. It's been a few months since I setup @subtle basalt coordinator

uneven urchin Aug 13, 2022, 7:06 PM

#

You can tell by looking at the data in a backup

short stirrup Aug 13, 2022, 7:06 PM

#

just added it through the integrations section and put this in configuration.yaml zha: zigpy_config: source_routing: true ezsp_config: CONFIG_APS_UNICAST_MESSAGE_COUNT: 30 CONFIG_MAX_END_DEVICE_CHILDREN: 64 CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200 CONFIG_ROUTE_TABLE_SIZE: 16 CONFIG_ADDRESS_TABLE_SIZE: 32 CONFIG_PACKET_BUFFER_COUNT: 254 CONFIG_BINDING_TABLE_SIZE: 32 CONFIG_NEIGHBOR_TABLE_SIZE: 26

uneven urchin Aug 13, 2022, 7:06 PM

#

Of the coordinator

short stirrup Aug 13, 2022, 7:08 PM

#

alright, back on "versions": { "bellows": "0.33.0.dev0",

#

sorry to be dumb @uneven urchin but where do I find the backup to check?

uneven urchin Aug 13, 2022, 7:09 PM

#

You’d have to take one

#

With bellows or Zigpy cli

#

@smoky tinsel you have the docs for that handy?

smoky tinsel Aug 13, 2022, 7:10 PM

#

It's also in the zigbee.db file

#

I think the hashed link key is saved in the backup in a stack specific key

subtle basalt Aug 13, 2022, 7:11 PM

#

bellows -d socket://ip:port info should show it too

smoky tinsel Aug 13, 2022, 7:12 PM

#

It will also be in the startup debug log, NetworkInfo(...

short stirrup Aug 13, 2022, 7:15 PM

#

Network info: NetworkInfo(extended_pan_id=redactifiedcb:86:dc, pan_id=0x5359, nwk_update_id=0, nwk_manager_id=0x0000, channel=15, channel_mask=<Channels.ALL_CHANNELS: 134215680>, security_level=5, network_key=Key(key=redactified, tx_counter=9814017, rx_counter=0, seq=0, partner_ieee=ff:ff:ff:ff:ff:ff:ff:ff), tc_link_key=Key(key=redactified, tx_counter=24576, rx_counter=0, seq=0, partner_ieee=04:cd:15:ff:fe:35:b7:10), key_table=[], children=[], nwk_addresses={}, stack_specific={'ezsp': {'hashed_tclk': 'redactified'}}, metadata={'ezsp': {'manufacturer': 'tubesZB', 'board': 'Series1 Pro V2', 'version': '6.10.3.0 build 297', 'stack_version': 8, 'can_write_custom_eui64': True}}, source='bellows@0.33.0.dev0')

#

hashed_tclk?

uneven urchin Aug 13, 2022, 7:26 PM

#

Restoring NCP
(venv) ➜  home-assistant git:(dm/zha-coordinator-always-available) ✗ bellows -d /dev/cu.GoControl_zigbee$'\r' info[00:0d:6f:00:0a:ff:74:55]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=d8:1a:3b:68:86:49:fe:23, panId=0xcc6e, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
[<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.TRUST_CENTER_USES_HASHED_LINK_KEY|64|32|HAVE_TRUST_CENTER_LINK_KEY|GLOBAL_LINK_KEY: 244>, trustCenterLongAddress=00:0d:6f:00:0a:ff:74:55)]
Manufacturer: HubZ ZigBee
Board name: HUSBZB-1
EmberZNet version: 6.7.8.0 build 373
```

#

As an example

#

EmberCurrentSecurityState

short stirrup Aug 13, 2022, 7:28 PM

#

[04:cd:15:ff:fe:35:b7:10]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=a2:16:f9:a7:c4:cb:86:dc, panId=0x5359, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
Exception in callback SerialTransport._read_ready()
handle: <Handle SerialTransport._read_ready()>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 120, in _read_ready
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 73, in data_received
    self.frame_received(frame)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 100, in frame_received
    self.data_frame_received(data)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 120, in data_frame_received
    self._handle_ack(data[0])
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 245, in _handle_ack
    pending[1].set_result(True)
asyncio.exceptions.InvalidStateError: invalid state
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError```

#

probably because another copy is already talking to it (Home Assistant)

short stirrup Aug 13, 2022, 8:01 PM

#

some weird behavior. The two battery operated zigbee devices lumi.sensor_wleak.aq1 and lumi.sensor_magnet.aq2 won't respond or become available. The two FP1s lumi.motion.ac01 seem to be working just fine.

#

I've restarted the coordinator and home assistant. Going to back out the custom bellows and see if anything changes...

#

Both battery devices work with the non-dev version of bellows. Reinstating dev and checking one more time for sanity sake...

#

well, scratch that. They work for both now. Weird

uneven urchin Aug 13, 2022, 8:09 PM

#

Battery devices generally can’t be polled

#

Esp Xiaomi

#

if you do poll them with manage clusters it’s best to push the button on them right after issuing the command

#

Because they go to sleep to preserve battery

uneven urchin Aug 13, 2022, 8:10 PM

#

uneven urchin if you do poll them with manage clusters it’s best to push the button on them ri...

This will wake them up

short stirrup Aug 14, 2022, 3:46 AM

#

Another crash after changing the use_thread = false

#

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

#


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 653, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 674, in _reset_controller
    await self.startup()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 66, in startup
    await self.connect()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 134, in connect

#

  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 90, in initialize
    await ezsp._protocol.initialize(zigpy_config)
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 76, in initialize
    await self._cfg(self.types.EzspConfigId[config], value)
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 35, in _cfg
    (status,) = await self.setConfigurationValue(config_id, value)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError[0m
[31m2022-08-13 16:42:58.785 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue[0m```

#

What's weird is that it take a while for the devices to finally show unavailable. the controller crashed at 16:42 but the FP1 didn't actually go unavailable until 18:43, a bit over two hours..

uneven urchin Aug 14, 2022, 12:54 PM

#

2 hours for mains devices and 6 for battery IIRC

#

it’s not uncommon for many zigbee devices to sleep

#

Many Xiaomi devices only report 1x per hour if they aren’t tripped

#

all 3 integrations do some flavor of buffering unavailable like this

short stirrup Aug 15, 2022, 12:55 AM

#

Hey @smoky tinsel, no rush but when you get a moment, let me know if there's anything else you'd like me to try. I'm happy to debug as long as you'd like. I think I'm going to install the Sonoff Zig 3.0 USB device and bring up a Z2M container for a bit to stabilize the FP1s while we debug.

short stirrup Aug 15, 2022, 9:15 PM

#

Is anyone else experiencing this issue or am I the only one? I see that https://github.com/home-assistant/core/issues/75292 is still open and quite a few folks "participating"

short stirrup Aug 16, 2022, 2:11 PM

#

Zha quietly crashed last night with just the one leak sensor on it. I just updated to 2022.8.5

smoky tinsel Aug 16, 2022, 2:24 PM

#

short stirrup Zha quietly crashed last night with just the one leak sensor on it. I just updat...

Nothing new made it into 2022.8.5 but I believe I finally have your reconnect problem figured out. If you want to try something temporary, change https://github.com/zigpy/bellows/blob/cb11d6cf9175b3a924a0c89ce57c93e4a54dd63c/bellows/zigbee/application.py#L674 to await self.connect().

short stirrup Aug 16, 2022, 4:34 PM

#

done jshank@shanknas:/mnt/media/docker-configs/hass$ docker exec -it hass bash bash-5.1# grep -n "await self.connect()" /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py 674: await self.connect() bash-5.1#

#

zigbee debug shows "source": "bellows@0.33.0.dev0"

short stirrup Aug 17, 2022, 12:04 AM

#

Crashed out a little less than an hour later. I don't see device go unavailable until ~6-hours later when the leak sensor fails to check in and is marked unavailable. Sending logs to @smoky tinsel

short stirrup Aug 18, 2022, 3:26 PM

#

I've backed out the modified version of bellows and disabled debugging. Let me know if you need any other testing done.

smoky tinsel Aug 18, 2022, 3:51 PM

#

short stirrup I've backed out the modified version of bellows and disabled debugging. Let me k...

I'm going to be receiving both the ZBBridge and the same type of coordinator that you have from Tube so I think we can hold off on the remote debugging until I can replicate things at home. I really appreciate your help though, there were quite a few reconnect bugs that have been fixed so far (just not the one you're experiencing 😆)

short stirrup Aug 18, 2022, 4:50 PM

#

Awesome, thanks for your work @smoky tinsel. I'd be interested to know if it's ZHA -> EFR32 or something to do with my Docker/Linux setup.

#ZHA quietly crashing