#ZHA quietly crashing

1 messages · Page 1 of 1 (latest)

short stirrup
#

Just got a few [bellows.zigbee.application] Watchdog heartbeat timeout: but it hasn't ccrashed completely

smoky tinsel
#

That definitely shouldn't happen

short stirrup
#

Do I really need homeassistant.core: debug It makes the logs noisy AF

uneven urchin
#

It will help us tell if other integrations are choking the loop

#

If you have a misbehaving integration that blocks the event loop zha will choke out

#

EZSP is very latency sensitive

short stirrup
#

alright, when it crashes and if the currently enabled logs don't help, I'll turn that on.

#
2022-08-03 09:35:07 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:37 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:52 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: 
2022-08-03 09:35:52 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback ._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue```
smoky tinsel
#

The whole log file is most helpful. These are just symptoms of EZSP crashing and a bug with pyserial during reconnect, which should not be happening to begin with.

short stirrup
#

it didn't crash though but some interesting logs. We're still connected to the coordinator

#

When the true crash happens, I get a log entry on the coordinator [D][streamserver:055]: Client 10.1.10.2 disconnected

#

enabling homeassistant.core: debug and restarting

#

The wall of text, it burns!

uneven urchin
#

Have you shared a diagnostic file for zha?

#

need to start comparing setups and things like that across these cases

short stirrup
#

I have not, how do I do that?

#

the download diagnostics for the integration?

uneven urchin
#

Yes

short stirrup
#

what's the preferred file sharing method?

uneven urchin
#

You can post it here

short stirrup
#

hmm, can't attach a file. Want a dump here?

smoky tinsel
#

Sending it over PM is easier, it's way too big to fit in any reasonable number of comments

uneven urchin
#

You can dm to me or just post it to the GH issue

#

Either way is fine

short stirrup
#

DM'd to dmulcahey

uneven urchin
#

What is your avg CPU usage

#

in system health

#

is there any chance at all frigate is getting busy when this happens?

#

I ask because it’s pretty resource intense and I actually won’t run it on my prod box. Had it set up with 2 corals and 4 cameras and it was making the box run hot

short stirrup
#

No doubt the CPU is busy, I don' thave a avg CPU usage in system health (in Home Assistant) top shows load average: 6.38, 5.45, 5.02

uneven urchin
#

what does the cpu sit at?

short stirrup
#

Frigate has always been busy but ZHA dropping is new in 2022.7

uneven urchin
#

yeah the full debug log will be helpful when it happens

short stirrup
#

cpu is between 60-70% utilized, all day long

uneven urchin
#

Wow

#

I’m sitting at 1.9

short stirrup
#

This machine gets pushed hard, all the time, but it hasn't caused issues with ZHA until now.

uneven urchin
#

with nearly 25 integrations and probably north of 500 total devices

#

on a HA Blue

short stirrup
#

Ya, this is a full Linux server running 37 docker containers. Quad core Xeon processor, 16GB of RAM and about 60TB of storage

smoky tinsel
#

@autumn knot Let's migrate this discussion here

#

Is the load average significantly different with the new HA OS? Or are the figures you posted for the release that works?

short stirrup
#

Is that to me?

smoky tinsel
#

Both 😄

short stirrup
#

no significant change in proc usage in 2022.7

autumn knot
#

I don't know the historical load average and to be honest I never really found it a useful metric of anything, I would usually look at CPU %

#

But I am on the latest HA OS with a downgraded HA core version and everything is working fine

#

For me it was crashing about once every 1-3 days, for e.g. I can see in my log from yesterday 7 AM the NCP entered failed state message, but since I am on 2022.6.7 it recovered without any noticeable issue.

smoky tinsel
#

So the latest OS but downgraded Core still works. Interesting.

autumn knot
#

Yea, I'm not even sure how you would downgrade the OS... maybe I should start taking snapshots

short stirrup
#

No crashes yet. Not sure if something changed or that all of the debugging has changed the environment in such a way that the error won't occur.

subtle basalt
#

debug logging actually ups IO - writing a lot more data to log

autumn knot
#

I can simulate a connection drop and the logs appear nothing like the crash issue

short stirrup
#

off topic, should Last Seen update regularly? I'm looking at the Zigbee Coordinator an it only shows the last restart of Home Assistant.

autumn knot
#

I just simulated a connection drop and my last seen appears to be when the connection was reestablished

uneven urchin
#

devices should update each time we have any communication with them at all

short stirrup
#

off to lunch for a bit. I'll update with any changes. Thanks for y'alls engagement and assistance.

short stirrup
#

I had a crash but it restarted and re-established the connection to the coordinator. Logs have been shared with @uneven urchin

uneven urchin
#

I sent them to puddly as well

short stirrup
#

cool

#

lots of PII in those if you wouldn't mind keeping the audience small

smoky tinsel
#

Of course

short stirrup
#

welp, there's my Diun 🔔 notification to upgrade Home Assistant to 2022.8 :). I'll hold off while we troubleshoot unless you think I should just go for it

smoky tinsel
#

So regarding your log: I see five consecutive watchdog failures (coordinator doesn't respond to a simple ping), the connection is assumed to be dead and closed, and then everything reconnects

#

What exactly is broken here? Is this disconnect not happening with an older HA OS release?

subtle basalt
#

any CRC errors?

smoky tinsel
#

None

subtle basalt
#

(have seen in few cases when esp32 not working well)

uneven urchin
#

yeah everything starts right back up and is in working order....

#

before the crash is looks like automations were reloading several times

#

what's up w/ that?

short stirrup
#

I was editing an automation in the frontend editor

uneven urchin
#

and it "crashed" right after the last edit?

short stirrup
#

yes

uneven urchin
#

back to the original thought though... what is actually wrong

#

after the watchdog quits

#

ZHA spins right back up...

short stirrup
#

What brought me to that issue was Logger: zigpy.application Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:85 First occurred: 2:31:52 PM (1 occurrences) Last logged: 2:31:52 PM Couldn't start application

smoky tinsel
#

It's unfortunate that debug logging fixes the problem 😅

short stirrup
#

I know, I was really hoping if would crash while under debug. Just for sanity, I'm going to disable debug logging and see if it crashed out without a restart

subtle basalt
#

I fired up dev instance to see if I can replicate.

smoky tinsel
short stirrup
smoky tinsel
#

If you could post that it would be helpful

#

The traceback for that specific error message in the log

short stirrup
#

I've disabled debugging and am upgrading to 2022.8. I'm sorry I couldn't reproduce the error. If I can get it to happen again (and not automatically restart), I'll ping the thread. Thanks again for all of the help and guidance (and an amazing integration!)

smoky tinsel
#

No worries, thanks for checking it out

uneven urchin
#

Yeah, thanks for helping us poke at this

short stirrup
#

Welp it failed hard again

smoky tinsel
#

Any way you can get it to fail again with debug logging enabled 😆
I have a few theories but would like to confirm before blindly making changes to various packages and hoping this problem gets fixed

short stirrup
#

Yes. I've enabled debug logging this morning and will leave it be. Now I know it will eventually go again. I happened again around 12:30am so it's at least a bit consistent

short stirrup
#

I did notice there is zero indication of the failure. All of my Zigbee devices show the last state they were in. Battery readings are still there from the last poll. The only indication was that my FP1 didn't "see" me walk into my room. I reset it and checked the automation before going to logs to see the dreaded [zigpy.application] Couldn't start application

short stirrup
#

So far, I'm seeing disconnects with subsequent reconnects. That's what happens when I have debugging on. Still watching [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2

short stirrup
#

Alright, captured a full crash and sent to @uneven urchin

#

with debug on!

#
2022-08-04 16:46:06.833 DEBUG (MainThread) [bellows.zigbee.application] Cancelling watchdog
2022-08-04 16:46:06.834 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: 
2022-08-04 16:46:06.836 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue```
smoky tinsel
short stirrup
#

done

short stirrup
#

Is there any benefit to leaving it in a crashed state?

#

Or am I okay to turn off debugging and restart.

smoky tinsel
#

I think you're fine to restart normally, I'll take a look at the log in a bit

short stirrup
#

Thanks

short stirrup
#

Good morning. Did the logs provide anything useful?

smoky tinsel
#

Indeed, a potential fix should be in the next HA release

short stirrup
#

fantastic! Is this something I could install as a custom component now to test?

#

restarting HA 2-3 times a day is getting old 😉

smoky tinsel
short stirrup
#

Alright, giving this a shot. 1) clone the full repo 2) in config/custom_components, linked your version of zha

uneven urchin
#

the real release should be out either today or tomorrow I think... the . releases usually flow fairly quickly after a release

short stirrup
#

gotcha. Was hoping to test that fix a bit earlier.

uneven urchin
#

i get it... just putting it out there.. I'd try it early too if I was crashing like you are!

short stirrup
#

well, I get 08:44:21.196 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration zha which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant but it doesn't show the bumped version of bellows or zigpy so I missed something

short stirrup
#

no dice for a docker version. I manually upgraded the python package in both regular and virt envs and download diagnostics still shows them on their older versions.

#

no big, I can wait for the new release

smoky tinsel
#

HA auto installs the exact version in the manifest, unfortunately

#

So it'll downgrade. If you can control how HA starts, you can run hass --skip-pip.

subtle basalt
#

can you edit the mainfest.json in the installed component dir?

smoky tinsel
#

Yeah, if you are in the Docker container itself you can find the manifest.json of the actual ZHA integration and edit the versions in there.

short stirrup
#

did that, zha won't load now Unable to install package pyserial==3.4: ERROR: Cannot install pyserial==3.4 because these package versions have conflicting dependencies. ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

#

again, not worth the fuss, I can wait 😄

short stirrup
#

It's still dying but we're not dropping the connection.

#
Source: /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py:655
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM

ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 652, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 670, in _reset_controller
    await self.startup()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
    (status, network_key) = await ezsp.getKey(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
#
Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:96
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM

Couldn't start application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
    (status, network_key) = await ezsp.getKey(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
smoky tinsel
#

Does it function after the reset failures? It should re-establish the connection again, no matter how many attempts it takes.

short stirrup
#

Nope, fp1s go unavailable and connection is never reestablished

#

Back to capturing debug logs?

smoky tinsel
#

Hmmm, that's not what I was expecting.

#

Unfortunately. Does it keep retrying or is this the last you see of the error during reconnection?

short stirrup
#

That's the last. No more attempts

short stirrup
#

Well here's a new one ```Logger: bellows.ezsp.protocol
Source: runner.py:119
First occurred: August 7, 2022 at 10:18:46 PM (6 occurrences)
Last logged: 1:05:37 AM

Unknown application frame 0x0208 received: b'306a' (b'0080000802306a'). This is a bug!```

subtle basalt
short stirrup
#

Hey @subtle basalt is there any other testing I can do from the EFR32 side? I pinged for about 30-min to make sure there is no packet loss (none).

subtle basalt
#

you mean ESP32/ethernet side?

short stirrup
#

yep

subtle basalt
#

I guess you could run a ping for longer to look for disconnects.

short stirrup
#

Running a continuous ping with timestamps to see if the drops are network related. I did notice we lost connection at 3:08am which coincides with my server maintenance window, tons of disk IO for backups, snapraid calculations, etc. The server does lose connection with a few integrations at that time; sabnzbd, frigate, isy994 and chromecast all throw a few disconnect errors but re-establish successfully. So far, ZHA seems to be the only integration that doesn't gracefully recover.

#

on a positive note, I do see heartbeats fail at around 9am but connectivity is reestablished to the coordinator

smoky tinsel
short stirrup
#

Awesome, thank you @smoky tinsel !

short stirrup
#

That ethernet connection looking pretty stable - 6295/6295 packets, 0% loss, min/avg/ewma/max = 0.192/0.304/0.288/3.994 ms

subtle basalt
#

that's good. I think if you were having esp32 issues you'd be seeing CRC errors in the log as well

short stirrup
#

3 more unrecovered quiet crashes in the last 24-hours (16:30, 20:01 and 6:42), not coinciding with any system instability or packet loss. 77647/77647 packets, 0% loss, min/avg/ewma/max = 0.183/0.318/0.307/8.027 ms. You can see my worst latency to the coordinator was 8ms. Fingers crossed that @smoky tinsel next patch does the trick. Will additional logs be of any help?

#

FYI I did try to downgrade back to 2022.6 and none of my Zigbee devices work anymore. I assume some sort of 1-way upgrade was made This zigpy release uses database schema v8 but the database is v10. Downgrading zigpy is *not* recommended and may result in data loss. Use at your own risk.

smoky tinsel
short stirrup
#

I guess it's a good thing that my Zigbee network is only 4 devices or I'd be a sad panda. It will be great to get the FP1s up and working. When ZHA is up. they are magic.

uneven urchin
#

Yeah they are

short stirrup
short stirrup
#

hmm, that might have actually worked

      "bellows": "0.33.0.dev0",```
short stirrup
#

Still dying out with both the patched bellows and the longer backup cycle recommended in the bug.

#

Far, far less often but not stable yet

smoky tinsel
#

Hmmm Can you get a debug log of it not reconnecting? I've looked at about every corner case I can think of and can't see how it can still fail.

autumn knot
#

My zigbee crashed again with 2022.8.2

#

I have debug log enabled

#

NCP entered failed state. Requesting APP controller restart followed by Couldn't start application

smoky tinsel
#

2022.8.3 was released a few days ago with potential fixes, a debug log from 2022.8.2 likely won't contain anything useful

autumn knot
#

Uhhh I went to 2022.8.2 because I thought there was a fix

smoky tinsel
#

There was. And there's one in 2022.8.3 as well, and in 2022.8.4 whenever it will get released.

autumn knot
#

I just updated, would the log still be there and what exactly am I looking for?

#

And no I haven't added anything to my config but I updated on Sunday right after you said to give it a try so crashing after 4 days doesn't seem like extra activity is causing issues

smoky tinsel
#

If it breaks again and doesn't recover, that'd be it. I don't think it should with the correct network backup interval config from the linked comment.

autumn knot
#

How, unless I disable backup? For me the issue is not that it crashes, lost connection, or w/e. For me the issue is that this happens and even hours later is not recovered.

smoky tinsel
#

That's what I'm trying to address. If it doesn't re-connect, that's a bug. I believe it's only partially addressed in 2022.8.2, what I expect to be the final one will be in 2022.8.4 (or earlier if you want to try manually pinning bellows==0.32.0).

autumn knot
#

I have a log if you want it

smoky tinsel
#

Sure, PM it

short stirrup
#

I've enabled debug logging. Other than the diagnostic download that showed I'm running bellows 0.33.0.dev0, is there a way to make sure?

smoky tinsel
#

How did you install the dev version?

short stirrup
#

Using docker-compose, I mapped /usr/local/lib/python3.10/site-packages/bellows to a host directory containing your PR

#
volumes:
  - "/home/jshank/bellows/bellows/:/usr/local/lib/python3.10/site-packages/bellows"```
smoky tinsel
#

2022.8.4 was just released about a minute ago, that will reference the correct packages if you want to be 100% sure

#

But I believe mapping it should also work

short stirrup
#

cool. I'll take out the remap AND comment out the backup_period: 86400

#

once the update hits

uneven urchin
#

It’s out

short stirrup
#

It crashed overnight and I caught it with full logging

#

I sent the logs to @smoky tinsel and @uneven urchin

#

what is interesting about this is that I still see updates to [homeassistant.components.zha.core.device] [0x1B55](lumi.sensor_wleak.aq1): Update device availability - device available: True - new availability: True - changed: False in the logs. It just loses both FP1s after the crash and never recovers.

#

Ok, maybe this is something else. It looks like ZHA is actually staying up. I'm just losing both FP1s. 2022-08-13 06:45:26.861 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L]: entity_id=sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi, old_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=255; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-12T15:55:45.503465-07:00>, new_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=unavailable; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-13T06:45:26.861854-07:00>> 2022-08-13 06:46:08.301 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x08C5](lumi.motion.ac01): last_seen is 7610.337707281113 seconds ago and ping attempts have been exhausted, marking the device unavailable

smoky tinsel
#

This may be a lower level problem than I originally thought. Home Assistant switched from Python 3.9 to 3.10 in 2022.7.0, if I remember correctly, and I'm getting strange errors during testing with a makeshift Ethernet coordinator.

Can you try disabling the bellows UART thread? https://github.com/zigpy/bellows/blob/cb11d6cf9175b3a924a0c89ce57c93e4a54dd63c/bellows/uart.py#L357 . Change the True to a False and see if this changes anything.

ZHA won't be super stable if you disable the separate thread and if your HA Core has a lot of running integrations (hence the separate thread for bellows) but I'm unable to get bellows to misbehave if I disable the thread.

short stirrup
#

sure

#

so go back to pointing bellows to the external version and then change line 357 to =False?

smoky tinsel
#

Sure. If you can, point it to the Git repo head, not my feature branch.

short stirrup
#

k

#

ya, looks like I'm pointing at dev which is the same as HEAD

uneven urchin
#

@short stirrup have you checked to make sure your coordinator is using hashed link keys?

#

I’d assume you are if the FP1’s stay connected for any amount of time

short stirrup
#

not sure how to check that. It's been a few months since I setup @subtle basalt coordinator

uneven urchin
#

You can tell by looking at the data in a backup

short stirrup
#

just added it through the integrations section and put this in configuration.yaml zha: zigpy_config: source_routing: true ezsp_config: CONFIG_APS_UNICAST_MESSAGE_COUNT: 30 CONFIG_MAX_END_DEVICE_CHILDREN: 64 CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200 CONFIG_ROUTE_TABLE_SIZE: 16 CONFIG_ADDRESS_TABLE_SIZE: 32 CONFIG_PACKET_BUFFER_COUNT: 254 CONFIG_BINDING_TABLE_SIZE: 32 CONFIG_NEIGHBOR_TABLE_SIZE: 26

uneven urchin
#

Of the coordinator

short stirrup
#

alright, back on "versions": { "bellows": "0.33.0.dev0",

#

sorry to be dumb @uneven urchin but where do I find the backup to check?

uneven urchin
#

You’d have to take one

#

With bellows or Zigpy cli

#

@smoky tinsel you have the docs for that handy?

smoky tinsel
#

It's also in the zigbee.db file

#

I think the hashed link key is saved in the backup in a stack specific key

subtle basalt
#

bellows -d socket://ip:port info should show it too

smoky tinsel
#

It will also be in the startup debug log, NetworkInfo(...

short stirrup
#

Network info: NetworkInfo(extended_pan_id=redactifiedcb:86:dc, pan_id=0x5359, nwk_update_id=0, nwk_manager_id=0x0000, channel=15, channel_mask=<Channels.ALL_CHANNELS: 134215680>, security_level=5, network_key=Key(key=redactified, tx_counter=9814017, rx_counter=0, seq=0, partner_ieee=ff:ff:ff:ff:ff:ff:ff:ff), tc_link_key=Key(key=redactified, tx_counter=24576, rx_counter=0, seq=0, partner_ieee=04:cd:15:ff:fe:35:b7:10), key_table=[], children=[], nwk_addresses={}, stack_specific={'ezsp': {'hashed_tclk': 'redactified'}}, metadata={'ezsp': {'manufacturer': 'tubesZB', 'board': 'Series1 Pro V2', 'version': '6.10.3.0 build 297', 'stack_version': 8, 'can_write_custom_eui64': True}}, source='bellows@0.33.0.dev0')

#

hashed_tclk?

uneven urchin
#
Restoring NCP
(venv) ➜  home-assistant git:(dm/zha-coordinator-always-available) ✗ bellows -d /dev/cu.GoControl_zigbee$'\r' info[00:0d:6f:00:0a:ff:74:55]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=d8:1a:3b:68:86:49:fe:23, panId=0xcc6e, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
[<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.TRUST_CENTER_USES_HASHED_LINK_KEY|64|32|HAVE_TRUST_CENTER_LINK_KEY|GLOBAL_LINK_KEY: 244>, trustCenterLongAddress=00:0d:6f:00:0a:ff:74:55)]
Manufacturer: HubZ ZigBee
Board name: HUSBZB-1
EmberZNet version: 6.7.8.0 build 373
```
#

As an example

#

EmberCurrentSecurityState

short stirrup
#
[04:cd:15:ff:fe:35:b7:10]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=a2:16:f9:a7:c4:cb:86:dc, panId=0x5359, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
Exception in callback SerialTransport._read_ready()
handle: <Handle SerialTransport._read_ready()>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 120, in _read_ready
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 73, in data_received
    self.frame_received(frame)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 100, in frame_received
    self.data_frame_received(data)
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 120, in data_frame_received
    self._handle_ack(data[0])
  File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 245, in _handle_ack
    pending[1].set_result(True)
asyncio.exceptions.InvalidStateError: invalid state
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError```
#

probably because another copy is already talking to it (Home Assistant)

short stirrup
#

some weird behavior. The two battery operated zigbee devices lumi.sensor_wleak.aq1 and lumi.sensor_magnet.aq2 won't respond or become available. The two FP1s lumi.motion.ac01 seem to be working just fine.

#

I've restarted the coordinator and home assistant. Going to back out the custom bellows and see if anything changes...

#

Both battery devices work with the non-dev version of bellows. Reinstating dev and checking one more time for sanity sake...

#

well, scratch that. They work for both now. Weird

uneven urchin
#

Battery devices generally can’t be polled

#

Esp Xiaomi

#

if you do poll them with manage clusters it’s best to push the button on them right after issuing the command

#

Because they go to sleep to preserve battery

short stirrup
#

Another crash after changing the use_thread = false

#
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

#

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 653, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 674, in _reset_controller
    await self.startup()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 66, in startup
    await self.connect()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 134, in connect
#
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 90, in initialize
    await ezsp._protocol.initialize(zigpy_config)
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 76, in initialize
    await self._cfg(self.types.EzspConfigId[config], value)
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 35, in _cfg
    (status,) = await self.setConfigurationValue(config_id, value)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-08-13 16:42:58.785 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue```
#

What's weird is that it take a while for the devices to finally show unavailable. the controller crashed at 16:42 but the FP1 didn't actually go unavailable until 18:43, a bit over two hours..

uneven urchin
#

2 hours for mains devices and 6 for battery IIRC

#

it’s not uncommon for many zigbee devices to sleep

#

Many Xiaomi devices only report 1x per hour if they aren’t tripped

#

all 3 integrations do some flavor of buffering unavailable like this

short stirrup
#

Hey @smoky tinsel, no rush but when you get a moment, let me know if there's anything else you'd like me to try. I'm happy to debug as long as you'd like. I think I'm going to install the Sonoff Zig 3.0 USB device and bring up a Z2M container for a bit to stabilize the FP1s while we debug.

short stirrup
short stirrup
#

Zha quietly crashed last night with just the one leak sensor on it. I just updated to 2022.8.5

smoky tinsel
short stirrup
#

done jshank@shanknas:/mnt/media/docker-configs/hass$ docker exec -it hass bash bash-5.1# grep -n "await self.connect()" /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py 674: await self.connect() bash-5.1#

#

zigbee debug shows "source": "bellows@0.33.0.dev0"

short stirrup
#

Crashed out a little less than an hour later. I don't see device go unavailable until ~6-hours later when the leak sensor fails to check in and is marked unavailable. Sending logs to @smoky tinsel

short stirrup
#

I've backed out the modified version of bellows and disabled debugging. Let me know if you need any other testing done.

smoky tinsel
short stirrup
#

Awesome, thanks for your work @smoky tinsel. I'd be interested to know if it's ZHA -> EFR32 or something to do with my Docker/Linux setup.