#ZHA quietly crashing
1 messages · Page 1 of 1 (latest)
Just got a few [bellows.zigbee.application] Watchdog heartbeat timeout: but it hasn't ccrashed completely
That definitely shouldn't happen
Do I really need homeassistant.core: debug It makes the logs noisy AF
It will help us tell if other integrations are choking the loop
If you have a misbehaving integration that blocks the event loop zha will choke out
EZSP is very latency sensitive
alright, when it crashes and if the currently enabled logs don't help, I'll turn that on.
2022-08-03 09:35:07 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:
2022-08-03 09:35:22 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:
2022-08-03 09:35:37 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:
2022-08-03 09:35:52 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout:
2022-08-03 09:35:52 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback ._call_connection_lost(None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
self._serial.close()
File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
time.sleep(0.3)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
check_loop(func, strict=strict)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue```
The whole log file is most helpful. These are just symptoms of EZSP crashing and a bug with pyserial during reconnect, which should not be happening to begin with.
it didn't crash though but some interesting logs. We're still connected to the coordinator
When the true crash happens, I get a log entry on the coordinator [D][streamserver:055]: Client 10.1.10.2 disconnected
enabling homeassistant.core: debug and restarting
The wall of text, it burns!
Have you shared a diagnostic file for zha?
need to start comparing setups and things like that across these cases
Yes
what's the preferred file sharing method?
You can post it here
hmm, can't attach a file. Want a dump here?
Sending it over PM is easier, it's way too big to fit in any reasonable number of comments
DM'd to dmulcahey
What is your avg CPU usage
in system health
is there any chance at all frigate is getting busy when this happens?
I ask because it’s pretty resource intense and I actually won’t run it on my prod box. Had it set up with 2 corals and 4 cameras and it was making the box run hot
No doubt the CPU is busy, I don' thave a avg CPU usage in system health (in Home Assistant) top shows load average: 6.38, 5.45, 5.02
what does the cpu sit at?
Frigate has always been busy but ZHA dropping is new in 2022.7
yeah the full debug log will be helpful when it happens
cpu is between 60-70% utilized, all day long
This machine gets pushed hard, all the time, but it hasn't caused issues with ZHA until now.
Ya, this is a full Linux server running 37 docker containers. Quad core Xeon processor, 16GB of RAM and about 60TB of storage
@autumn knot Let's migrate this discussion here
Is the load average significantly different with the new HA OS? Or are the figures you posted for the release that works?
Is that to me?
Both 😄
no significant change in proc usage in 2022.7
I don't know the historical load average and to be honest I never really found it a useful metric of anything, I would usually look at CPU %
But I am on the latest HA OS with a downgraded HA core version and everything is working fine
For me it was crashing about once every 1-3 days, for e.g. I can see in my log from yesterday 7 AM the NCP entered failed state message, but since I am on 2022.6.7 it recovered without any noticeable issue.
So the latest OS but downgraded Core still works. Interesting.
Yea, I'm not even sure how you would downgrade the OS... maybe I should start taking snapshots
No crashes yet. Not sure if something changed or that all of the debugging has changed the environment in such a way that the error won't occur.
debug logging actually ups IO - writing a lot more data to log
I can simulate a connection drop and the logs appear nothing like the crash issue
off topic, should Last Seen update regularly? I'm looking at the Zigbee Coordinator an it only shows the last restart of Home Assistant.
I just simulated a connection drop and my last seen appears to be when the connection was reestablished
the coordinator wont change often
devices should update each time we have any communication with them at all
off to lunch for a bit. I'll update with any changes. Thanks for y'alls engagement and assistance.
I had a crash but it restarted and re-established the connection to the coordinator. Logs have been shared with @uneven urchin
I sent them to puddly as well
Of course
welp, there's my Diun 🔔 notification to upgrade Home Assistant to 2022.8 :). I'll hold off while we troubleshoot unless you think I should just go for it
So regarding your log: I see five consecutive watchdog failures (coordinator doesn't respond to a simple ping), the connection is assumed to be dead and closed, and then everything reconnects
What exactly is broken here? Is this disconnect not happening with an older HA OS release?
any CRC errors?
None
(have seen in few cases when esp32 not working well)
yeah everything starts right back up and is in working order....
before the crash is looks like automations were reloading several times
what's up w/ that?
I was editing an automation in the frontend editor
and it "crashed" right after the last edit?
yes
back to the original thought though... what is actually wrong
after the watchdog quits
ZHA spins right back up...
yep, this is a new behavior today, or since enabling the debug. Previous behavior was for it to go away and not restart. That's the same error as reported here - https://github.com/home-assistant/core/issues/75292
What brought me to that issue was Logger: zigpy.application Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:85 First occurred: 2:31:52 PM (1 occurrences) Last logged: 2:31:52 PM Couldn't start application
It's unfortunate that debug logging fixes the problem 😅
I know, I was really hoping if would crash while under debug. Just for sanity, I'm going to disable debug logging and see if it crashed out without a restart
I fired up dev instance to see if I can replicate.
Was there any more info in the log? Any tracebacks?
There was but I don't have the full log, nor was I debugging yesterday
If you could post that it would be helpful
The traceback for that specific error message in the log
I'm digging around, here is one, but it could be from startup https://pastebin.com/5u9Krh7T
I've disabled debugging and am upgrading to 2022.8. I'm sorry I couldn't reproduce the error. If I can get it to happen again (and not automatically restart), I'll ping the thread. Thanks again for all of the help and guidance (and an amazing integration!)
No worries, thanks for checking it out
Yeah, thanks for helping us poke at this
Any way you can get it to fail again with debug logging enabled 😆
I have a few theories but would like to confirm before blindly making changes to various packages and hoping this problem gets fixed
Yes. I've enabled debug logging this morning and will leave it be. Now I know it will eventually go again. I happened again around 12:30am so it's at least a bit consistent
I did notice there is zero indication of the failure. All of my Zigbee devices show the last state they were in. Battery readings are still there from the last poll. The only indication was that my FP1 didn't "see" me walk into my room. I reset it and checked the automation before going to logs to see the dreaded [zigpy.application] Couldn't start application
So far, I'm seeing disconnects with subsequent reconnects. That's what happens when I have debugging on. Still watching [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2 [D][streamserver:055]: Client 10.1.10.2 disconnected [D][streamserver:106]: New client connected from 10.1.10.2
Alright, captured a full crash and sent to @uneven urchin
with debug on!
[36m2022-08-04 16:46:06.833 DEBUG (MainThread) [bellows.zigbee.application] Cancelling watchdog[0m
[36m2022-08-04 16:46:06.834 DEBUG (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: [0m
[31m2022-08-04 16:46:06.836 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
self._serial.close()
File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
time.sleep(0.3)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
check_loop(func, strict=strict)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue[0m```
Great! Could you pm me the log?
done
Is there any benefit to leaving it in a crashed state?
Or am I okay to turn off debugging and restart.
I think you're fine to restart normally, I'll take a look at the log in a bit
Thanks
Good morning. Did the logs provide anything useful?
Indeed, a potential fix should be in the next HA release
fantastic! Is this something I could install as a custom component now to test?
restarting HA 2-3 times a day is getting old 😉
Sure. I don't remember exactly what the best way is nowadays to override dependencies manually but here are the new package versions: https://github.com/home-assistant/core/pull/76275/files
Alright, giving this a shot. 1) clone the full repo 2) in config/custom_components, linked your version of zha
the real release should be out either today or tomorrow I think... the . releases usually flow fairly quickly after a release
gotcha. Was hoping to test that fix a bit earlier.
i get it... just putting it out there.. I'd try it early too if I was crashing like you are!
well, I get 08:44:21.196 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration zha which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant but it doesn't show the bumped version of bellows or zigpy so I missed something
Hmm. Maybe try the custom deps addon? https://github.com/zigpy/zigpy-xbee#testing-new-releases
no dice for a docker version. I manually upgraded the python package in both regular and virt envs and download diagnostics still shows them on their older versions.
no big, I can wait for the new release
HA auto installs the exact version in the manifest, unfortunately
So it'll downgrade. If you can control how HA starts, you can run hass --skip-pip.
can you edit the mainfest.json in the installed component dir?
Yeah, if you are in the Docker container itself you can find the manifest.json of the actual ZHA integration and edit the versions in there.
did that, zha won't load now Unable to install package pyserial==3.4: ERROR: Cannot install pyserial==3.4 because these package versions have conflicting dependencies. ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
again, not worth the fuss, I can wait 😄
It's still dying but we're not dropping the connection.
Source: /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py:655
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM
ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 652, in _reset_controller_loop
await self._reset_controller()
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 670, in _reset_controller
await self.startup()
File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
await self.load_network_info(load_devices=False)
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
(status, network_key) = await ezsp.getKey(
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
Source: /usr/local/lib/python3.10/site-packages/zigpy/application.py:96
First occurred: 7:15:35 PM (1 occurrences)
Last logged: 7:15:35 PM
Couldn't start application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 70, in startup
await self.load_network_info(load_devices=False)
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 238, in load_network_info
(status, network_key) = await ezsp.getKey(
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
Does it function after the reset failures? It should re-establish the connection again, no matter how many attempts it takes.
Nope, fp1s go unavailable and connection is never reestablished
Back to capturing debug logs?
Hmmm, that's not what I was expecting.
Unfortunately. Does it keep retrying or is this the last you see of the error during reconnection?
That's the last. No more attempts
Well here's a new one ```Logger: bellows.ezsp.protocol
Source: runner.py:119
First occurred: August 7, 2022 at 10:18:46 PM (6 occurrences)
Last logged: 1:05:37 AM
Unknown application frame 0x0208 received: b'306a' (b'0080000802306a'). This is a bug!```
that's from this PR - apparently not everything is documented in the ezsp docs. https://github.com/zigpy/bellows/pull/478
Hey @subtle basalt is there any other testing I can do from the EFR32 side? I pinged for about 30-min to make sure there is no packet loss (none).
you mean ESP32/ethernet side?
yep
I guess you could run a ping for longer to look for disconnects.
Running a continuous ping with timestamps to see if the drops are network related. I did notice we lost connection at 3:08am which coincides with my server maintenance window, tons of disk IO for backups, snapraid calculations, etc. The server does lose connection with a few integrations at that time; sabnzbd, frigate, isy994 and chromecast all throw a few disconnect errors but re-establish successfully. So far, ZHA seems to be the only integration that doesn't gracefully recover.
on a positive note, I do see heartbeats fail at around 9am but connectivity is reestablished to the coordinator
https://github.com/zigpy/bellows/pull/481 should help, I think I was able to reproduce your problem with a USB stick last night.
Awesome, thank you @smoky tinsel !
That ethernet connection looking pretty stable - 6295/6295 packets, 0% loss, min/avg/ewma/max = 0.192/0.304/0.288/3.994 ms
that's good. I think if you were having esp32 issues you'd be seeing CRC errors in the log as well
3 more unrecovered quiet crashes in the last 24-hours (16:30, 20:01 and 6:42), not coinciding with any system instability or packet loss. 77647/77647 packets, 0% loss, min/avg/ewma/max = 0.183/0.318/0.307/8.027 ms. You can see my worst latency to the coordinator was 8ms. Fingers crossed that @smoky tinsel next patch does the trick. Will additional logs be of any help?
FYI I did try to downgrade back to 2022.6 and none of my Zigbee devices work anymore. I assume some sort of 1-way upgrade was made This zigpy release uses database schema v8 but the database is v10. Downgrading zigpy is *not* recommended and may result in data loss. Use at your own risk.
You will have seen this in the logs since 2022.8.0, it's nothing new. If you want to try out the bellows patch before I make a new release, it would be quite helpful. I'm not too sure how you'd do it with a Docker install, maybe you can directly apply it within the container?
I guess it's a good thing that my Zigbee network is only 4 devices or I'd be a sad panda. It will be great to get the FP1s up and working. When ZHA is up. they are magic.
Yeah they are
Thinking about how to test this. I can remap specific directories in the docker container. If I remap /usr/local/lib/python3.10/site-packages/bellows to a directory containing your PR, will that work? I also noticed there is a compiled version of bellows at /usr/local/bin/bellows. I assume that is a utility not used by HA.
hmm, that might have actually worked
"bellows": "0.33.0.dev0",```
Still dying out with both the patched bellows and the longer backup cycle recommended in the bug.
Far, far less often but not stable yet
Hmmm Can you get a debug log of it not reconnecting? I've looked at about every corner case I can think of and can't see how it can still fail.
My zigbee crashed again with 2022.8.2
I have debug log enabled
NCP entered failed state. Requesting APP controller restart followed by Couldn't start application
2022.8.3 was released a few days ago with potential fixes, a debug log from 2022.8.2 likely won't contain anything useful
Uhhh I went to 2022.8.2 because I thought there was a fix
There was. And there's one in 2022.8.3 as well, and in 2022.8.4 whenever it will get released.
Ah, never mind, nothing is in 2022.8.3. If you want to post/PM that debug log l can take a look. Did you also add the config from https://github.com/home-assistant/core/issues/75292#issuecomment-1210982416 as well?
I just updated, would the log still be there and what exactly am I looking for?
And no I haven't added anything to my config but I updated on Sunday right after you said to give it a try so crashing after 4 days doesn't seem like extra activity is causing issues
If it breaks again and doesn't recover, that'd be it. I don't think it should with the correct network backup interval config from the linked comment.
How, unless I disable backup? For me the issue is not that it crashes, lost connection, or w/e. For me the issue is that this happens and even hours later is not recovered.
That's what I'm trying to address. If it doesn't re-connect, that's a bug. I believe it's only partially addressed in 2022.8.2, what I expect to be the final one will be in 2022.8.4 (or earlier if you want to try manually pinning bellows==0.32.0).
I have a log if you want it
Sure, PM it
I've enabled debug logging. Other than the diagnostic download that showed I'm running bellows 0.33.0.dev0, is there a way to make sure?
How did you install the dev version?
Using docker-compose, I mapped /usr/local/lib/python3.10/site-packages/bellows to a host directory containing your PR
volumes:
- "/home/jshank/bellows/bellows/:/usr/local/lib/python3.10/site-packages/bellows"```
2022.8.4 was just released about a minute ago, that will reference the correct packages if you want to be 100% sure
But I believe mapping it should also work
cool. I'll take out the remap AND comment out the backup_period: 86400
once the update hits
It’s out
It crashed overnight and I caught it with full logging
I sent the logs to @smoky tinsel and @uneven urchin
what is interesting about this is that I still see updates to [homeassistant.components.zha.core.device] [0x1B55](lumi.sensor_wleak.aq1): Update device availability - device available: True - new availability: True - changed: False in the logs. It just loses both FP1s after the crash and never recovers.
Ok, maybe this is something else. It looks like ZHA is actually staying up. I'm just losing both FP1s. 2022-08-13 06:45:26.861 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L]: entity_id=sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi, old_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=255; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-12T15:55:45.503465-07:00>, new_state=<state sensor.aqara_lumi_motion_ac01_d27d5100_basic_lqi=unavailable; state_class=measurement, device_class=signal_strength, friendly_name=Office Occupancy Lqi @ 2022-08-13T06:45:26.861854-07:00>> 2022-08-13 06:46:08.301 DEBUG (MainThread) [homeassistant.components.zha.core.device] [0x08C5](lumi.motion.ac01): last_seen is 7610.337707281113 seconds ago and ping attempts have been exhausted, marking the device unavailable
This may be a lower level problem than I originally thought. Home Assistant switched from Python 3.9 to 3.10 in 2022.7.0, if I remember correctly, and I'm getting strange errors during testing with a makeshift Ethernet coordinator.
Can you try disabling the bellows UART thread? https://github.com/zigpy/bellows/blob/cb11d6cf9175b3a924a0c89ce57c93e4a54dd63c/bellows/uart.py#L357 . Change the True to a False and see if this changes anything.
ZHA won't be super stable if you disable the separate thread and if your HA Core has a lot of running integrations (hence the separate thread for bellows) but I'm unable to get bellows to misbehave if I disable the thread.
sure
so go back to pointing bellows to the external version and then change line 357 to =False?
Sure. If you can, point it to the Git repo head, not my feature branch.
@short stirrup have you checked to make sure your coordinator is using hashed link keys?
I’d assume you are if the FP1’s stay connected for any amount of time
not sure how to check that. It's been a few months since I setup @subtle basalt coordinator
You can tell by looking at the data in a backup
just added it through the integrations section and put this in configuration.yaml zha: zigpy_config: source_routing: true ezsp_config: CONFIG_APS_UNICAST_MESSAGE_COUNT: 30 CONFIG_MAX_END_DEVICE_CHILDREN: 64 CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200 CONFIG_ROUTE_TABLE_SIZE: 16 CONFIG_ADDRESS_TABLE_SIZE: 32 CONFIG_PACKET_BUFFER_COUNT: 254 CONFIG_BINDING_TABLE_SIZE: 32 CONFIG_NEIGHBOR_TABLE_SIZE: 26
Of the coordinator
alright, back on "versions": { "bellows": "0.33.0.dev0",
sorry to be dumb @uneven urchin but where do I find the backup to check?
You’d have to take one
With bellows or Zigpy cli
@smoky tinsel you have the docs for that handy?
It's also in the zigbee.db file
I think the hashed link key is saved in the backup in a stack specific key
bellows -d socket://ip:port info should show it too
It will also be in the startup debug log, NetworkInfo(...
Network info: NetworkInfo(extended_pan_id=redactifiedcb:86:dc, pan_id=0x5359, nwk_update_id=0, nwk_manager_id=0x0000, channel=15, channel_mask=<Channels.ALL_CHANNELS: 134215680>, security_level=5, network_key=Key(key=redactified, tx_counter=9814017, rx_counter=0, seq=0, partner_ieee=ff:ff:ff:ff:ff:ff:ff:ff), tc_link_key=Key(key=redactified, tx_counter=24576, rx_counter=0, seq=0, partner_ieee=04:cd:15:ff:fe:35:b7:10), key_table=[], children=[], nwk_addresses={}, stack_specific={'ezsp': {'hashed_tclk': 'redactified'}}, metadata={'ezsp': {'manufacturer': 'tubesZB', 'board': 'Series1 Pro V2', 'version': '6.10.3.0 build 297', 'stack_version': 8, 'can_write_custom_eui64': True}}, source='bellows@0.33.0.dev0')
hashed_tclk?
Restoring NCP
(venv) ➜ home-assistant git:(dm/zha-coordinator-always-available) ✗ bellows -d /dev/cu.GoControl_zigbee$'\r' info[00:0d:6f:00:0a:ff:74:55]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=d8:1a:3b:68:86:49:fe:23, panId=0xcc6e, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
[<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.TRUST_CENTER_USES_HASHED_LINK_KEY|64|32|HAVE_TRUST_CENTER_LINK_KEY|GLOBAL_LINK_KEY: 244>, trustCenterLongAddress=00:0d:6f:00:0a:ff:74:55)]
Manufacturer: HubZ ZigBee
Board name: HUSBZB-1
EmberZNet version: 6.7.8.0 build 373
```
As an example
EmberCurrentSecurityState
[04:cd:15:ff:fe:35:b7:10]
[0x0000]
[<EmberNetworkStatus.JOINED_NETWORK: 2>]
[<EmberStatus.SUCCESS: 0>, <EmberNodeType.COORDINATOR: 1>, EmberNetworkParameters(extendedPanId=a2:16:f9:a7:c4:cb:86:dc, panId=0x5359, radioTxPower=8, radioChannel=15, joinMethod=<EmberJoinMethod.USE_MAC_ASSOCIATION: 0>, nwkManagerId=0x0000, nwkUpdateId=0, channels=<Channels.ALL_CHANNELS: 134215680>)]
Exception in callback SerialTransport._read_ready()
handle: <Handle SerialTransport._read_ready()>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 120, in _read_ready
self._protocol.data_received(data)
File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 73, in data_received
self.frame_received(frame)
File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 100, in frame_received
self.data_frame_received(data)
File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 120, in data_frame_received
self._handle_ack(data[0])
File "/usr/local/lib/python3.10/site-packages/bellows/uart.py", line 245, in _handle_ack
pending[1].set_result(True)
asyncio.exceptions.InvalidStateError: invalid state
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError```
probably because another copy is already talking to it (Home Assistant)
some weird behavior. The two battery operated zigbee devices lumi.sensor_wleak.aq1 and lumi.sensor_magnet.aq2 won't respond or become available. The two FP1s lumi.motion.ac01 seem to be working just fine.
I've restarted the coordinator and home assistant. Going to back out the custom bellows and see if anything changes...
Both battery devices work with the non-dev version of bellows. Reinstating dev and checking one more time for sanity sake...
well, scratch that. They work for both now. Weird
Battery devices generally can’t be polled
Esp Xiaomi
if you do poll them with manage clusters it’s best to push the button on them right after issuing the command
Because they go to sleep to preserve battery
This will wake them up
Another crash after changing the use_thread = false
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 653, in _reset_controller_loop
await self._reset_controller()
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 674, in _reset_controller
await self.startup()
File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 66, in startup
await self.connect()
File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 134, in connect
File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 90, in initialize
await ezsp._protocol.initialize(zigpy_config)
File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 76, in initialize
await self._cfg(self.types.EzspConfigId[config], value)
File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/protocol.py", line 35, in _cfg
(status,) = await self.setConfigurationValue(config_id, value)
File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError[0m
[31m2022-08-13 16:42:58.785 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
self._serial.close()
File "/usr/local/lib/python3.10/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
time.sleep(0.3)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 180, in protected_loop_func
check_loop(func, strict=strict)
File "/usr/src/homeassistant/homeassistant/util/async_.py", line 141, in check_loop
raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue[0m```
What's weird is that it take a while for the devices to finally show unavailable. the controller crashed at 16:42 but the FP1 didn't actually go unavailable until 18:43, a bit over two hours..
2 hours for mains devices and 6 for battery IIRC
it’s not uncommon for many zigbee devices to sleep
Many Xiaomi devices only report 1x per hour if they aren’t tripped
all 3 integrations do some flavor of buffering unavailable like this
Hey @smoky tinsel, no rush but when you get a moment, let me know if there's anything else you'd like me to try. I'm happy to debug as long as you'd like. I think I'm going to install the Sonoff Zig 3.0 USB device and bring up a Z2M container for a bit to stabilize the FP1s while we debug.
Is anyone else experiencing this issue or am I the only one? I see that https://github.com/home-assistant/core/issues/75292 is still open and quite a few folks "participating"
Zha quietly crashed last night with just the one leak sensor on it. I just updated to 2022.8.5
Nothing new made it into 2022.8.5 but I believe I finally have your reconnect problem figured out. If you want to try something temporary, change https://github.com/zigpy/bellows/blob/cb11d6cf9175b3a924a0c89ce57c93e4a54dd63c/bellows/zigbee/application.py#L674 to await self.connect().
done jshank@shanknas:/mnt/media/docker-configs/hass$ docker exec -it hass bash bash-5.1# grep -n "await self.connect()" /usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py 674: await self.connect() bash-5.1#
zigbee debug shows "source": "bellows@0.33.0.dev0"
Crashed out a little less than an hour later. I don't see device go unavailable until ~6-hours later when the leak sensor fails to check in and is marked unavailable. Sending logs to @smoky tinsel
I've backed out the modified version of bellows and disabled debugging. Let me know if you need any other testing done.
I'm going to be receiving both the ZBBridge and the same type of coordinator that you have from Tube so I think we can hold off on the remote debugging until I can replicate things at home. I really appreciate your help though, there were quite a few reconnect bugs that have been fixed so far (just not the one you're experiencing 😆)
Awesome, thanks for your work @smoky tinsel. I'd be interested to know if it's ZHA -> EFR32 or something to do with my Docker/Linux setup.