#Excessive Logging 'Data received on an unknown session. Dropping it!'
1 messages · Page 1 of 1 (latest)
Hi @manic moth and @nimble ridge,
regarding CHIP_ERROR Data received on an unknown session. Dropping it
Yesterday evening I restarted my Matter Server addon. Round about 5 minutes later my 55 Matter over Thread devices were available in Home Assistant. Another 5 minutes later the dropping messages began.
- Here is a my log file. I see this dropping logs never ending: https://dpaste.org/MnkNh
- Here is an overview of all my devices (brand/model) and their node ids (sorted by node id): https://dpaste.org/Ri5kU
As already said. I see these messages since 12 May. First thought was the Apple tvOS/HomePodOS 17.5 update. But that was launched on 13 May. In my Home Assistant backups I see the following:
- addon_core_matter_server_5.5.1 - 3 Mai 2024
- core_2024.5.2 - 11 Mai 2024
- core 2024.5.4 and addon_core_matter_server_6.0.0 - 20 Mai 2024
So, also no direct indication that a HA Core or a Matter server update introduced that logs. My Matter system works really stable. So, its not a major issue for me. But it makes it hard to read the logs.
Any idea about this?
Just checked, I don't have any such logs in my logs, and I have a couple of Eve devices in my production systems (2x Motion sensor, 1x Blind, 2x Energy Plug). It sounds a bit like a currupted/incomplete message got received? 🤔 Or maybe deleyed received? 🤔
Given the Apple update is closeset in date, maybe it is still that? While it was launched on May 13, maybe rollout started a bit earlier? 🤔
The error comes from this line of code:
https://github.com/project-chip/connectedhomeip/blob/v1.3.0.0/src/transport/SessionManager.cpp#L722
It seems that the server just can't find a session for that device at that point in time.
What I would try here is enabling more logs. You should find more information about the session in the logs when searching from the error upwards for the session ID (LSID). Maybe it got closed? But why...? Now in a production setup of your size this is really hard if you enable SDK debug log, things get completely overwhelming. What I would suggest to do is to isolate one of the problematic devices on a separate HA test installation, and then enable full logs there. Thanks to Matter multi-admin this should be fairly straight forward 😎
Thanks for checking. No, the latest Apple Release was on Monday (Mai 13). It’s typically on Tuesday. 😉
What I don’t understand is why only battery powered devices show this behavior? Where is the difference between a battery and a mains powered device in regards to the session?
Thanks for the suggestion. When I find the time, I will setup a vm for testing purposes.
Thanks again, I will keep you updated.
I see the same error messages since re-adding my Eve Energy Plugs yesterday evening. I'm on the latest beta and it started after commisioning the 5th or 6th plug out of 8.
I had removed the plugs for the last days because of massive thread network instability with my Apple TV's 4K (2nd gen) after tvOS 17.5
Node 33 - 41 are Eve Motionblinds
I've removed the last four Eve Energy Plugs (out of 8 total) a few minutes ago and the error messages are gone. 🧐
Maybe the polling of the energy attributes has anything to do with it.
In that case you maybe want to switch to HA beta 2024.6 and the latest matter server beta to see if that helps.
I'm on the latest Matter Server 6.1.0 BETA 1 and HA 2024.6.0b3
I've never seen the error messages on the stable beside the instability problem with my Apple TV's and tvOS 17.5(.1)
OK, in that case its probably not related as @alpine bobcat above is on stable.
In my case it seems that also only battery powered Eve Motionblinds are re-subscribing after these messages and maybe caused by the Eve Energy Plugs.
Yeah, I have a feeling the Eve Energy plugs are causing the issue. Did they receive an update recently ?
No update recently. They are still on 3.2.1.
My Eve Motionblinds are mostly stable on re-subscription. I only see it sometimes after closing or opening all 9 blinds together. I think it's because they all send position updates every second at the same time.
@nimble ridge Can I stop polling the energy attributes in the latest beta if I disable the relevant entities in HA?
No, and there is no (more) need to do so - we have optimized all that
Yes, I am using all the latest stable version.
Since the Nanoleaf firmware update 3.6.173 (Matter 1.2) I have the feeling that my Nanoleaf bulbs are more stable then my EVE Energies. But it's really not unstable. Everything works fine. Throughout the day I see 2 to 5 short disconnects, mostly from the EVE Energies. Nanoleaf bulbs have have maybe 1 to 3 short disconnects per week.
My setup is:
Thread Border Routers:
- 2 hardwired AppleTV 4K 3rd Gen
- 4 HomePod Minis
- 1 HomePod v2
Devices:
- 14 EVE Energies
- 9 EVE Thermos
- 7 EVE Door & Window
- 3 EVE Motion
- 1 EVE MotionBlinds
- 11 Nanoleaf A19-E27
- 10 Nanoleaf GU10
Yes, I see the 'dropping it' logs, but I do not see any issue in practice. However, I also have the feeling, that tvOS/HomePodOS 17.5.1 made my setup a bit more prone to the small disconnects. These disconnects vary from 5 seconds to 3-5 minutes. But the devices always come back by itself.
Shall I update to all the latest beta for testing purposes? @pine shoal has the 'dropping it' logs, even though he uses the latest beta of HA and Matter. I am unsure, when I find the time to setup a new HA instance to bind one or more of my battery powered devices to that Matter fabric and enabling more logs. This will take some time.
Most likely this is just something that is now being logged (and was hidden before) so if you dont experience any stability issues I wouldnt care too much, no rush. Still it would be interesting to see if you can reproduce it on an isolated install like Stefan suggested. But like I said, ro rush.
Maybe/hopefully we will be able to reproduce it as well at some point (until now I cant)
Which date did you the messages the very first time? You can grep through your logs via ssh with the following command.
ha host logs --identifier addon_core_matter_server -n 100000 | grep "Dropping it!"
Yesterday evening after re-adding my last 4 plugs. See the first log.
OK, I grep'ed through my logs and brought the data to an Excel pivottable. Here is the result:
2024-05-15 36
2024-05-19 1
2024-05-20 7
2024-05-21 1
2024-05-23 48
2024-05-24 11
2024-05-26 31
2024-05-27 1634
2024-05-28 9500
2024-05-29 10413
2024-05-30 10558
2024-05-31 5356
overall result 37596```
Crazy, wtf... whats going on? 😄
Sadly I do not get back to May 12, even though I brought the lines in the command above to 10000000.
I mean 17.5.1 not 17.5
Ah sorry, you mean 17.5.1. I meant 17.5.
Yeah, I don't know. Some days ago.
😉
WOW 10000 dropping logs a day.
My 'dropping it' logs started at May 12, so one day before tvOS 17.5 release. Maybe I misread it yesterday.
I had massive instability problems since tvOS 17.5 with my ATV's 4K 2nd gen so I switched the active home hub to a homepod mini. After Apple released 17.5.1 only for ATV's it became a mandatory active home hub. I changed to 3nd gen ATV's this week and my network is now stable back again. 🤷♂️
@nimble ridge Any idea how to get into the logfile, to see the first log line? I am firm with vi or nano, but I am unsure where to search for matter server logfile on HAOS.
@nimble ridge I'm trying to reproduce the error messages so I removed all of my Eve Energy Plugs from HA and started to re-add them. Now I get a Traceback in _custom_attributes_poller on every plug while commissioning them.
Ah, that is a small oversight (race condition) if you start removing nodes that are part as polled.
Will fix that soon. Just restart the server to get rid of the error now
Add-on logs are stored in systemd-journald on the host OS. It is automatically dropping the last logs. If you don't get more when increasing the number of lines, then you already get the oldest line stored 😢
Ok, thanks for the information. Have a nice weekend
After updating the Matter Server to 6.1.0 I've readded my Eve Energy Plugs to HA (nodes 109, 111, 112, 113, 114, 115, 116, 117). After adding the fifth plug (node 114) the messages "Data received on an unknown session. Dropping it!" started. I think it is not related to specific nodes because I removed nodes 112, 113, 116 & 117 again. After reducing back to 4 Eve Energies the messages stopped except the 4 lines at 17:12 o'clock.
Is that with a 2024.6.0bx Core version?
2024.6.0b8
@pine shoal You are able to reproduce the issue easily as it seems. Can you enable more logs how @manic moth suggested and post the logs?
In your case you maybe do not need to pair it to another fresh HA instance. How many devices do you have?
I still see a lot of droppings and it’s impossible to read my logfile
My setup is quite big with 63 Thread devices (including the 8 Eve Energy‘s). Without being able to filter the debug logging on devices I can confirm it’s almost impossible to read the logs. 😉
Atm I‘ve removed my Eve Energy’s from HA.
Ok, I have 61 devices. 😅
@manic moth Is it possible to hide, reduce or disable this log?
I mean, there are log levels. What log are you looking to remove?
The ’dropping it‘ logs:
2024-06-25 22:12:04.806 (Dummy-2) CHIP_ERROR [chip.native.IN] Data received on an unknown session (LSID=39581). Dropping it!
——-
Matter server log level: Info
Matter SDK log level: Error
What would be the correct settings to hide these logs?
Hm, I guess "Matter SDK Log Level" to None should make this go away. But error logs should be left enabled, just in case 😅
😃 Yeah, that’s not what I want. The question was if it is somehow possible for you to disable this log explicitly? I do not see any issues in practice. But I can’t read my logfile anymore, because of all the ’dropping it‘ messages. I do not have the time for any testing at the moment. Too many hobbies and it’s summer… 😃 But I really want get back to a readable logfile.
I mean, we could patch it out in the SDK source, but not sure if that is a good idea.
I really wonder what is going on here, why only Eve are affected, and what change caused it 🤔
Yeah, that is what I also don’t understand. The issue came up near Apple OS 17.5 release. Maybe this solves itself with iOS 17.6. But we have to wait for it.
I think we do not need to do it right now. Let’s wait until the next Apple OS release 17.6 and then decide, if it still occurs.
I've grepped through my logs, and I had that error once the last few days, but this was with a Nanoleaf bulb 🤔 Seems more coincidence
I do have two Eve Smart Plugs. But those are not connected through a Apple BR.
There is a 6.2.0b2 release, did you try it? I don't think it will make a difference for that particular issue though, the underlying SDK is largely the same.
Remember my pivot table:
No, I didn’t try it.
But I can do it.
Patching it out isn’t a good idea. Besides these messages my battery powered Eve MotionBlinds start to resubscibe. After removing the Eve Energy’s it goes back to normal.
This does sound like a network overload issue 🤔
I don’t think so. My devices are paired to Apple Home and HA and are working fine if I leave the Eve Energy’s only paired with Apple Home. I also tried HA as the only controller to all devices and got the same issues after pairing the Eve Energy’s. 🙁
Downloading python_matter_server-6.2.0b1-py3-none-any.whl.metadata
Ok, I also have 60 Matter over Thread devices and really do not recognize any issues, besides the dropping logs.
Is the CPU load high caused by the Matter Server?
If I remember correctly you said that your battery powered Eve devices also resubscribe after this messages? 🤔
Sounds like some type of lock contention or deadlock then, if not on the network layer
The dropping it logs are still there with the latest Matter server beta.
I think that CPU load was normal but I can check this on Thursday evening again.
Maybe it’s an issue on Apple side. My HomePod Minis caused instability issues with 17.5 after changing the AP. I‘m now on the 17.6 beta since Sunday and it seems to run better.
No need to do it: https://imgur.com/a/ssuwHNe cpu/ram looks fine on a HA Yellow.
I‘m also using a Yellow
Here is my latest logfile: https://dpaste.org/4KAxP
We have a very similiar setup: HA Yellow, Unifi network, Apple TBRs only, round about 60 MoT devices
HA Yellow Thread radio is disabled here.
Your log also shows resubscription messages from battery powered Eve devices.
I am looking which are re-subscribing, give me some minutes.
Everything starts here with node 16. Its an EVE Energy:
2024-06-25 23:18:04.415 (MainThread) INFO [matter_server.server.device_controller.mdns] Node 16 activity on MDNS, trigger resubscribe
2024-06-25 23:18:06.412 (MainThread) INFO [matter_server.server.device_controller.mdns] Node 16 activity on MDNS, trigger resubscribe
2024-06-25 23:18:08.413 (MainThread) INFO [matter_server.server.device_controller.mdns] Node 16 activity on MDNS, trigger resubscribe
2024-06-25 23:23:00.660 (Dummy-2) CHIP_ERROR [chip.native.IN] Data received on an unknown session (LSID=4729). Dropping it!
2024-06-25 23:23:02.135 (Dummy-2) CHIP_ERROR [chip.native.IN] Data received on an unknown session (LSID=4729). Dropping it!
2024-06-25 23:23:03.537 (Dummy-2) CHIP_ERROR [chip.native.IN] Data received on an unknown session (LSID=4729). Dropping it!
```
node id - brand model
158 - EVE Thermo
11 - EVE Motion
159 - EVE Thermo
165 - EVE Thermo
170 - EVE Thermo
169 - EVE Thermo
163 - EVE Thermo
160 - EVE Thermo
162 - EVE Thermo
220 - EVE Motion
158 - EVE Thermo
87 - EVE Motionsblinds
11 - EVE Motion
165 - EVE Thermo (second time)
237 - EVE Weather
157 - EVE Thermo
in chronological order to the log file
So yes, only battery powered devices produce this 'Dropping it' logs.
Same like me. My other battery powered devices (Aqara & tado) are not affected.
Interesting! I only have EVE battery powered Matter over Thread devices.
I think these Dropping logs are caused by the Eve Energy and this produces re-supscriptions for battery powered Eve devices.
Yeah, maybe, I have 14 EVE Energies paired to Apple Home and Home Assistant.
But you said, you removed all your EVE Energies from Apple Home and had them paired to HA only and the issue was still there. You had to unpair them from HA. Right?
Right. I would swear that your Dropping logs will also reduce and then stop if you reduce your paired Eve Energy’s in HA. For me it started after adding the fifth plug.
Apple Home also seems to be more responsive after removing the Eve Energy’s from HA completely. After opening Apple Home it normally refreshed every device in the past. Now it mostly opens without refreshing.
Very Interesting finding. 😄
I also see this happening. When I shutdown all my Apple TBRs, restart them and wait until the mesh procedure completes, opening Apple Home is really fast. But it slows down over time. It takes 1 or 2 seconds until everything is refreshed and available.
Maybe this should be my next test, removing all EVE Energies from HA. But what do we win, when I do this? 😄
Do we maybe have a firmware problem with the EVE Energies? They are still using Matter 1.0, right? Yes, I know it should be downwards compatible.
But why do you see these logs only with EVE battery powered devices, while aqara and tado do not show this behavior? They also use the EVE Energies as their Thread Routers.
Maybe we run into a CASE session limit here or something.
So it seems that EVE energy are really causing this. Can you try add this to a separate test HA instance? Just 1,2 (or more) Eve energies, and see if that shows the error too? Thanks to multi-admin it should be rather easy to commission it with an addtiional controller.
First this would be interesting while still connected to the productive HA. I would expect that the erros happen on the test HA instance too, as the CASE sessions are shared resources. If it doesn't happen on the test HA instance, then it is more likely a controller issue. If it does happen on the test intsance too, then we probably run into a device limit of some sort.
Maybe I was wrong about the EVE Energy's causing the problem. A few day's ago I removed 4 Nanoleaf GU10 bulbs for special testing from my HA instance. At this moment I was down to 41 active nodes in HA. Today I could pair all of my 8 EVE Energy's without any "Dropping it" errors into HA (back to 49 active nodes). After this I started to re-pair my 4 GU10 bulbs. Shortly after pairing the first bulb the error messages appeared again. I removed the bulb again and the error messages stopped. The same happens with pairing a Aqara P2 Door & Windows sensor. It seems that these error messages start to appear with 50 active nodes. Curiously only battery powered EVE devices re-subscribe after these "Dropping it" messages.
Interesting finding. At the moment I am also down to 52 devices, because of beta testing something. So it is easy to validate for me. I only need to remove three devices from my HA instance. Give me some minutes…
Maybe its too early to say that you found the problem, but my log is quiet, after removing 3 devices from HA. I removed these 3 nodes:
~ # ha host logs --identifier addon_core_matter_server -n 1000 | grep "successfully removed from Matter server"
2024-06-28 00:55:29.825 (MainThread) INFO [matter_server.server.device_controller] Node ID 14 successfully removed from Matter server.
2024-06-28 00:57:39.956 (MainThread) INFO [matter_server.server.device_controller] Node ID 19 successfully removed from Matter server.
2024-06-28 01:04:10.728 (MainThread) INFO [matter_server.server.device_controller] Node ID 17 successfully removed from Matter server.
Here is a snippet from my HA Matter server log from round about 10 minutes before I removed the first of the three nodes: https://dpaste.org/MpQzm
Now its 2024-06-28 01:26 here in Germany, the last log line is still from 2024-06-28 01:07:07.859.
@manic moth Are these operational discovery failed expected after removing a node id from HA? I first removed them in Apple Home from the HA service 4939, before I removed them from HA.
@pine shoal I can't believe it, I can read my logs again, already 20 minutes without a single log line. 😉 Great finding!
I hope this limitation can be set to at least 100 devices. I still have round about 20 bulbs lying around. 😄
Ok, great 1 hour later, still no new log line. Awesome! 👏 Now let’s hope that the limit is somehow configurable. 😉
Good night
Yeah if you remove our fabric entry via Apple Home then this is expected. We still try to find that device, but it doesn't publish an operational service anymore for our fabric/node id.
Actually, there is an event notifying when one removes us from the fabric:
2024-06-28 10:23:10.007 (MainThread) VERBOSE [matter_server.server.device_controller.node_215] Received node event: EventReadResult(Header=EventHeader(EndpointId=0, ClusterId=40, EventId=2, EventNumber=20, Priority=<EventPriority.INFO: 1>, Timestamp=6015131, TimestampType=<EventTimestampType.SYSTEM: 0>), Status=<Status.Success: 0>, Data=BasicInformation.Events.Leave(fabricIndex=10)) - transaction: <Subscription (Id=2154774328)>
So we can react on this , and remove the device on our end....
@alpine bobcat @pine shoal these are very interesting finds! I was thinking about that this problem could be related to a CASE session limit. Your findings very much support that suspcion
I thought it was a per device one, but this seems to point to a controller global limit somehow 🤔
#ifndef CHIP_CONFIG_CONTROLLER_MAX_ACTIVE_DEVICES
#define CHIP_CONFIG_CONTROLLER_MAX_ACTIVE_DEVICES 64
#endif
Soooo how many do you plan to add? 😎
How many devices can you make possible? 😂
I think in the long term I will run 100 maybe 150 Matter devices. What is this limit used for? Why is it configured to 50? Is there any performance reason? Let’s try it, with 100 in the next step. What do you think?
It would be interesting to know how many devices a controller "must" support. Does the Matter standard has a definition for it?
Well, I don’t think there is a limit (there is some discussion about that though) but it must support 100s of devices with ease
I'd say your memory is the limit. Or maybe int max😉
I go with 1024 for now.
Yeah, like Marcel, I am not aware of a limit either. The standard really doesn't talk a lot about controllers in general. 🤷♂️
That’s great. When do you think we will see this implemented? 😃
6.2.0 BETA 2 with this fixed is out now 🎉
Sadly, I am not at home today. Will check it out tomorrow. 😃 Thanks
After updating to 6.2.0b2 I still see „Dropping it“ messages. 🤔
Hm
There is another closely related config CHIP_CONFIG_CONTROLLER_MAX_ACTIVE_CASE_CLIENTS but from what I understand that is only relevant during session establishment
From what I can tell, the other config got successfully applied in the builds 🤔
Maybe we should wait about @alpine bobcat feedback. For me it’s the same behavior like before.
Yeah, its the same for me. I installed the new Matter server 6.20.b2 at 19:33, waited until all devices were connected again and shared one of my EVE Energies (node id 243) that are connected to Apple Home only at 19:43 with Home Assistant. The "Dropping it" logs came back immediately. Now I have 50 MoT devices again.
So, 64 max active devices were configured before the lift to 1024. But why is the border at 50 devices?
Yeah not sure, I can only assume that some of the devices occupy multiple CASE sessions 🤔
Hm…. One Moment… No new Dropping it messages, only these 7 log lines that are already in the log above. Still 50 MoT devices.
Ok, keep crossing my fingers then 🤞 😅
I am going to pair some 2 more Mot devices now. That was my situation before the last test with a logfile full of dropping it messages.
I will keep you updated.
No, with 51 MoT devices its already going crazy. A lot of dropping it logs. I can post a complete log, if you need it.
I‘ve seen the same. With 50 devices it starts with rare „Dropping it“ messages and if I pair more devices, messages also appear more often.
So, maybe we want to try this next.
Yep, I removed the 51. MoT device half an hour ago and the dropping it messages stopped. So I am back to 50 devices for now.
I still see some dropping messages after I un-paired/removed the 51. MoT device (node 244) at 22:11 o'clock. Now where I look through my logs, I see that one of my EVE Energies (node 22) got unavailable after un-pairing (node 244). This node is still paired to Apple Home and works as expected. How can that be? Any idea?
Here is my complete log: https://dpaste.org/bOzGj
Turns out the config did not get applied correctly 😢
But meanwhile, bug found and fixed 🎉
6.2.0 BETA 4 should resolve the problem.
This was my mistake. I deleted the home assistant service from the device, but forgot that I did that. Maybe I am getting old... 😄
Great! I am testing it now. After commissioning the 50th device (node 245), I already see some dropping messages: https://dpaste.org/5Oaie
Now I am going to pair the 51th device.
The log for node 246 looks the same. After the node commissioning completes I see four dropping it messages for the device. https://dpaste.org/Hntqy
And now it starts again: https://dpaste.org/kwzKd
@manic moth that was not the solution.
Yep, my log is full of dropping it now…
This doesn't really make sense 😥
Can you check if b4 really got installed on add-on startup?
Oh, I always posted the log from the last startup…
Downloading python_matter_server-6.2.0b4-py3-none-any.whl.metadata
Yeah that looks good
Did you change this setting, too?
Yes, it should be at 192 now.
So... just thinking out loud here... did you also try to add another matter device ?
anything else than the eve plug I mean
Hi @nimble ridge
@pine shoal tested this here:
Do you think it’s still worth a try? I can try that later. Now I have to bring my toddler to bed.
Actually, that was another setting which I've set to 192. The CHIP_CONFIG_CONTROLLER_MAX_ACTIVE_CASE_CLIENTS setting is still at 16.
Hmm,
#ifndef CHIP_CONFIG_MAX_FABRICS
#define CHIP_CONFIG_MAX_FABRICS 16
#endif // CHIP_CONFIG_MAX_FABRICS
#ifndef CHIP_CONFIG_SECURE_SESSION_POOL_SIZE
#define CHIP_CONFIG_SECURE_SESSION_POOL_SIZE (CHIP_CONFIG_MAX_FABRICS * 3 + 2)
#endif // CHIP_CONFIG_SECURE_SESSION_POOL_SIZE
That adds up to exactly 50, so it seems this is the limit we actually hit.
a bit strange to calculate the pool size on max fabrics ?
I guess for a device this makes sense.
For a controller it is a bit odd indeed. Especially that then there are only 3 secure sessions per fabric allocated 😅
16 very small networks 😅
yeah indeed
So we have a secure session pool of 16 * 3 + 2. Yeah, 16 very small networks. 😃
Can you change chip_config_max_fabrics to 64 for example?
So we get 64 very small networks, which probably nobody needs, but a usable safe session pool size… 😃
I've bumped CHIP_CONFIG_SECURE_SESSION_POOL_SIZE to 4096 directly. So with that we can have multiple fabrics with many many devices 😅
As soon as @alpine bobcat hits that limit we need to do an intervention that he needs to stop adding devices to his home network 🤣
🤣🤣🤣
@alpine bobcat @pine shoal 6.2.0 BETA 5 out with the secure session limit bumped to 4096. This time! 🤞
Downloading python_matter_server-6.2.0b5-py3-none-any.whl.metadata
I think we got it! 🎉 
Round about 40 minutes after the update to 6.2.0b5, I do not see a single "Dropping it!" log line with now 52 devices paired to Home Assistant. Here is my complete log: https://dpaste.org/tschE
Great, thank you! 😃👍
Awesome! Thanks for testing!
Yes, still no new "dropping it" lines. 😄
One last thing. Can you have a short look at my log file, please?
https://dpaste.org/bwO2J
Are these constant (re-)subscribing log lines an expected behavior? I know its all INFO, but I want to be sure. 😉
I also have some CHIP_ERRORs, mostly Time out! failed to receive report data from Exchange. Any idea, why this happens?
These CHIP_ERRORs are not new to me.
So these are triggered by the SDK:
2024-07-02 10:53:33.402 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 1824r with Node: <00000000000000D5, 1>
2024-07-02 10:53:33.404 (MainThread) INFO [matter_server.server.device_controller.node_213] Previous subscription failed with Error: 50, re-subscribing in 0 ms...
2024-07-02 10:53:36.420 (MainThread) INFO [root] Re-subscription succeeded!
They mean that the device failed to check-in in time. The reason could be that the liveness message got lost. RF/Ethernet networks do not guarnatee delivery, so this can happen every now and then under normal circumstances.
The MRP (Message Reliabiliy Protocol) should trigger up to 4 re-transmits. But it could be that temporary super busy 2.4GHz network takes them all out. Or maybe the device just crashed, but I think then it would take longer to re-subscribe. There was also some discussion if the liveness check-in is a bit too picky (as in, timeouts too early). I expect those to become fewer and fewer as the timings and timeouts of firmware and controllers are improved.
And then there are those:
2024-07-02 11:47:57.721 (MainThread) INFO [matter_server.server.device_controller.mdns] Node 176 activity on MDNS, trigger resubscribe
2024-07-02 11:48:14.665 (MainThread) INFO [matter_server.server.device_controller.mdns] Node 158 re-discovered on MDNS
These are from the Python Matter Server's zeroconf discovery. I would say your log has a bit too many of those 😅 But I'd need to look into why those get triggered that regularly.
OK, thanks again for the explanation. This really helps me to understand what's going on. As long as everything works as expected in practice, I am happy. No rush with the zeroconf discovery messages. Finally these "Dropping it" log lines have disappeared and I can read my log file again. Have a nice evening. 😃
I am a bit inclined to say that these logs are expected on a big network like you have.
Its a lot of traffic for a lot of devices over various transports, including thread. Connections can be lost from time to time or packets dropped. As long as it restores itself and doesnt hurt actual operation. We log every fart of the SDK operations in HA, while the ecosystems like Apple and Google hide that all away from you. So incidental connection resubscriptions are "part of the deal", you see those especially more often with SDK 1.0 devices. Keep an eye on it of things stay responsive or you see logging at super steady/often intervals, like the one about the "dropping it".
I see it exactly the same way and you keep telling me the same things again and again. 😅 We are the early adopters, so we have to live with certain bugs/issues at the moment, especially when you have so many Matter over Thread devices. I will continue to keep an eye on my thread network as a whole and keep an close eye on when the situation worsens or improves. I'm already looking forward to Apple OS 17.6 and 18, which may bring further improvements in addition to new device models. The next EVE/Nanoleaf firmware based on Matter 1.3 will also be interesting.
@nimble ridge I agree and disagree here at the same time.
There are different sources of messages with "resubscribe" in the logs: Some from the SDK, and some from the mDNS discovery.
Over the 3h of operation in the above logs there are a total of 2 from the SDK. This seems somewhat within normal operation normal to me too, I agree.
However, there are a total of 66 from the Matter Servers mdns implementation. That is a bit too many for my liking, tbh 😢 I wonder if we have a timing issue or something there 🤔
@alpine bobcat are the ones from the matter_server.server.device_controller.mdns logger continue to show? Can you share another log maybe?
If we see excessive logs or a pattern then we need to have a better look but in general mdns is very noisy.
If you think there is something to follow-up let's do that - to me this looks pretty normal (as its the same I see in test networks)
OK, here is another log since the last Matter server restart: https://dpaste.org/77nP2
I see lot of CHIP_ERRORs, but in this dashboard history card I do not see any big trouble: https://imgur.com/a/P97HbXU
There was one EVE Energy unavailability within the last 24 hours: https://imgur.com/a/aDsi9Tx and three Nanoeaf unavailabilities for less than 40 seconds. I every case it healed itself. In general I do not have to power cycle devices anymore. They come back online by themselves.
I have some more Matter over Thread devices. But some Nanoleaf bulbs are not paired to HA at the moment for beta testing reasons. And some other devices, like EVE Thermo and EVE Weather, don't give me such a historical line. They show the temperature or humidity for example. I don't know how to get such a historical line for these device types.
The change is also working for me. No more "Dropping it" messages! 🙂
@pine shoal Can you also post a complete from the last Matter server restart? I am interested to see, what’s going on in your log file. 😃
I can upload the logs later this week if you‘re interested. 😉 Today I’ve updated all my Home Hubs to 17.6 public beta 2 and actually waiting for things to settle down. So logs make no sense atm.
Ok, please keep me updated, how it works and if you see some improvements. 😉
I would be interested in a log file of about 24 hours of normal operation to compare the type and amount of CHIP_ERRORs. You have a similar setup to mine.
@manic moth By the way... Here is a log of my last 24 hours: https://dpaste.org/cNEpC
The log is much quieter since the latest updates. Nearly all CHIP_ERRORs are from Nanoleaf bulbs, only one is from an EVE Energy (node 16). Overall I am fine with the log. Everthing works as expected. Thanks
@manic moth Even though its off topic. I see a small issue when I ssh into my HA Yellow and look at the new colourful logs. I can't read INFO, WARNING, CHIP_ERROR without marking the log lines, because its black on black. Is this changeable?
About the coloring: That depends on your virtual terminal adapter. We use default settings from the Python coloredlogs setting 🤷♂️ If you change the background to something more greyish it should be fine.
That log looks fine for the amount of devices you have @alpine bobcat.
One task for us maybe is to lower the log level of "Activity on mDNS, trigger resubscribe" to debug
A couple thigns in the logs which are interesting: You have two instances where the Matter SDK reports a long dispatch time. On startup, that is normal as the system is busy. But during regular operation, that is not ideal. Not sure, was there maybe high CPU load in general? 🤔
2024-07-17 20:51:03.858 (Dummy-2) CHIP_ERROR [chip.native.DL] Long dispatch time: 363 ms, for event type 2
...
2024-07-18 00:28:04.889 (Dummy-2) CHIP_ERROR [chip.native.DL] Long dispatch time: 398 ms, for event type 2
It is not suuper long, so not that problematic, but still not ideal.
These Activity on mDNS, trigger resubscribe don't hurt much, as we just tell the SDK it should re-subscribe NOW, in case that node is currently waiting to retry re-subscribe...
2024-07-17 15:13:30.751 (MainThread) INFO [matter_server.server.device_controller.mdns] <Node:159> Activity on mDNS, trigger resubscribe
2024-07-17 15:13:32.758 (MainThread) INFO [matter_server.server.device_controller.mdns] <Node:159> Activity on mDNS, trigger resubscribe
All those nodes listed are probably not in that state, so it is somewhat wasteful. I think this stems from the fact that Thread BR update their mDNS entry every now and then. I am not sure why it doesn't happen regularly though 🤔
I can't say it, but I don't think so. Everytime I look at the cpu load, it is between 1 and 5%.
Here is another interesting log file: https://dpaste.org/j3beU
At this time stamp 2024-07-18 17:53:45.170 I started to update my APs where also my 5 HomePods are connected to. I was not at home until 2024-07-18 20:30. When I came home all devices were connected and everything worked as expected. But there was a lot of work in my Thread mesh, when I look at the logs. While the update of my APs the infrastructure links from the Thread mesh to the WiFi do not work for some minutes and the mesh has to rebuild and find new routes. It took over two hours until the log got quiet again. Not the ideal situation. What do think about this?
@pine shoal Any news about 17.6 beta or 18 beta? 😄
Do NOT install the tvOS/HomePodOS 18 public beta! 😉 I was on 17.6 public beta which was working quite well but my HomePods decided to update to 18 pb when it was released. Because of that I also updated my ATV's to tvOS 18 pb. After updating I saw lot of Matter timeouts in HA and it was not possible to run both Apple Home and HA as a controller at the same time. I could reduce the timeouts by removing Apple Home and leaving HA as the only controller. Still seeing a lot of timeouts compared to 17.6 but it works most of the time for now.
HomePodOS 17.6 public beta fixed the instability I've seen on 17.5 due to AP roaming. I also noticed that these annoying updating devices messages mostly disappeared after opening Apple Home app. It only updates individual devices that have changed their status.
Some days ago I also decided to reduce the traffic in my Thread mesh, by removing all 23 Nanoleaf bulbs from Apple Home. Now they are paired to Home Assistant and light groups of my Matter bulbs are bridged to HomeKit. That works fine for the moment.
Ok, thanks for the tip. I think I will stay away from those Apple betas. It’s summer time. I do not have much time, short before holidays and too much other hobbies… 😉
Yeah, I also recognized a lot of timeouts due to HomePod roaming. I thought the issue is Unifi.
Can I ask which Unifi equipment you have?
Yeah, I also see these update messages on some of my EVE devices. Great to hear that they found the issue. 😃👍
No Unifi at all. We‘re using a FRITZ!Box and a FRITZ!Repeater 6000.
Oh, ok… I thought, you are a Unifi user, too. Thanks for the enlightenment. 😃