#Multiplayer doesn't work if level loads too fast

stoic nebula
#

Hi!

I think I found a bug in the multiplayer gem (or aznetworking?) but not sure if I'm just doing something wrong. I have two scenes, one works, one doesn't. Here's what happens:

  1. Load broken scene and host server on it
  2. Client connects to server
  3. Server sees the client connect and spawns a player prefab correctly
  4. Client connects and loads the scene, but does NOT spawn any player prefabs
  5. If I continue spawning clients on the same machine, eventually one of them might work

I think the issue is timing related. The broken scene is tiny, and loads instantly. The good scene has some more assets, including a big exr file for lighting, and thus takes longer to load. In fact, adding more stuff to the broken scene to artificially inflate load time fixes the issue!

For reference, this is a source build on Linux (Fedora 41) using the latest commit on the stabilization/25100 branch (I'm using this branch because the latest release has a build issue). The levels both have a SimplePlayerSpawnerComponent, and the spawn point entities have NetBindComponents attached.

EDIT: exact engine commit is afd0ae1e3d478d4b6134c281498ac460ca5fa05d

real nymph
#

the obvious question would be: what are the differences between the two scenes? Having one that always works and one that sometimes doesn't is a great chance to enumerate the differences in broad strokes and narrow it down. It sounds like it might JUST be the size difference?

stoic nebula
#

I just recorded a video that shows the issue on a brand new scene with just the minimum for multiplayer and a basic light. It doesn't work until I drag in a second level (which is just a template scene with a couple of fbx assets dragged in, no custom components/scripts/etc)

Since this is a source build, testing different engine versions is a pain, but I'll try testing on Windows at least.

stoic nebula
#

I figured it out! It turns out the issue was ccache, which I set up to improve engine rebuild times. I guess something went horribly wrong in the cache and manifested as this behavior. I noticed that the issue only happened in profile builds, while release builds on the same machine, as well as builds on my CI machines, worked. Clearing the cache and rebuilding fixed it.

So yeah, let this be a cautionary tale for anyone thinking of using ccache... 😭

real nymph
#

That's a super weird error for something like a bad cache to expose

stoic nebula
#

It seems the issue has returned, so still no idea what's going on. So far I haven't seen it happen in release builds, only profile builds. I'm going to see if I can get the test suite running to see if that reveals anything. I'm running this in Fedora 41 (in a Toolbx container on a Bazzite host), but official docs suggest Ubuntu is the only distro explicitly supported. If that's the problem, I can try setting the engine up inside an Ubuntu container instead, and cross my fingers.

Other issues I've noticed: sometimes, network players all receive an id of -1 when connecting, but weirdly do replicate some stuff like animation flags and rotation, but not translation. I've only seen that happen in the UnifiedLauncher. Then in the GameLauncher/ServerLauncher use case, one player gets permanently stuck unable to move, with the server constantly reporting desync of the transform while other players work fine... Due to the weird stuff going on, I'm not sure if this is a bug in my code or not 🙁

I also reverted to the 2505.1 tagged release, but that didn't resolve the problem.

real nymph
#

ok so it's something that's been around a while. Possibly easily reproducible. I assume it's nothing like all assets needing compilation, they're all already compiled

#

is there a good set of steps I can get to just make it happen each time?

stoic nebula
#

I'm not sure about reproducing it reliably. It seems like sometimes it gets temporarily fixed when I do a clean build, but it returns after some unknown conditions are met. It does seem to be somehow related to the size of the level/the load times, as I've created a new set of levels from scratch to test and see the same behavior in the video above (so it's not just broken level files).

I guess if you're going to try to reproduce, just create as tiny a level as possible, preferably with no assets that need to be loaded, then launch the client/server launchers (don't run within the editor).

I ran the engine tests for 2505.1 in the engine build folder (not project folder) in both a Fedora and Ubuntu toolbox, and I got a couple of failures that I tracked down to missing system dependencies (I guess the list of Linux deps in the docs is incomplete). After fixing those, the remaining failures were in the project export script and in the benchmark tests, but idk if that's related. There doesn't seem to be any difference in test failures between the Fedora and Ubuntu runs.

So right now I'm doing a full clean rebuild of the project in the Ubuntu image, to see if that fixes the issue (doubt it). If not, I'll try making a brand new multiplayer project and try to reproduce it there...

real nymph
#

for the "as tiny a level as possible" part, I assume it at least needs to have a multiplayer level component, and then player spawners? So that people are at least assigned a player id

stoic nebula
#

Yes, like in the video above where I go through all the entities. At minimum it's a SimpleNetworkPlayerSpawner, and two spawn points with NetworkBinding components. The rest is just a whitebox for the floor, a directional light, and a camera.

#

Alright, just finished building in Ubuntu and it exhibits the same bug

stoic nebula
#

I just created a brand new project using the Multiplayer template, built it using the profile config, and tried running it locally. Immediately, I encountered a different bug. Idk if it's related, but I recorded it just in case, and may open an issue for it later.

But I tried to reproduce the original issue in a new test level, and so far it's not happening! I'm going to keep experimenting, but this gives me hope that it's some issue with my project rather than the engine.

real nymph
#

that's fascinating

stoic nebula
#

I discovered a possible way to reproduce my original issue:

  1. In the GameLauncher, connect to the server with connect
  2. After connecting successfully, call disconnect
  3. After disconnect, call connect again

The result at step 3 is the exact behavior I'm observing in my project: the client connects to the server but spawns no network entities, yet all other connected clients and the server DO spawn an entity for that client and report it as being connected. This reproduces regardless of the size of the level, but maybe the second connect call happens while the level is already in memory, so it could be a loading/timing issue after all?

real nymph
#

It's that or, possibly, a state clearance issue

#

like, when the player leaves the server and then comes back, something isn't cleared (from server... or from client)

#

maybe it's not fully clearing whatever state it is supposed to clear when a player leaves

#

and so when they rejoin it's dirty?

#

or it thinks it's already done something for that player slot so doesn't do it again, when it should
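
To make that guess concrete, here is a toy Python model of the "already done, so skipped" failure mode. It is not O3DE code: ToyClient, spawn_done and reconnect_result are made-up names for illustration, and nothing here says where the stale state actually lives in the engine (if it exists at all).

```python
class ToyClient:
    """Hypothetical client; none of these names come from O3DE."""

    def __init__(self, clear_state_on_disconnect):
        self.clear_state_on_disconnect = clear_state_on_disconnect
        self.spawn_done = False  # "I already spawned entities for this session"
        self.entities = []

    def connect(self, server_entities):
        if self.spawn_done:
            # Stale flag from the previous session: the spawn step is skipped.
            return
        self.entities = list(server_entities)
        self.spawn_done = True

    def disconnect(self):
        # Entities always go away on disconnect...
        self.entities = []
        # ...but the bookkeeping flag is only reset if we remember to do it.
        if self.clear_state_on_disconnect:
            self.spawn_done = False


def reconnect_result(clear_state_on_disconnect):
    """Run connect, disconnect, connect and return what the client ends up with."""
    client = ToyClient(clear_state_on_disconnect)
    client.connect(["player"])
    client.disconnect()
    client.connect(["player"])
    return client.entities
```

With the flag left dirty, the second connect "succeeds" but the client holds no entities; with the flag cleared, the reconnect behaves like the first connect.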

stoic nebula
#

Maybe. I just repeated this same exact test on a Windows machine which has O3DE 24.09 installed (from the SDK installer), and both bugs are present there too: the shooting desync and the reconnect issue. So whatever it is, it seems to not be anything introduced into the engine recently.

#

I'm going to prepare issue reports for this and hope I'm not just going crazy. Maybe I'll try running this on a laptop in the woods, in case there's a cosmic ray anomaly in my home 😛

real nymph
#

no, I doubt you're going crazy, but the fact that it happens every time if you disconnect and reconnect is a clue

stoic nebula
#

I think I figured it out! I found an older issue (https://github.com/o3de/o3de/issues/8862) that seemed similar to mine, except they call unloadlevel before calling connect again. In the first video I posted above, I have the client configured with a 'connect menu' which just executes connect when you click the button. Modifying that to execute unloadlevel first makes it work every time.
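
Roughly, the button handler change looks like the Python sketch below. The only things taken from the thread are the console command names unloadlevel and connect; connect_button_commands, press_connect and the run_command callback are hypothetical stand-ins for however a project actually executes console commands.

```python
def connect_button_commands(unload_first=True):
    """Return the console commands the connect button should issue, in order."""
    commands = []
    if unload_first:
        # Workaround: drop the currently loaded level before (re)connecting.
        commands.append("unloadlevel")
    commands.append("connect")
    return commands


def press_connect(run_command, unload_first=True):
    """Feed each command to run_command, a hypothetical console-execution callback."""
    for command in connect_button_commands(unload_first):
        run_command(command)
```

For a dry run, pass print (or a list's append method) as run_command to see the sequence without touching the engine.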

Idk if that's a bug or just an undocumented quirk. The correlation with level load time is weird, but I guess it falls into the undefined behavior category.

So I can finally continue working on my game, but now that shooting bug is the next thing to figure out because I'm planning on using the same implementation...

real nymph
#

I feel like if people struggle with it, its probably a bug