[to maybe extract some brain thinking from #🪣garbage-bin to someplace more discoverable]
My first assumption is that it makes sense to convert all your inputs into a single color space when training. I'd pick the one with the widest gamut. If that doesn't match the target platform for what you're generating, you can use existing color space conversion tools to handle that, instead of making it the model's problem.
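Rough sketch of the "one working space, convert at the edges" idea, assuming the inputs are sRGB and using CIE XYZ (D65) as the common space, since it covers any display gamut. The choice of XYZ and all the function names here are my own assumptions for illustration; a real pipeline would more likely lean on ICC profiles or a color management library.

```python
# Assumption: inputs are sRGB floats in [0, 1]; working space is CIE XYZ (D65).
# Matrices are the commonly published sRGB/D65 values.
SRGB_TO_XYZ = (
    (0.4124564, 0.3575761, 0.1804375),
    (0.2126729, 0.7151522, 0.0721750),
    (0.0193339, 0.1191920, 0.9503041),
)
XYZ_TO_SRGB = (
    (3.2404542, -1.5371385, -0.4985314),
    (-0.9692660, 1.8760108, 0.0415560),
    (0.0556434, -0.2040259, 1.0572252),
)


def srgb_decode(v):
    """Undo the sRGB transfer curve (encoded value -> linear light)."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def srgb_encode(v):
    """Apply the sRGB transfer curve (linear light -> encoded value)."""
    return v * 12.92 if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def srgb_to_xyz(rgb):
    """Training-side step: bring an sRGB pixel into the working space."""
    return mat_vec(SRGB_TO_XYZ, tuple(srgb_decode(c) for c in rgb))


def xyz_to_srgb(xyz):
    """Output-side step: map a working-space pixel to sRGB, clamping to [0, 1]."""
    linear = mat_vec(XYZ_TO_SRGB, xyz)
    return tuple(srgb_encode(min(1.0, max(0.0, c))) for c in linear)
```

The clamp in `xyz_to_srgb` is the crude part: anything the target can't show just gets clipped, which is exactly the kind of thing a proper conversion tool handles better.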
But on second thought, in the same way a wardrobe designer or makeup artist might say "don't wear that, it'll look shitty on stage / print / screen", you might want to be able to condition the model to never generate out-of-gamut stuff in the first place, and have it be versatile enough to do that for different gamuts.
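To make "out-of-gamut" concrete, here's a minimal sketch of a per-pixel check that could in principle feed a training penalty or a dataset filter. It assumes colors are held in CIE XYZ (D65) and uses sRGB as the example target; `out_of_gamut_amount` and `in_gamut` are made-up names, and a different target gamut would need its own XYZ-to-RGB matrix passed in (none are included here).

```python
# Assumption: colors are CIE XYZ (D65); the default target gamut is sRGB.
XYZ_TO_LINEAR_SRGB = (
    (3.2404542, -1.5371385, -0.4985314),
    (-0.9692660, 1.8760108, 0.0415560),
    (0.0556434, -0.2040259, 1.0572252),
)


def xyz_to_linear_rgb(xyz, xyz_to_rgb=XYZ_TO_LINEAR_SRGB):
    """Project an XYZ color onto the target's linear RGB primaries."""
    return tuple(sum(xyz_to_rgb[r][c] * xyz[c] for c in range(3)) for r in range(3))


def out_of_gamut_amount(xyz, xyz_to_rgb=XYZ_TO_LINEAR_SRGB):
    """Total amount by which the linear RGB channels fall outside [0, 1].

    Zero means the target can represent the color; larger means further out.
    """
    linear = xyz_to_linear_rgb(xyz, xyz_to_rgb)
    return sum(max(0.0, -c) + max(0.0, c - 1.0) for c in linear)


def in_gamut(xyz, xyz_to_rgb=XYZ_TO_LINEAR_SRGB, tol=1e-6):
    """True if the target gamut can represent this color, within a tolerance."""
    return out_of_gamut_amount(xyz, xyz_to_rgb) <= tol


# A mid grey fits in sRGB; a very saturated green does not.
grey_ok = in_gamut((0.2, 0.2, 0.2))
green_ok = in_gamut((0.2, 0.7, 0.05))
```

Summing something like `out_of_gamut_amount` over an image gives a rough "how far out is this" score per target gamut, which is one way the conditioning signal could be built.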
Sounds theoretically doable -- maybe even with an adapter on top of a naive foundation model -- but kind of a pain. That would be an interesting training set to see!