I have an idea for how OpenAI could evolve its current 2D image generators into full animation systems, and into models that understand and maintain 3D environments: treat each image as a set of layered spatial fields rather than a single flat output.
The core idea is to generate each frame as a stack of image layers, where each layer corresponds to a specific object and is tied to a position in a shared 3D coordinate grid. For example, if I want an animation of a person walking from a third-person perspective, there would be a virtual camera placed behind the character, and one dedicated layer would represent the character alone. That layer would handle all of the character’s motion along the x, y, and z axes relative to the camera, so the animation comes from continuously updating that one layer rather than regenerating the entire scene.
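To make the layer stack concrete, here is a minimal sketch in Python. It assumes each layer only needs a name and a camera-relative position, and that layers are composited back to front by depth. The names `Layer`, `draw_order`, and `walk_frames` are made up for illustration and do not come from any existing system, and the actual image content of each layer is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    """One object's image layer, positioned relative to the camera (hypothetical)."""
    name: str
    x: float  # right of the camera
    y: float  # above the camera
    z: float  # distance in front of the camera


def draw_order(layers):
    """Back-to-front compositing order: the farthest layer is drawn first."""
    return [layer.name for layer in sorted(layers, key=lambda l: l.z, reverse=True)]


def walk_frames(character, background, steps, step_dz):
    """Animate by moving only the character layer; background layers are reused as-is."""
    frames = []
    for _ in range(steps):
        frames.append(draw_order(background + [character]))
        # Only the character's position changes between frames.
        character = replace(character, z=character.z + step_dz)
    return frames


if __name__ == "__main__":
    person = Layer("character", 0.0, 0.0, 2.0)
    tree = Layer("tree", 1.0, 0.0, 5.0)
    for order in walk_frames(person, [tree], steps=3, step_dz=2.0):
        print(order)
```

In this sketch the tree layer is never regenerated. The character starts in front of the tree and, once its depth exceeds the tree's, the draw order flips so the tree is composited on top of it.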
Every other object in the environment would also have its own layer, each one rendered relative to the same camera-centered axes. When a new object appears for the first time, the system would generate its image layer and permanently anchor it to a location on a global spatial grid shared by all layers. From that point on, the object stays fixed at that location, and the system only adjusts how it looks by transforming the layer (shifting, scaling, and changing its perspective) according to the camera’s movement. As the viewer moves through the space and discovers new objects, these new layers are added to the grid, while previously seen objects remain exactly where they were placed.
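The anchoring step could look something like the sketch below. It is only an illustration under some assumptions of mine: a simple pinhole camera that can move and turn left or right (yaw), world axes with x to the right, y up, and z forward, and a layer transform reduced to a screen offset plus a scale factor. `World`, `Camera`, `project`, and `render` are hypothetical names, not part of any real generator.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Camera:
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # radians; 0 looks along +z, positive turns toward +x
    focal: float = 1.0  # assumed pinhole focal length


@dataclass
class World:
    """Global grid: object name -> fixed (x, y, z) anchor (hypothetical)."""
    anchors: dict = field(default_factory=dict)

    def discover(self, name, position):
        # The first sighting fixes the anchor; later sightings never move it.
        return self.anchors.setdefault(name, position)


def project(camera, point):
    """Return (screen_x, screen_y, scale) for a world point, or None if behind the camera."""
    dx, dy, dz = point[0] - camera.x, point[1] - camera.y, point[2] - camera.z
    cos_t, sin_t = math.cos(camera.yaw), math.sin(camera.yaw)
    x_cam = dx * cos_t - dz * sin_t  # offset along the camera's right axis
    z_cam = dx * sin_t + dz * cos_t  # depth along the camera's forward axis
    if z_cam <= 0:
        return None
    scale = camera.focal / z_cam     # nearer objects are drawn larger
    return (x_cam * scale, dy * scale, scale)


def render(world, camera):
    """Per-layer transforms for every anchored object currently in front of the camera."""
    placed = {}
    for name, anchor in world.anchors.items():
        transform = project(camera, anchor)
        if transform is not None:
            placed[name] = transform
    return placed
```

Moving the camera only changes the output of `render`; the entries in `world.anchors` stay untouched, which is the persistence property described above. An object that falls behind the camera simply drops out of the rendered set and reappears at the same anchor when the camera turns back.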
This transforms a 2D generator into a persistent world-builder: animations become stable camera-driven transformations of anchored layers, and environments become coherent 3D-like spaces that grow as the viewer explores them.