I would like to share a limitation I repeatedly encountered while working on complex image compositions and suggest a possible improvement.
🔴 The problem
While generating images, I faced two related issues:
- Object counting
I needed a single image with exactly eight distinct photovoltaic terraforming vehicles.
Despite precise prompts and many iterations, the model often:
generated 7 instead of 8,
merged two vehicles,
or dropped one object.
This happened especially when objects shared a similar style and materials.
- Multi-panel layouts
I also requested an image made of five cubes, each showing three faces — 15 scenes total.
Even with numbered prompts and clear descriptions, the model struggled to keep panels unique and logically consistent, especially at small scale.
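To make the panel requirement concrete, here is a minimal Python sketch of the bookkeeping I mean: five cubes times three faces gives 15 slots, and each slot should get exactly one unique scene. The face names and the `build_panel_specs` helper are my own assumptions for illustration, not part of any existing tool.

```python
from itertools import product

CUBES = 5
FACES = ("top", "left", "right")  # assumed names for the three visible faces


def build_panel_specs(scenes):
    """Assign one scene description to each (cube, face) slot.

    Raises ValueError if the scene count does not match the slot count
    or if any scene description is repeated.
    """
    slots = list(product(range(1, CUBES + 1), FACES))
    if len(scenes) != len(slots):
        raise ValueError(f"expected {len(slots)} scenes, got {len(scenes)}")
    if len(set(scenes)) != len(scenes):
        raise ValueError("scene descriptions must be unique")
    return dict(zip(slots, scenes))


# Placeholder scene descriptions; real prompts would go here.
scenes = [f"scene {i}" for i in range(1, 16)]
specs = build_panel_specs(scenes)
```

A check like this is trivial outside the model, which is why it is frustrating that the generated image cannot be held to the same 15-slot plan.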
🧠 Why this feels like a thinking limitation
These issues are not caused by vague prompts. They appear when the model needs to plan first, not just render.
Current behavior often feels like:
“Render everything at once and hope it fits.”
But these cases require:
Plan → verify → render.
💡 Proposed improvement: “Deep Image Thinking”
- layout/composition planning before rendering
- hard constraints (e.g. “8/8 objects must be present”)
- panel-aware generation (comic/page mode)
- multi-pass rendering (layout → validation → final image)
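As a rough illustration of the layout → validation → final image idea, here is a small Python sketch. It is only a toy under my own assumptions: it does not describe how any image model works internally, the grid planner is deliberately simple, and `render_prompts` is a hypothetical stand-in that emits one region-scoped instruction per object instead of producing pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x: float
    y: float
    w: float
    h: float


def plan_grid(labels, canvas_w, canvas_h, cols, margin=0.1):
    """Plan step: give every object its own cell in a simple grid."""
    rows = -(-len(labels) // cols)  # ceiling division
    cell_w, cell_h = canvas_w / cols, canvas_h / rows
    boxes = []
    for i, label in enumerate(labels):
        r, c = divmod(i, cols)
        boxes.append(Box(label,
                         (c + margin) * cell_w, (r + margin) * cell_h,
                         cell_w * (1 - 2 * margin), cell_h * (1 - 2 * margin)))
    return boxes


def overlaps(a, b):
    """True if two axis-aligned boxes intersect."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def verify(boxes, required_count):
    """Verify step: enforce the hard count and forbid merged regions."""
    problems = []
    if len(boxes) != required_count:
        problems.append(f"{len(boxes)}/{required_count} objects present")
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if overlaps(a, b):
                problems.append(f"{a.label} overlaps {b.label}")
    return problems


def render_prompts(boxes):
    """Render step (stand-in): one region-scoped instruction per object."""
    return [f"{b.label} in region x={b.x:.0f}, y={b.y:.0f}, "
            f"w={b.w:.0f}, h={b.h:.0f}" for b in boxes]


def plan_verify_render(labels, required_count, canvas=(1024, 512), cols=4):
    boxes = plan_grid(labels, *canvas, cols)
    problems = verify(boxes, required_count)
    if problems:
        # Refuse to render until the plan satisfies every hard constraint.
        raise ValueError("; ".join(problems))
    return render_prompts(boxes)


vehicles = [f"vehicle {i}" for i in range(1, 9)]
instructions = plan_verify_render(vehicles, required_count=8)
```

The point of the sketch is the ordering: the count and the separation of objects are checked on the plan, before anything is rendered, so a 7/8 result is rejected instead of being delivered.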
🌍 Why this matters
This would enable reliable technical diagrams, educational visuals, complex storytelling, and engineering concepts. These images communicate systems and structure, not just aesthetics.
🔗 Relation to 3D generation (secondary)
The same limitations apply to OpenAI’s 3D work (e.g. Shap-E). Missing or merged components can break entire models. A layout-first, constraint-aware approach could benefit both 2D images and structured 3D generation.
✅ Summary
This is not about prettier images, but about making generation controllable and reliable as complexity grows.