#“Deep Image Thinking” — Layout Planning, Object Counting & Multi-Pass Rendering


hearty pewter

I would like to share a limitation I repeatedly encountered while working on complex image compositions, and to suggest a possible improvement.
🔴 The problem
While generating images, I faced two related issues:

  1. Object counting
    I needed a single image with exactly eight distinct photovoltaic terraforming vehicles.
    Despite precise prompts and many iterations, the model often:
    - generated 7 instead of 8,
    - merged two vehicles,
    - or dropped one object.
    This happened especially when objects shared a similar style and materials.
  2. Multi-panel layouts
    I also requested an image made of five cubes, each showing three faces, for 15 scenes in total.
    Even with numbered prompts and clear descriptions, the model struggled to keep panels unique and logically consistent, especially at small scale.
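
One way to make the 15-scene request concrete is to enumerate every panel explicitly before prompting. The sketch below is only an illustration: the `Panel` type, the face names in `FACES`, and the helpers `build_panel_plan` and `plan_to_prompt_lines` are hypothetical, and it assumes nothing about how an image model would consume the resulting plan.

```python
from dataclasses import dataclass
from itertools import product

FACES = ("top", "left", "right")  # assumed names for the three visible faces


@dataclass(frozen=True)
class Panel:
    cube: int   # 1-based cube index
    face: str   # one of FACES
    scene: str  # short scene description


def build_panel_plan(scenes, n_cubes=5, faces=FACES):
    """Pair every (cube, face) slot with exactly one unique scene description."""
    slots = list(product(range(1, n_cubes + 1), faces))
    if len(scenes) != len(slots):
        raise ValueError(f"expected {len(slots)} scenes, got {len(scenes)}")
    if len(set(scenes)) != len(scenes):
        raise ValueError("scene descriptions must be unique")
    return [Panel(cube, face, scene) for (cube, face), scene in zip(slots, scenes)]


def plan_to_prompt_lines(plan):
    """Render the plan as numbered lines, one per panel."""
    return [
        f"{i}. cube {p.cube}, {p.face} face: {p.scene}"
        for i, p in enumerate(plan, start=1)
    ]
```

With five cubes and three faces the plan always has 5 × 3 = 15 slots, so a missing or duplicated scene is caught before any image is requested.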
🧠 Why this feels like a thinking limitation
These issues are not caused by vague prompts. They appear when the model needs to plan first, not just render.
Current behavior often feels like:
“Render everything at once and hope it fits.”
But these cases require:
Plan → verify → render.
💡 Proposed improvement: “Deep Image Thinking”
- layout/composition planning before rendering
- hard constraints (e.g. “8/8 objects must be present”)
- panel-aware generation (comic/page mode)
- multi-pass rendering (layout → validation → final image)
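
The layout → validation → final image idea can be sketched as a small control loop. This is a hedged illustration of the proposal, not a description of how any existing image model works: `plan_layout`, `render`, and `count_objects` are hypothetical callables supplied by the caller (for example a layout generator, an image model, and an object detector), and the retry policy is an assumption.

```python
from typing import Any, Callable, Optional, Sequence


def plan_verify_render(
    required: int,
    plan_layout: Callable[[int], Sequence[Any]],
    render: Callable[[Sequence[Any]], Any],
    count_objects: Callable[[Any], int],
    max_attempts: int = 3,
) -> Optional[Any]:
    """Return an image that satisfies the count constraint, or None."""
    for _ in range(max_attempts):
        layout = plan_layout(required)
        # Pass 1: verify the plan itself before spending a render on it.
        if len(layout) != required:
            continue
        # Pass 2: render, then check the result against the hard constraint.
        image = render(layout)
        if count_objects(image) == required:
            return image
    # Every attempt violated the constraint; report failure instead of
    # returning an image with the wrong number of objects.
    return None
```

The point of the design is that a result with 7 of 8 objects is never returned silently: it is either retried or reported as a failure.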
🌍 Why this matters
This would enable reliable technical diagrams, educational visuals, complex storytelling, and engineering concepts. These images communicate systems and structure, not just aesthetics.
🔗 Relation to 3D generation (secondary)
The same limitations apply to OpenAI’s 3D work (e.g. Shap-E). Missing or merged components can break entire models. A layout-first, constraint-aware approach could benefit both 2D images and structured 3D generation.
✅ Summary
This is not about prettier images, but about making generation controllable and reliable as complexity grows.
tired ibex

Assuming this would be a Pro feature, I see it benefiting only those who request images full of detailed objects from ChatGPT. The main issue is that it would consume a lot of data center processing capacity if thousands of people used it, making it unsustainable in the medium term. Therefore, I don’t support it 🫤👎🏻