#Better understanding of world
1 messages · Page 1 of 1 (latest)
Hi @modern dust. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
There isn’t anything in the provided knowledge sources that directly addresses your specific example (e.g., recognizing “a cup with a sealed top and no bottom is an upside‑down cup”), so I can’t say exactly when that kind of world understanding will be reliably solved.
What the sources do say is:
- Newer models like GPT‑4o have been updated with more recent training data and better multimodal (image + text) understanding, including improved spatial reasoning on benchmarks like MathVista and better interpretation of diagrams, charts, and layouts. [Updates to GPT‑4o]
- Vision models still have known limitations, including struggles with precise spatial localization and some visual elements, which is closely related to the kind of object‑orientation reasoning you’re describing. [Images & vision limits]
- Reasoning‑focused models in the o‑series (like o3 and o4‑mini) are explicitly designed to “think for longer” and are particularly strong at visual perception and complex reasoning, but the notes don’t claim they fully solve all such edge cases. [o3 and o4-mini]
So, based on these sources:
- OpenAI is actively improving both world knowledge and visual/spatial reasoning over time (e.g., successive GPT‑4o updates, and newer o‑series models).
- However, there is no specific timeline or guarantee in these documents that models will reliably handle every nuanced physical‑world inference like your cup example, only that multimodal and reasoning capabilities are steadily improving.
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@modern dust Kapa.ai is still learning and improving, please let me know how I did by reacting below.
@sage tapir what is main issue why some models understand world well, and others even after reasoning fail at kindergarten level tasks?
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources don’t directly explain why some models “understand the world” better than others or why they sometimes fail at very simple, kindergarten‑level tasks, so I can only answer partially and at a high level.
From what is covered, there are two big themes:
1. Different model families are optimized for different things
OpenAI distinguishes between:
- GPT models (e.g., GPT‑4.1, GPT‑4o) – “workhorses” optimized for speed, cost, and general instruction following on well‑defined tasks. [Reasoning best practices; Model evolution]
- o‑series reasoning models (o3, o4‑mini) – “planners” trained (via reinforcement learning) to think longer and handle complex, multi‑step reasoning, including math, coding, and visual tasks. [Reasoning best practices; o3 and o4-mini notes]
Because they’re optimized differently:
- A GPT model might be fast and good at many everyday tasks but still make shallow or brittle inferences.
- An o‑series model can reason more deeply, but that doesn’t guarantee it will never miss a simple commonsense or visual edge case.
So one core “issue” is simply design trade‑offs: some models are tuned for deep reasoning, others for speed and cost, and neither is perfect across all task types. [Reasoning best practices; Model‑intro matrix]
2. Visual and compositional reasoning are still hard, even for strong models
The docs explicitly note that:
- Visual tasks that require compositional reasoning (combining multiple pieces of information across an image, like charts, tables, or complex layouts) are a known bottleneck; models often need higher “reasoning effort” to do well. [Multimodal reasoning tips]
- Even advanced reasoning models can struggle with challenging visuals or ambiguous structure; that’s exactly why specialized reasoning‑plus‑vision models (like o‑series with vision) are emphasized for hard visual tasks. [Reasoning best practices – visual reasoning; o3 and o4-mini notes]
Your “kindergarten‑level” examples (like understanding an upside‑down cup) often sit at the intersection of:
- Perception (what’s in the image / description?)
- Spatial reasoning (orientation, inside/outside, top/bottom)
- Commonsense physics (what cups are like, how they’re used)
The sources make clear that models are improving at this kind of multi‑step, compositional reasoning, but they do not claim that current models have robust, human‑like commonsense across all such cases—only that o3/o4‑mini and newer GPT‑4o variants are significantly better than earlier generations on complex reasoning and visual benchmarks. [o3 and o4-mini notes; Model evolution]
So, based on the available docs:
- The main “issue” isn’t a single bug, but architectural and training trade‑offs: some models are tuned for deep, deliberate reasoning, others for fast general use.
- Even the reasoning‑optimized models still have known limitations on visual/spatial and compositional commonsense tasks, which is why OpenAI keeps iterating with newer o‑series and multimodal models rather than claiming the problem is solved.