Does GPT-4 include multimodal capabilities? | OpenAI | Page 1

stable acorn · 2024-03-05T12:33:39.109Z

Considering that GPT-4 can now interpret the meanings of images, many news articles and discussions seem to indicate that it is actually GPT-4V that represents true multimodality. Can anyone explain this correctly?

Hi @stable acorn. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

Yes, GPT-4 does include multimodal capabilities. Specifically, GPT-4 has the ability to accept both text and image inputs and output text. This is a significant advancement over previous models, which were limited to a single input modality, typically text.

The version of GPT-4 that includes vision capabilities is sometimes referred to as GPT-4V or gpt-4-vision-preview in the API. This model can take in images and answer questions about them, expanding the areas where models like GPT-4 can be used. It's important to note that GPT-4 Turbo with vision performs equally as well on text tasks as the GPT-4 Turbo preview model, but it has additional vision capabilities.

Images can be made available to the model in two main ways: by passing a link to the image or by passing the base64 encoded image directly in the request. Images can be passed in the user, system, and assistant messages.

However, it's worth noting that the free API tier currently does not have access to the GPT-4 series, including GPT-4V. Access to GPT-4 is available to ChatGPT Plus, Team, and Enterprise users.

For more information, you can refer to the following sources:

GPT-4 and GPT-4 Turbo (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo)
Vision Guide (https://platform.openai.com/docs/guides/vision)
Access to GPT-4 (https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4)
Forum Post on GPT-4V (https://community.openai.com/t/589706)

#Does GPT-4 include multimodal capabilities?

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon: