#gpt 4o near instantly analyze & understand videos?

uneven hatch
#

i've used it on multiple videos longer than 15 minutes and it even gives me timestamps for specific moments.

formal pierBOT
#

Hi @uneven hatch. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

Yes, GPT-4o can analyze and understand videos, but it doesn't process videos directly. Instead, frames are sampled from the video and provided to the model as images. This approach allows GPT-4o to generate summaries, answer questions, and perform other tasks based on the video content. However, it's important to note that as of May 2024, GPT-4o in the API does not yet support audio input. To process both the audio and visual elements of a video, GPT-4o is combined with Whisper.

Here's an example of how you can use Python packages like OpenCV and MoviePy to process videos for GPT-4o:

%pip install opencv-python --quiet
%pip install moviepy --quiet

And here's how you can extract frames from a video using OpenCV:

import base64

import cv2  # provided by the opencv-python package

video = cv2.VideoCapture("data/bison.mp4")

# Read every frame, encode it as JPEG, and store it as a base64 string
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

Once you have the frames, you can send them to GPT-4o for processing:

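from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            },
            # Send every 50th frame as a base64-encoded JPEG data URL
            *[
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{x}"}}
                for x in base64Frames[0::50]
            ],
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)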

Please note that the timestamps for specific moments are likely generated based on the sequence of frames provided to the model, and not because the model inherently understands the concept of video time.
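
To make that concrete: when every Nth frame is sampled, each frame's position in the video follows from its index and the frame rate, so timestamps could be passed to the model alongside the frames. The following is only a minimal sketch under that assumption. The helper name `sampled_frame_timestamps` is hypothetical and does not come from the sources, and the frame rate is assumed to be known (OpenCV exposes it as `cv2.CAP_PROP_FPS`).

```python
def sampled_frame_timestamps(total_frames: int, fps: float, step: int = 50) -> list[tuple[int, str]]:
    """Return (frame_index, "MM:SS") for every `step`-th frame of a video.

    Hypothetical helper for illustration; not part of OpenCV or the OpenAI SDK.
    """
    if fps <= 0 or step <= 0:
        raise ValueError("fps and step must be positive")
    stamps = []
    for index in range(0, total_frames, step):
        seconds = int(index / fps)  # whole seconds elapsed at this frame
        stamps.append((index, f"{seconds // 60:02d}:{seconds % 60:02d}"))
    return stamps


# Example: a 10-second clip at 30 fps, sampled every 50 frames
print(sampled_frame_timestamps(300, 30.0))
```

Each label could then be sent as a short text part before the matching frame, which may be one way a model ends up reporting times for specific moments.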

#

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

#

@uneven hatch kapa.ai is still learning and improving, please let us know how it did by reacting below

uneven hatch
#

@formal pier so normal gpt 4o from the mac desktop app doesn't process nor is aware of anything being spoken or mentioned through audio, everything is strictly through frame imaging?

formal pierBOT
#

Let me check the knowledge base for you. One moment! :books:

#

Yes. According to the provided sources (as of May 2024), GPT-4o in the API supports text and image inputs only. It does not directly process audio or video. For video understanding, it relies on frame sampling, where frames from the video are provided to the model as images.

However, it's important to note that OpenAI has plans to introduce audio capabilities to GPT-4o in the future. Once this feature is available, GPT-4o will be able to process audio data as well.

uneven hatch
#

@formal pier so how does normal gpt4o on the macos desktop app give me summaries and explain what's being said or discussed in videos?

formal pierBOT
#

Let me check the knowledge base for you. One moment! :books:

#

GPT-4o doesn't process videos directly. Instead, it analyzes videos from sampled frames that are provided as images. For the audio part, according to the provided sources (as of May 2024), GPT-4o in the API does not yet support audio input. To cover both the audio and visual elements of a video, GPT-4o is combined with Whisper (OpenAI's automatic speech recognition system).

Here's an example of how you might use GPT-4o and Whisper to generate a summary of a video:

  1. First, you would use a tool like OpenCV to extract frames from the video.
  2. Then, you would use Whisper to transcribe the audio from the video.
  3. Finally, you would send both the frames and the transcription to GPT-4o, which would generate a summary based on both the visual and audio inputs.

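Here's a code snippet that demonstrates the transcription and summary steps (for brevity, only the transcript is sent; the frames are omitted):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"
audio_path = "path/to/audio.mp3"  # placeholder: the audio track extracted from the video

# Transcribe the audio
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Summarize from the transcript; sampled frames can be added to "content" as image parts
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown."""},
        {"role": "user", "content": [
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"},
        ]},
    ],
    temperature=0,
)
print(response.choices[0].message.content)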
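
Step 3 in the list above (sending both frames and transcript in one request) could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name `build_summary_messages` is hypothetical, the frames are assumed to be base64-encoded JPEG strings, and `"detail": "low"` is an optional setting chosen here to keep image token usage down.

```python
def build_summary_messages(base64_frames: list[str], transcript_text: str) -> list[dict]:
    """Build a chat `messages` list containing sampled frames and a transcript.

    Hypothetical helper for illustration; frames are assumed to be
    base64-encoded JPEG strings.
    """
    frame_parts = [
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}", "detail": "low"},
        }
        for frame in base64_frames
    ]
    return [
        {
            "role": "system",
            "content": "You are generating a video summary. Create a summary of "
                       "the provided video and its transcript. Respond in Markdown.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *frame_parts,
                {"type": "text", "text": f"The audio transcription is: {transcript_text}"},
            ],
        },
    ]


# Example with dummy data; real frames would come from the OpenCV step
messages = build_summary_messages(["AAAA", "BBBB"], "A herd of bison crosses the road.")
print(len(messages[1]["content"]), "content parts")
```

The returned list can be passed as `messages=` to `client.chat.completions.create(...)` with `model="gpt-4o"`.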

Please note that the timestamps for specific moments are likely generated based on the sequence of frames provided to the model, and not because the model inherently understands the concept of video time.

uneven hatch
#

@formal pier when will screensharing be added and when can gpt 4o hear video links?

formal pierBOT
#

Let me check the knowledge base for you. One moment! :books:

#

Based on the provided knowledge sources, there is no specific information available about when screensharing will be added or when GPT-4o will be able to process audio from video links directly.

As of May 2024, GPT-4o supports text and vision modalities, and audio support is planned for the future. However, no specific timeline has been provided for this feature.

Please keep an eye on OpenAI's official announcements for the most accurate and up-to-date information.

left panther
#

ohhhh yes that's not video input; that's just reading the youtube video's transcript and summarizing it

uneven hatch
#

yeah. i dont think i mentioned it was input

versed pagoda
#

@formal pier how to get word timestamps with whisper open ai api?

formal pierBOT
#

Let me check the knowledge base for you. One moment! :books:

#

To get word-level timestamps with the Whisper API, you can use the timestamp_granularities[] parameter and set it to "word". This will enable a more structured and timestamped JSON output format, with timestamps at the word level. Here's an example of how you can do this in Python:

from openai import OpenAI

client = OpenAI()

# response_format must be "verbose_json" for timestamp granularities to apply
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(transcript.words)

This code prints the list of transcribed words, each with its start and end time in seconds. Please note that the audio file should be in a format that the Whisper API supports.
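
Once you have the word list, looking up when a particular word was spoken is a simple filter. Here's a minimal sketch, assuming each entry has been converted to a plain dict with `word`, `start`, and `end` keys (the SDK returns objects with attributes of the same names). The helper `find_word_times` and the sample data are hypothetical and only for illustration.

```python
import string


def find_word_times(words: list[dict], target: str) -> list[tuple[float, float]]:
    """Return (start, end) in seconds for each occurrence of `target`.

    Matching ignores case and surrounding punctuation. Hypothetical helper,
    not part of the OpenAI SDK.
    """
    wanted = target.lower()
    matches = []
    for entry in words:
        token = entry["word"].strip(string.punctuation + " ").lower()
        if token == wanted:
            matches.append((entry["start"], entry["end"]))
    return matches


# Made-up sample data shaped like word-level timestamp entries
sample_words = [
    {"word": "The", "start": 0.0, "end": 0.2},
    {"word": "bison", "start": 0.2, "end": 0.7},
    {"word": "and", "start": 0.7, "end": 1.0},
    {"word": "the", "start": 1.0, "end": 1.1},
    {"word": "herd", "start": 1.1, "end": 1.5},
]
print(find_word_times(sample_words, "the"))
```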

For more information, you can refer to the OpenAI API documentation (https://platform.openai.com/docs/guides/speech-to-text).
