You can just input multiple images and they can understand them (you can ask questions or give instructions). They support video understanding too, and I believe Oryx 1.5 and LLaVA-OneVision also support 3D understanding.
I think MiniCPM-V 2.6 is the only one supported by Ollama so far, and it's the one I prefer for vision understanding. These models are around 7B-8B in size. I forgot that there's also Qwen2-VL, which is pretty great (supports single images, multiple images, and videos as long as 20 min).
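If you want to try the multi-image part with Ollama, here's a rough sketch of what a request could look like. It assumes Ollama is running locally on its default port and that the model was pulled under the tag `minicpm-v` (check `ollama list` for the tag you actually have). The helper names `build_payload` and `ask` are just made up for this example, and how well the model handles several images at once is something you'd have to test yourself.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, images, model="minicpm-v"):
    """Build a non-streaming generate request carrying several images.

    `images` is a list of raw image bytes; the API expects each one
    base64-encoded in the "images" field.
    """
    return {
        "model": model,  # assumed tag; use whatever `ollama list` shows
        "prompt": prompt,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def ask(prompt, image_paths, model="minicpm-v"):
    """Read the image files, send them with the prompt, return the reply text."""
    images = []
    for path in image_paths:
        with open(path, "rb") as f:
            images.append(f.read())
    body = json.dumps(build_payload(prompt, images, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Something like `ask("What changed between these two screenshots?", ["before.png", "after.png"])` would then send both images in one prompt (the file names are placeholders).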











