#MultiGPU Support ollama main

11 messages · Page 1 of 1 (latest)

pastel summit
#

Hi all, forgive me if I'm posting this incorrectly. I'm new to discord.

I'm using an Unraid docker called Intel-IPEX-LLM-Ollama which includes the July 10th ollama portable (https://hub.docker.com/r/uberchuckie/ollama-intel-gpu > https://github.com/charlescng/ollama-intel-gpu > https://hub.docker.com/r/intelanalytics/ipex-llm-inference-cpp-xpu ) which works--mostly--by listing both ARC (A380/Card0 & A770/Card1) cards within my machine, and allowing processing to be sorted to the more powerful A770 correctly, but doesn't appear to pipeline-parallel (https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-Inference) a larger model across cards. Monitoring the processes and loads on both cards shows that intel_gpu_top reports ollama-bin usage on both cards simultaneously, but memory on card1 overflows and card0 remains empty and unused. I have tried with several of the supported models. But I recognize that I'm not using IPEX-llm in the 'correct' capacity on bare metal. I'm working within limitations of either Unraid 7.1.4 (Kernel 6.12.24) or the docker containers that I'm utilizing. But I've hit a wall on how else to trouble shoot this.

What I've tried:
I've tried passing various envars to the docker container. Some changes are reflected, but do not solve the issue:

NUM_GPUS=2
OLLAMA_NUM_PARALLEL=1 #(2)
OLLAMA_NUM_CTX=16384 #(8192)
DEVICE=Arc
ZES_ENABLE_SYSMAN=1
OLLAMA_NUM_GPU=999
OLLAMA_INTEL_GPU=1 #(=true, not num gpus)
IPEX_LLM_NUM_CTX=16384 #8192
device=/dev/dri
#device=/dev/dri/renderD128
#device=/dev/dri/renderD129
SYCL_DEVICE_CHECK=0
OLLAMA_KV_CACHE_TYPE=q8_0
tensor_parallel_size=2

and others. But these should be the most relevant.

#

What I know is that I'm running out of vram to run the selected model. The obvious answer is to use a smaller model. What I'm trying to do is get parallel piping working. Really, just for funsies. I'm early on in understanding how all of this works and trying to test limits. I understand that context adds to the vram requirements. I have tried models that are 16gb in size which should in theory offload context to the A380-6gb card.

Can anyone give me a tip on where to look or what I might be missing.

What ollama reports (what I believe are the relevant parts, anyway) are attached.

gentle wave
#

Lm studio has latest vulkan llama.cpp which supports gpt-oss models and pipeline paralell.

#

I run 3x A770s and its working for me.

pastel summit
#

I tried a vulkan branch of ollama, it's much slower than IPEX, but does work. I saw a comment that Intel is moving from preferring SYCL to Vulkan because it's faster, but I haven't seen that. While I can try lmstudio, I'd like to keep banging my head against the wall with ollama. I like openwebui.

gentle wave
#

You can use openwebui with lm studio, ollama- most projects which support openai api should work.

My project OpenArc works with openwebui and uses OpenVINO. The latest release notes have some discsussion about how I'm going to implement PIPELINE_PARALELL which is coming in next openvino genai release.
https://github.com/SearchSavior/OpenArc

GitHub

Lightweight Inference server for OpenVINO. Contribute to SearchSavior/OpenArc development by creating an account on GitHub.

pastel summit
#

Ah, I've seen your posts on reddit. I hadn't considered it because I was trying to actively troubleshoot the issue itself while trying to work within the confines of Unraid community apps, but I'm familiar enough with docker that I can probably make this work for my needs. Although, I will say, I'm more of a 'troubleshoot/tinker/and fix' kind of person so I was hoping to get a pointer on where to troubleshoot next. Maybe that's just not viable, because it doesn't appear that the ollama team is all that interested in supporting Arc.

Frankly, I'm pleased with the Arc compute performance. I started this journey with a Tesla P4 to just toy around with. The Arc cards are equally performant, which is good enough for text inferencing. Reading along allows me to validate the information generated in realtime.

pastel summit
#

@gentle wave Do you have a docker yaml for this?

gentle wave
#

Not yet but passing the dri directory in a compose will work better now that we have uv

#

Yeah llama.cpp ipex llm with llama server offers more of the customizations than with ollama ipex, I think but your mileage will probably still vary.

Hard agree, interest in Arc can be hard to gauge. The community around my project are, unfortunately, a minority. Most of the dev on the intel libs come from staff, and a guy who joined recently said in talks with intel engineers that business customer needs drive open source development priorities. To make changes to ipex involves knowing not just about drivers but the primitives and operators which describe model topology down to how they execute on devices from onednn, oneccl- oneapi specification family of libs. Not trivial knowledge to attain, especially without the community resources around cuda.
So usually a high effort issue might work better. It has for us with some pretty challenging problems.

Many times we have found that different features are in development and are usually pushed on a quarterly basis and trickle downstream from there.

#

Anyway, checkout our discord linked on git. Others there are actively working with Arc across the ecosystem and, imho, are a lively technical resource on this subject.