Hi all, forgive me if I'm posting this incorrectly. I'm new to Discord.
I'm using an Unraid Docker container called Intel-IPEX-LLM-Ollama, which includes the July 10th Ollama portable build (https://hub.docker.com/r/uberchuckie/ollama-intel-gpu > https://github.com/charlescng/ollama-intel-gpu > https://hub.docker.com/r/intelanalytics/ipex-llm-inference-cpp-xpu ). It mostly works: it lists both Arc cards in my machine (A380/card0 and A770/card1) and correctly routes processing to the more powerful A770. What it doesn't appear to do is pipeline-parallel (https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-Inference) a larger model across both cards.
While a model is loaded, intel_gpu_top reports ollama-bin usage on both cards simultaneously, but memory on card1 overflows while card0 remains empty and unused. I have tried several of the supported models with the same result.
I recognize that I'm not running IPEX-LLM the 'correct' way on bare metal; I'm working within the limitations of either Unraid 7.1.4 (kernel 6.12.24) or the Docker containers I'm using. But I've hit a wall on how else to troubleshoot this.
What I've tried:
I've tried passing various environment variables to the Docker container. Some of the changes are picked up, but none of them solve the issue:
NUM_GPUS=2
OLLAMA_NUM_PARALLEL=1 #(2)
OLLAMA_NUM_CTX=16384 #(8192)
DEVICE=Arc
ZES_ENABLE_SYSMAN=1
OLLAMA_NUM_GPU=999
OLLAMA_INTEL_GPU=1 #(=true, not num gpus)
IPEX_LLM_NUM_CTX=16384 #8192
device=/dev/dri
#device=/dev/dri/renderD128
#device=/dev/dri/renderD129
SYCL_DEVICE_CHECK=0
OLLAMA_KV_CACHE_TYPE=q8_0
tensor_parallel_size=2
And others, but these should be the most relevant.
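For reference, here is a minimal Python sketch of how the settings above would map onto a `docker run` command line. It makes a few assumptions: the `device=` entries are treated as `--device` mappings rather than environment variables, the commented-out alternatives are left out, and the image name is taken from the Docker Hub link above. `build_docker_args` is just a helper name made up for this illustration; it is not part of Docker, Ollama, or IPEX-LLM, and the Unraid template may assemble the command differently.

```python
import shlex

# Values copied from the list above (alternatives in comments omitted).
ENV_VARS = {
    "NUM_GPUS": "2",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_NUM_CTX": "16384",
    "DEVICE": "Arc",
    "ZES_ENABLE_SYSMAN": "1",
    "OLLAMA_NUM_GPU": "999",
    "OLLAMA_INTEL_GPU": "1",
    "IPEX_LLM_NUM_CTX": "16384",
    "SYCL_DEVICE_CHECK": "0",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "tensor_parallel_size": "2",
}

# Assumption: the whole /dev/dri directory is passed through, so both
# render nodes (renderD128 and renderD129) are visible in the container.
DEVICES = ["/dev/dri"]


def build_docker_args(env, devices, image):
    """Hypothetical helper: build a `docker run` argument list with one
    --device flag per device path and one -e flag per variable."""
    args = ["docker", "run"]
    for path in devices:
        args += ["--device", path]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be compared
    # against what the Unraid template actually generates.
    print(shlex.join(build_docker_args(ENV_VARS, DEVICES, "uberchuckie/ollama-intel-gpu")))
```

Printing the command rather than executing it makes it easy to diff against the container's actual settings and confirm which variables and devices are really reaching it.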