llamafile-0.8.13 on Windows fails to allocate CUDA memory | AI @ Mozilla | Page 1

narrow sleet Aug 21, 2024, 5:09 PM

OS: Windows 11 Home
Hardware: i7 processor, GeForce RTX 4060 Ti 16GB

I get an error when trying to use the GPU in either Windows or WSL:

 .\llamafile-0.8.13.exe -ngl 9999 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf

...

llama_kv_cache_init:      CUDA0 KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8480.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
{"function":"load_model","level":"ERR","line":452,"model":"Meta-Llama-3.1-8B-Instruct-Q6_K.gguf","msg":"unable to load model","tid":"11681088","timestamp":1724259752}

However, this llamafile works without errors:

.\llava-v1.5-7b-q4.llamafile.exe -ngl 9999

I've updated the Nvidia driver, and I'm not sure where to look next.

pallid roost Aug 21, 2024, 5:34 PM

Try passing -c 2048 to use a smaller context window.

narrow sleet Aug 21, 2024, 6:01 PM

Thank you! Worked like a charm.

Is there a formula to determine the max size of a context window given the size of memory available?

pallid roost Aug 21, 2024, 6:20 PM

There is, but I don't know it off the top of my head. What I'd do is binary search for the largest context size that doesn't run out of memory on my GPU rig.
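
For the KV cache part of the memory, a rough estimate can be sketched as below. The default values are the Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128) with an f16 cache; the function names are made up for illustration. The estimate ignores the model weights and the compute buffers, which also take GPU memory, so treat the result as an upper bound on the context that fits rather than an exact answer.

```python
MIB = 1024 ** 2

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed for the K and V caches at context length n_ctx.

    Defaults assume Llama 3.1 8B with an f16 cache (2 bytes per element).
    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

def max_ctx_for_budget(budget_bytes, **model):
    """Largest context whose KV cache alone fits in budget_bytes."""
    per_token = kv_cache_bytes(1, **model)
    return budget_bytes // per_token

# 131072 tokens (128k) reproduces the 16384.00 MiB KV size from the log.
print(kv_cache_bytes(131072) / MIB)  # 16384.0
# With -c 2048 the KV cache shrinks to 256 MiB.
print(kv_cache_bytes(2048) / MIB)    # 256.0
```

On a 16 GB card, the 128k default needs 16 GiB for the KV cache alone, before the weights and the compute buffer are counted, which is consistent with the out-of-memory error above.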

The default is 128k. Try 64k. If that fails, try 32k. If that works, try 48k, and so on.
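
The search described above can be sketched in a few lines. This is only an illustration: `fits` is a hypothetical callback that you would supply, for example by launching the llamafile with `-c <n_ctx>` and checking whether the model loads. The sketch assumes that if a context size fits, every smaller one fits too. The demo at the bottom uses a simulated memory limit rather than a real GPU.

```python
def max_fitting_context(fits, lo=512, hi=131072, step=512):
    """Return the largest n_ctx in [lo, hi] (a multiple of step) for which
    fits(n_ctx) is True, or None if nothing in the range fits.

    Assumes fits is monotonic: if n fits, every smaller n fits as well.
    """
    lo_k, hi_k = lo // step, hi // step
    best = None
    while lo_k <= hi_k:
        mid = (lo_k + hi_k) // 2
        if fits(mid * step):
            best = mid * step   # mid works, look for something larger
            lo_k = mid + 1
        else:
            hi_k = mid - 1      # mid is too big, look lower
    return best

# Simulated check: pretend anything up to 40000 tokens fits in memory.
def simulated_fits(n_ctx, limit=40000):
    return n_ctx <= limit

print(max_fitting_context(simulated_fits))  # 39936
```

With a 512-token step and a 128k upper bound, the loop needs at most 9 trial loads.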
