OS: Windows 11 Home
Hardware: i7 processor, GeForce RTX 4060 Ti 16GB
I get an error when trying to use the GPU in either Windows or WSL:
.\llamafile-0.8.13.exe -ngl 9999 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
...
llama_kv_cache_init: CUDA0 KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8480.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
{"function":"load_model","level":"ERR","line":452,"model":"Meta-Llama-3.1-8B-Instruct-Q6_K.gguf","msg":"unable to load model","tid":"11681088","timestamp":1724259752}
However, this llamafile works without errors:
.\llava-v1.5-7b-q4.llamafile.exe -ngl 9999
I've updated the NVIDIA driver, but I'm not sure where to look next.
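
For what it's worth, here is a rough back-of-the-envelope check of the KV buffer sizes reported in the log above. It is only a sketch: the layer count (32), KV head count (8), head dimension (128), and 131072-token context are assumed values for Llama 3.1 8B and do not appear in the log. With those assumptions the arithmetic reproduces the reported 8192.00 MiB each for K and V, which would mean the KV cache alone takes about 16 GiB, the full memory of this card, before the 8480.00 MiB compute buffer is requested.

```python
# Rough f16 KV-cache size estimate.
# ASSUMPTION: the default model dimensions below are assumed values for
# Llama 3.1 8B; they are not read from the log or from the GGUF file.
MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_layer=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Return (K, V, total) cache sizes in MiB for a given context length."""
    # One K (or V) entry per token, per layer, per KV head, per head dimension.
    per_tensor_bytes = n_ctx * n_layer * n_kv_heads * head_dim * bytes_per_elem
    k_mib = per_tensor_bytes / MIB
    return k_mib, k_mib, 2 * k_mib


# Compare the assumed full context with a much smaller one.
for n_ctx in (131072, 8192):
    k, v, total = kv_cache_mib(n_ctx)
    print(f"n_ctx={n_ctx}: K={k:.2f} MiB, V={v:.2f} MiB, total={total:.2f} MiB")
```

If the assumptions hold, the estimate scales linearly with context length, so a smaller context size would shrink the KV cache proportionally (about 1 GiB at 8192 tokens).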