#Is there any colab notebook for inferencing unsloth model using vllm
yes
well i used it with aphrodite
but
same stuff
vllm has example notebooks, are they not working?
Thank you! Would you mind pointing me to the link? They have some code. I tried it, but it does not work with my Unsloth fine-tune. I will share what I used later.
its the same one you found, the official vllm notebook
and all models should work with it
if it doesnt you might've exported it wrong
whats the error?
Here is the colab
https://colab.research.google.com/drive/1C4j9ZmfTV83MbE7LnNWmlQiA5euKFjAp?usp=sharing
The error is
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
NVMLError_LibraryNotFound Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pynvml.py in _nvmlCheckReturn(ret)
977 def _nvmlCheckReturn(ret):
978 if (ret != NVML_SUCCESS):
--> 979 raise NVMLError(ret)
980 return ret
981
NVMLError_LibraryNotFound: NVML Shared Library Not Found
does it work with other, conventional models?
@ember belfry oh my, I got the error because I didn't set T4 as the runtime type; I was using CPU.
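A quick way to catch this before importing vLLM is to check whether the NVML library named in the traceback can be loaded at all. This is only a minimal sketch using the standard library; the helper name `nvml_available` is made up for illustration, and a successful load does not by itself guarantee a usable GPU.

```python
import ctypes


def nvml_available(lib_name: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the NVML shared library can be loaded.

    Hypothetical helper: it tries to load the same library file that the
    traceback above reports as missing.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Same failure class as in the traceback: the shared object is missing.
        return False
    return True


if nvml_available():
    print("NVML found, an NVIDIA driver seems to be visible")
else:
    print("NVML not found, the runtime is probably CPU-only; switch to a GPU runtime")
```

On a CPU-only Colab runtime this would be expected to take the second branch.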
Now I got
/usr/local/lib/python3.10/dist-packages/vllm/config.py in verify_with_scheduler_config(self, scheduler_config)
1345 def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig):
1346 if scheduler_config.max_num_batched_tokens > 65528:
-> 1347 raise ValueError(
1348 "Due to limitations of the custom LoRA CUDA kernel, "
1349 "max_num_batched_tokens must be <= 65528 when "
ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled.
When I tried `llm = LLM(model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", enable_lora=True, enforce_eager=True)`
If I tried `llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)` I got
OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 17.06 MiB is free. Process 19553 has 14.73 GiB memory in use. Of the allocated memory 14.60 GiB is allocated by PyTorch, and 4.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Oh VLLM?
i technically do have code
for the first error, just set `max_num_batched_tokens=65528`
I got
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 229.06 MiB is free. Process 2212 has 14.52 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 16.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
with this line
`llm = LLM(model="pacozaa/mistral-sharegpt90k-merged_16bit",enforce_eager=True,max_num_batched_tokens = 65528)`
yes
the colab T4 doesn't have enough vram to load this model in 16 bit
its not possible
sorry for not telling you earlier
you have to either use swap (slow) or go to kaggle 2x t4 and do some very weird ray fixing (buggy)
or quant it
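For a rough sense of why the 16-bit load fails, here is a back-of-the-envelope estimate of the weight memory alone. It is a sketch under assumptions: the parameter count of about 7.24 billion is an approximation for a Mistral-7B-class model (the thread does not state the exact size), the 14.75 GiB capacity is taken from the error message above, and real 4-bit checkpoints use somewhat more than 4 bits per parameter because of quantization metadata and unquantized layers.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


T4_CAPACITY_GIB = 14.75   # total capacity reported in the OutOfMemoryError
ASSUMED_PARAMS = 7.24e9   # assumption: approximate size of a Mistral-7B-class model

for bits in (16, 4):
    need = weight_gib(ASSUMED_PARAMS, bits)
    headroom = T4_CAPACITY_GIB - need
    # Headroom has to cover the KV cache, activations and CUDA overhead too.
    print(f"{bits}-bit weights: {need:.2f} GiB, headroom on T4: {headroom:.2f} GiB")
```

Under these assumptions the 16-bit weights nearly fill the card before vLLM reserves any KV cache, while a 4-bit version leaves most of the memory free.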
Hmm it's ok. I will try with 4bit tomorrow.
Guys, can you share code showing how to use a 4bit model with vllm?
@rigid spindle @ember belfry @shrewd hornet @trail haven
Finetuned from model : unsloth/llama-3-8b-bnb-4bit
Any idea guys??
Found solution: https://docs.vllm.ai/en/latest/quantization/bnb.html
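For anyone landing here later, this is roughly what loading a bitsandbytes 4-bit checkpoint looks like, based on the `quantization="bitsandbytes"` and `load_format="bitsandbytes"` arguments that the linked page described. Treat it as a sketch, not a tested recipe: the model ID is just the base model mentioned above (swap in your own fine-tune), `dtype="half"` is my assumption for a T4 since it lacks bfloat16 support, `max_model_len=2048` is an arbitrary value to keep the KV cache small, and the helper names are made up. Check the page against your installed vLLM version, since the required arguments may differ.

```python
def bnb_llm_kwargs(model_id: str, max_model_len: int = 2048) -> dict:
    """Keyword arguments for vllm.LLM to load a bitsandbytes 4-bit checkpoint.

    Hypothetical helper; the values are assumptions for a single T4.
    """
    return {
        "model": model_id,
        "quantization": "bitsandbytes",
        "load_format": "bitsandbytes",
        "dtype": "half",                 # assumption: T4 has no bfloat16 support
        "max_model_len": max_model_len,  # assumption: small context to save VRAM
        "enforce_eager": True,           # skip CUDA graph capture, as earlier in the thread
    }


def run_demo(model_id: str = "unsloth/llama-3-8b-bnb-4bit") -> None:
    """Load the model and generate once. Needs a GPU runtime with vLLM installed."""
    # Imported here so the helper above can be inspected without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**bnb_llm_kwargs(model_id))
    params = SamplingParams(temperature=0.8, max_tokens=64)
    for out in llm.generate(["Hello, my name is"], params):
        print(out.outputs[0].text)
```

Call `run_demo()` in a GPU runtime to try it; nothing runs on import.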