Is there any colab notebook for inferencing unsloth model using vllm | Unsloth AI | Page 1

rigid spindle Aug 14, 2024, 5:08 PM

#

Hi, I have been trying to use an Unsloth fine-tuned model with vLLM, still with no success. Has anyone successfully used Unsloth with vLLM before?

ember belfry Aug 14, 2024, 9:55 PM

#

yes

#

well i used it with aphrodite

#

but

#

same stuff

#

vllm has example notebooks, are they not working?

rigid spindle Aug 14, 2024, 11:26 PM

#

Thank you! Would you mind pointing me to the link? They have some code. I tried it, but it does not work with an Unsloth fine-tune. I will share what I used later.

ember belfry Aug 14, 2024, 11:57 PM

#

its the same one you found, the official vllm notebook

#

and all models should work with it

#

if it doesnt you might've exported it wrong

#

whats the error?

rigid spindle Aug 15, 2024, 2:49 AM

#

Here is the colab
https://colab.research.google.com/drive/1C4j9ZmfTV83MbE7LnNWmlQiA5euKFjAp?usp=sharing
The error is

OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

NVMLError_LibraryNotFound                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pynvml.py in _nvmlCheckReturn(ret)
    977 def _nvmlCheckReturn(ret):
    978     if (ret != NVML_SUCCESS):
--> 979         raise NVMLError(ret)
    980     return ret
    981 

NVMLError_LibraryNotFound: NVML Shared Library Not Found


ember belfry Aug 15, 2024, 2:50 AM

#

does it work with other, conventional models?

rigid spindle Aug 15, 2024, 4:01 AM

#

@ember belfry oh my, I got the error because I didn't set T4 as the runtime type; I was using CPU.
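
The `NVMLError_LibraryNotFound` above is what `pynvml` raises when it cannot load the NVIDIA driver library, which is the expected outcome on a CPU-only runtime. A minimal pre-flight check might look like the sketch below. It assumes a Linux runtime such as Colab, and the helper name `nvml_available` is made up for illustration.

```python
import ctypes


def nvml_available() -> bool:
    """Return True if the NVIDIA management library can be loaded.

    This is the same shared object named in the OSError above. On a
    CPU-only runtime (or a non-Linux machine) loading it fails.
    """
    try:
        ctypes.CDLL("libnvidia-ml.so.1")
        return True
    except OSError:
        return False


if nvml_available():
    print("NVML found: a GPU runtime appears to be attached.")
else:
    print("NVML not found: switch the runtime type to a GPU (e.g. T4) first.")
```

Running this before importing vLLM gives a clearer message than the nested traceback.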

#

Now I got

/usr/local/lib/python3.10/dist-packages/vllm/config.py in verify_with_scheduler_config(self, scheduler_config)
   1345     def verify_with_scheduler_config(self, scheduler_config: SchedulerConfig):
   1346         if scheduler_config.max_num_batched_tokens > 65528:
-> 1347             raise ValueError(
   1348                 "Due to limitations of the custom LoRA CUDA kernel, "
   1349                 "max_num_batched_tokens must be <= 65528 when "

ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled.

I got this when I tried `llm = LLM(model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", enable_lora=True, enforce_eager=True)`

#

If I tried `llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)`, I got

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 17.06 MiB is free. Process 19553 has 14.73 GiB memory in use. Of the allocated memory 14.60 GiB is allocated by PyTorch, and 4.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

shrewd hornet Aug 15, 2024, 7:19 AM

#

Oh VLLM?

#

i technically do have code

#

on the first error just set max_num_batched_tokens = 65528
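
A rough sketch of that suggestion follows. The limit 65528 comes from the `ValueError` quoted earlier. Capping `max_model_len` as well is an assumption on my part: as far as I know, vLLM expects `max_num_batched_tokens` to be at least `max_model_len` unless chunked prefill is enabled, and long-context models default to a much larger value. The helper names `lora_engine_kwargs` and `build_llm` are hypothetical.

```python
LORA_MAX_BATCHED_TOKENS = 65528  # limit quoted in the ValueError


def lora_engine_kwargs(model: str, max_model_len: int = 4096) -> dict:
    """Build keyword arguments for vllm.LLM with LoRA enabled."""
    if max_model_len > LORA_MAX_BATCHED_TOKENS:
        raise ValueError(
            f"max_model_len={max_model_len} exceeds the LoRA limit "
            f"of {LORA_MAX_BATCHED_TOKENS} batched tokens"
        )
    return {
        "model": model,
        "enable_lora": True,
        "enforce_eager": True,
        # Assumption: keep the context length under the batched-token cap.
        "max_model_len": max_model_len,
        "max_num_batched_tokens": LORA_MAX_BATCHED_TOKENS,
    }


def build_llm(**kwargs):
    """Create the engine. Imported lazily because vLLM needs a GPU runtime."""
    from vllm import LLM

    return LLM(**kwargs)


kwargs = lora_engine_kwargs("meta-llama/Llama-2-7b-hf")
print(kwargs)
# On a GPU runtime: llm = build_llm(**kwargs)
```

This only addresses the batched-token check. It does not reduce memory use, so an out-of-memory error can still follow.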

rigid spindle Aug 15, 2024, 9:54 AM

#

I got
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 229.06 MiB is free. Process 2212 has 14.52 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 16.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

with this line
`llm = LLM(model="pacozaa/mistral-sharegpt90k-merged_16bit",enforce_eager=True,max_num_batched_tokens = 65528)`

ember belfry Aug 15, 2024, 11:24 AM

#

yes

#

colab doesnt have enough vram to load a 16 bit model

#

its not possible

#

sorry for not telling you earlier

#

you have to either use swap (slow) or go to kaggle 2x t4 and do some very weird ray fixing (buggy)

#

or quant it
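
The arithmetic behind this can be sketched as below. The 14.75 GiB capacity is taken from the error message. The parameter count (about 7.24 billion, typical for a Mistral-7B-class model) and the 0.9 GPU memory fraction (vLLM's default `gpu_memory_utilization`, as far as I know) are assumptions, and the estimate covers weights only, ignoring the KV cache and activations.

```python
GIB = 1024 ** 3

T4_CAPACITY_GIB = 14.75  # from the OutOfMemoryError message
GPU_MEMORY_FRACTION = 0.9  # assumption: vLLM's default gpu_memory_utilization
N_PARAMS = 7.24e9  # assumption: approximate size of a Mistral-7B-class model


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


budget = T4_CAPACITY_GIB * GPU_MEMORY_FRACTION
for bits in (16, 4):
    need = weight_gib(N_PARAMS, bits)
    verdict = "fits" if need < budget else "does not fit"
    print(f"{bits}-bit weights: {need:.2f} GiB, budget {budget:.2f} GiB -> {verdict}")
```

Under these assumptions the 16-bit weights come to roughly 13.5 GiB, which already exceeds the budget before any KV cache is allocated, while 4-bit weights need only a few GiB.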

rigid spindle Aug 15, 2024, 4:24 PM

#

Hmm, it's ok. I will try with 4-bit tomorrow.

unkempt ocean Aug 19, 2024, 5:02 PM

#

Guys, can you share code showing how to use a 4-bit model with vllm?

#

@rigid spindle @ember belfry @shrewd hornet @trail haven

#

Finetuned from model : unsloth/llama-3-8b-bnb-4bit

#

Any idea guys??

unkempt ocean Aug 20, 2024, 4:06 PM

#

Found solution: https://docs.vllm.ai/en/latest/quantization/bnb.html
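
For readers who do not follow the link, here is a minimal sketch of what that page describes: loading a bitsandbytes 4-bit checkpoint by passing `quantization="bitsandbytes"` and `load_format="bitsandbytes"` to `LLM`. The model id is the base model named above, standing in for the actual fine-tune. The `dtype="half"` and `max_model_len=4096` settings are my assumptions for a T4, which does not support bfloat16, and the helper names are hypothetical. Support depends on the vLLM version installed.

```python
def bnb_engine_kwargs(model: str) -> dict:
    """Keyword arguments for vllm.LLM to load a bitsandbytes 4-bit checkpoint."""
    return {
        "model": model,
        "quantization": "bitsandbytes",
        "load_format": "bitsandbytes",
        "dtype": "half",  # assumption: T4 GPUs lack bfloat16 support
        "enforce_eager": True,
        "max_model_len": 4096,  # assumption: keeps the KV cache small
    }


def run_bnb_inference(prompt: str, model: str = "unsloth/llama-3-8b-bnb-4bit") -> str:
    """Generate a completion. Requires a GPU runtime with vLLM installed."""
    from vllm import LLM, SamplingParams  # lazy import: needs CUDA

    llm = LLM(**bnb_engine_kwargs(model))
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


print(bnb_engine_kwargs("unsloth/llama-3-8b-bnb-4bit"))
# On a GPU runtime: print(run_bnb_inference("The capital of France is"))
```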
