# Distributing a model across multiple GPUs using vLLM


balmy flax
#

vLLM has a parameter, TENSOR_PARALLEL_SIZE, to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.

inner wagon
#

cc: @edgy ocean

edgy ocean
#

You don't need it, as it's automatically set to the number of GPUs on the worker.

daring ocean
#

If I'm not wrong, it works with 8 GPUs but not 6.

edgy ocean
#

Yeah, that’s a vLLM issue; it doesn’t allow 6 or 10.

tropic viper
#

vLLM specifically says 64 / (GPU count) must leave no remainder.

#

So: 1, 2, 4, 8, 16, 32, and 64.
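
For anyone who wants to check a GPU count quickly, here is a minimal sketch of the divisibility rule as described above. The value 64 is taken straight from the message; it probably corresponds to a model-specific quantity such as the attention-head count, but that is an assumption, and `valid_gpu_counts` is a hypothetical helper, not part of vLLM.

```python
def valid_gpu_counts(total: int = 64) -> list[int]:
    """Return every GPU count that divides `total` with no remainder.

    `total` defaults to 64, the figure quoted in the thread; substitute
    the relevant value for your own model (assumption: the rule is a
    plain divisibility check).
    """
    return [n for n in range(1, total + 1) if total % n == 0]


if __name__ == "__main__":
    counts = valid_gpu_counts()
    print(counts)
    # 6 and 10 do not divide 64, which matches the failures reported above.
    print(6 in counts, 10 in counts)
```

Under this rule, 6 and 10 are rejected simply because 64 is not divisible by them, while 8 is accepted.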

daring ocean
#

Ah, that sucks.