#very slow inference time
To increase inference speed, you can try quantizing and pruning the model. What framework are you using to create the model? Different frameworks have different ways to quantize a model. If I know which framework you used, I can share some resources that you can follow.
I'm using a PyTorch model.
But if I use quantization, that will decrease the model's performance by a lot, right?
Not necessarily. A more advanced technique would be to do model distillation where you train a smaller model using the larger model as the "teacher". But that requires a lot of GPU power.
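To make the "teacher" idea concrete, here is a minimal sketch of the soft-target loss commonly used for distillation, written in plain Python so it runs without a GPU. The logits, the temperature of 2.0, and the function names are made up for illustration; a real setup would compute this on batches inside a training loop and usually combine it with the normal loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on the softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a single 3-class example.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

print("good student loss:", distillation_loss(good_student, teacher))
print("bad student loss: ", distillation_loss(bad_student, teacher))
```

The smaller model is trained to push this loss down, so it learns to reproduce the teacher's full output distribution rather than only the top label.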
I will share some resources for that in a while. Also, with quantization the model's accuracy may decrease a little, but it's always a tradeoff between speed and accuracy. If you have a use case that needs lower inference time, you can go for optimization techniques, sacrificing ~1-2% accuracy for what can be a 10x+ decrease in inference time.
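As a rough illustration of why the accuracy drop is usually small, here is a sketch of what int8 quantization does to a handful of weights. It assumes simple symmetric per-tensor quantization, and the weight values are made up; real frameworks use more refined schemes (per-channel scales, zero points, calibration), so treat this only as the basic arithmetic.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the int8 values back to approximate floats."""
    return [x * scale for x in quantized]


# Made-up example weights, not taken from any real model.
weights = [0.823, -0.417, 0.052, -1.27, 0.336]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight moves by at most half a quantization step (`scale / 2`), which is why the outputs often change only slightly. How much that costs in accuracy depends on the model and has to be measured on your own evaluation set.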
Also, you can try distillation like DrDub suggested, but yeah, it needs a lot of GPU power and some initial work to set up the pipeline.
Hey guys, could you please share some articles/guides with me on quantizing bigger language models in PyTorch?
?
When are you going to share the resources?
Ahh, I forgot to share (thanks to overtime every day 🥲). I actually worked mostly with image models earlier and quantized them with an approach similar to this one:
https://gist.github.com/martinferianc/d6090fffb4c95efed6f1152d5fde079d
If you have created a custom torch model, then you can go through this:
https://pytorch.org/docs/stable/quantization.html
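For a custom torch model, the quickest thing to try from that page is probably dynamic quantization. The sketch below assumes a PyTorch version that still ships `torch.ao.quantization.quantize_dynamic` and a CPU backend with quantized kernels. The tiny `nn.Sequential` model is a hypothetical stand-in for your own network, and `dynamic_quantize_demo` is just an illustrative name.

```python
def dynamic_quantize_demo():
    # Imported inside the function so this sketch can be loaded
    # even on a machine where PyTorch is not installed.
    import torch
    import torch.nn as nn
    from torch.ao.quantization import quantize_dynamic

    # Hypothetical stand-in for "your custom torch model".
    model = nn.Sequential(
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 10),
    ).eval()

    # Swap nn.Linear layers for versions with int8 weights;
    # activations are quantized on the fly at inference time (CPU only).
    quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    # Compare float and quantized outputs on random input.
    x = torch.randn(4, 128)
    with torch.no_grad():
        max_diff = (model(x) - quantized(x)).abs().max().item()
    return quantized, max_diff
```

Calling `dynamic_quantize_demo()` returns the quantized model and the largest output difference against the float model. For a real model you would compare accuracy and latency on your own data instead of random input.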
For language models, if you are using Hugging Face, then I think you can convert them easily:
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization
Also, I remember optimizing a translation model with this earlier:
https://github.com/OpenNMT/CTranslate2
Very nice. Thanks for sharing.
That's nice, thank you.
But if I convert the model to an ONNX model, then I can no longer use it with the Hugging Face pipeline.
Oh, never mind, I solved that issue.
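The thread does not say how the pipeline issue was solved. One documented route, sketched here as an assumption rather than the poster's actual fix, is to load the ONNX model through Optimum's `ORTModel` classes, which the regular `transformers` pipeline accepts. The text-classification task, the model id, and the function name `build_onnx_pipeline` are placeholders for illustration, and `export=True` assumes a reasonably recent Optimum release.

```python
def build_onnx_pipeline(model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    # Imported inside the function so this sketch can be loaded
    # without optimum / transformers installed.
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # export=True converts the PyTorch checkpoint to ONNX while loading.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

    # The ONNX Runtime model is passed to the pipeline like a normal model.
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Usage would look like `classifier = build_onnx_pipeline()` followed by `classifier("some text")`. Other tasks need the matching `ORTModelFor...` class and pipeline task name.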