#very slow inference time


robust fox
#

I have a text summarization model that is about 900 MB, but for some reason inference takes 15-20 minutes on CPU. Is this usual? Are there any ways to speed up inference on CPU?

fiery venture
robust fox
#

I'm using a PyTorch model

#

But if I use quantization, that will decrease the model's performance by a lot, right?

clever cobalt
#

Not necessarily. A more advanced technique would be to do model distillation where you train a smaller model using the larger model as the "teacher". But that requires a lot of GPU power.
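
To make the teacher/student idea concrete, here is a minimal, framework-free sketch of a common distillation loss: the KL divergence between temperature-softened teacher and student outputs. The function names, the example logits, and the temperature of 2.0 are illustrative assumptions, not anything from this thread. In a real setup this term is usually combined with the normal task loss and computed with PyTorch tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a single 3-class prediction.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print("student close to teacher:", distillation_loss(close_student, teacher))
print("student far from teacher:", distillation_loss(far_student, teacher))
```

The student is trained to push this loss down, so a student whose outputs already resemble the teacher's gets a smaller value than one that disagrees with it.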

fiery venture
# robust fox I'm using a PyTorch model

I will share some resources for that in a bit. With quantization, model accuracy may decrease a little, but it's always a tradeoff between speed and accuracy. If your use case needs lower inference time, you can go for optimization techniques, sacrificing ~1-2% accuracy for a 10x+ decrease in inference time.

You can also try distillation like DrDub suggested, but that needs a lot of GPU power and some initial work to set up the pipeline.
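
For the quantization route, one low-effort starting point in PyTorch is dynamic quantization, which converts `nn.Linear` weights to int8 without retraining. The sketch below is only an illustration: the small `nn.Sequential` model is a hypothetical stand-in for the real summarization model, and the layer sizes are arbitrary. It assumes a CPU build of PyTorch with a quantized backend available. The actual speedup and accuracy change depend on the model and hardware, so both should be measured on your own data.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model: any module containing nn.Linear layers.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
).eval()

# Replace nn.Linear layers with dynamically quantized int8 versions.
# Weights are stored as int8; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compare the float and quantized models on the same random input.
x = torch.randn(8, 512)
with torch.no_grad():
    ref = model(x)
    out = quantized(x)

max_abs_diff = (ref - out).abs().max().item()
print(f"max abs difference vs. float model: {max_abs_diff:.4f}")
```

Comparing outputs (or, better, a task metric such as ROUGE for summarization) before and after is a simple way to check how much quality is actually lost.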

robust fox
#

Hey guys, could you please share some articles/guides with me on quantizing bigger language models in PyTorch?

robust fox
#

?

robust fox
fiery venture
# robust fox When are you going to share the resources?

Ahh, I forgot to share (thanks to overtime every day 🥲). I mostly worked with image models earlier and quantized them using an approach similar to this one:
https://gist.github.com/martinferianc/d6090fffb4c95efed6f1152d5fde079d

If you have created a custom torch model, you can go through this:
https://pytorch.org/docs/stable/quantization.html

For a language model, if you are using Hugging Face, I think you can convert it easily:
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization

Also, I remember optimizing a translation model with this earlier:
https://github.com/OpenNMT/CTranslate2

clever cobalt
#

Very nice. Thanks for sharing.

robust fox
#

That's nice, thank you

robust fox
#

But if I convert the model to an ONNX model, then I can no longer use it with the Hugging Face pipeline

robust fox
#

Oh, never mind, I solved that issue