very slow inference time | Learn AI Together | Page 1

robust fox Dec 8, 2022, 8:34 AM

#

I have a text summarization model that is about 900 MB, but for some reason inference takes 15-20 minutes on CPU. Is this usual? Are there any ways to speed up inference on CPU?

fiery venture Dec 9, 2022, 6:35 AM

#

robust fox I have a text summarization model, that is about 900mb's big, but for some reaso...

To increase inference speed, you can try quantizing and pruning the model. What framework are you using to create the model? Different frameworks have different ways to quantize a model. If I know which framework you used, I can share some resources that you can follow.

robust fox Dec 9, 2022, 6:41 PM

#

I'm using a PyTorch model.

#

But if I use quantization, that will decrease the model's performance by a lot, right?

clever cobalt Dec 9, 2022, 11:51 PM

#

Not necessarily. A more advanced technique would be to do model distillation where you train a smaller model using the larger model as the "teacher". But that requires a lot of GPU power.
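To make the "teacher" idea concrete, here is a minimal sketch of the soft-target loss commonly used in distillation: the student is trained to match the teacher's temperature-softened output distribution. It is written in plain Python for readability; the function names, the temperature of 2.0, and the example logits are illustrative assumptions, not something from this thread. A real setup would compute the same quantity on batched tensors inside a training loop.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class output.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_same, loss_close, loss_far)
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge, which is the signal the smaller model is trained on.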

fiery venture Dec 10, 2022, 8:34 AM

#

robust fox Im using a pytorch model

I will share some resources for that in a while. Also, with quantization the model's accuracy may decrease a little, but it's always a tradeoff between speed and accuracy. If you have a use case that requires low inference time, you can go for optimization techniques, sacrificing ~1-2% accuracy for a 10x+ decrease in inference time.

Also, you can try distillation like DrDub suggested, but yes, it needs a lot of GPU power and some initial work to set up the pipeline.
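As a rough illustration of why quantization usually costs only a little accuracy, the sketch below rounds a few float weights to int8 codes and measures how far the restored values drift. It assumes simple symmetric per-tensor quantization; the helper names and the weight values are made up for the example, and real frameworks use more elaborate schemes.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights; a real layer would have thousands or millions of them.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight moves by at most half a quantization step (`scale / 2`), while storage drops from 32 bits to 8 bits per weight. Whether that small perturbation matters for a given model has to be measured on its own evaluation data.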

robust fox Dec 11, 2022, 2:58 PM

#

Hey guys, could you please share some articles/guides with me on quantizing bigger language models in PyTorch?

robust fox Dec 12, 2022, 5:06 PM

#

?

robust fox Dec 13, 2022, 6:39 PM

#

fiery venture I will share some resources for that in some time. Also for quantization model a...

When are you going to share the resources?

fiery venture Dec 13, 2022, 8:50 PM

#

robust fox When are you going to share the resources?

Ahh, I forgot to share (thanks to overtime every day 🥲). Actually, I mostly worked with image models earlier and quantized them with an approach similar to this one:
https://gist.github.com/martinferianc/d6090fffb4c95efed6f1152d5fde079d

If you have created a custom torch model, then you can go through this:
https://pytorch.org/docs/stable/quantization.html

For a language model, if you are using Hugging Face, then I think you can convert it easily:
https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization

Also, I remember I optimized a translation model using this earlier:
https://github.com/OpenNMT/CTranslate2
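For a plain PyTorch model, the lowest-effort starting point is usually dynamic quantization, which needs no calibration data. The sketch below is a minimal example under stated assumptions: `TinyModel` and `state_dict_bytes` are hypothetical stand-ins invented for illustration, and only `nn.Linear` layers are quantized. It is not the approach from any of the links above, and the actual speedup depends on the model and the CPU.

```python
import io

import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a larger model; only the Linear layers matter here."""

    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(256, 512)
        self.fc2 = nn.Linear(512, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))


def state_dict_bytes(model: nn.Module) -> int:
    """Serialize the state dict to memory and return its size in bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


torch.manual_seed(0)
fp32_model = TinyModel().eval()

# Dynamic quantization: weights of the listed module types are stored as int8,
# and activations are quantized on the fly at inference time.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 256)
with torch.no_grad():
    fp32_out = fp32_model(x)
    int8_out = int8_model(x)

fp32_size = state_dict_bytes(fp32_model)
int8_size = state_dict_bytes(int8_model)
max_abs_diff = (fp32_out - int8_out).abs().max().item()
print(fp32_size, int8_size, max_abs_diff)
```

Comparing the outputs and serialized sizes of both models, as done at the end, is a cheap sanity check before measuring latency and summary quality on real inputs.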

clever cobalt Dec 14, 2022, 4:54 AM

#

Very nice. Thanks for sharing.

robust fox Dec 14, 2022, 7:46 AM

#

That's nice, thank you.

robust fox Dec 14, 2022, 11:20 AM

#

But if I convert the model to an ONNX model, then I can no longer use it with the Hugging Face pipeline.

robust fox Dec 14, 2022, 5:38 PM

#

Oh, never mind, I solved that issue.
