OpenAI provides an API service that you can build your AI application on. Usage is billed by the number of tokens processed, so you are effectively paying for inference. You can find more details on the OpenAI Pricing page (https://openai.com/pricing).
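To make the per-token billing concrete, here is a minimal sketch of how a single request's cost could be estimated. The function name and the prices are hypothetical placeholders, not actual OpenAI rates; check the pricing page for the real numbers for your model.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the cost of one request in dollars.

    Prompt (context) tokens and completion tokens are priced separately,
    each quoted here per 1,000 tokens.
    """
    prompt_cost = (prompt_tokens / 1000) * price_in_per_1k
    completion_cost = (completion_tokens / 1000) * price_out_per_1k
    return prompt_cost + completion_cost


# Hypothetical prices for illustration only (NOT real OpenAI rates).
example_cost = estimate_cost(1500, 500, price_in_per_1k=0.001, price_out_per_1k=0.002)
print(f"Estimated cost: ${example_cost:.4f}")
```

Because the prompt side is billed on every request, resending a long context many times can dominate the bill even when little output is produced.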
To manage your API service, you can upgrade to pay-as-you-go by adding a payment method in your API account billing settings (https://platform.openai.com/account/billing/overview).
However, regarding your requirement to keep the AI model loaded in VRAM continuously, note that even if you abort your stream, you are still charged for the context every time. This could get expensive, because each user would occupy your VRAM without producing any useful output.
One potential solution is to use a separate model to predict whether it is worthwhile trying to answer the user's query before actually starting generation. If the predicted user input deviates too far from what the user is actually entering, you can restart the generation.
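As a rough sketch of the "restart if it deviates too far" check, the snippet below compares a predicted query against what the user has typed so far using the standard library's difflib. The function name, the prefix comparison, and the 0.8 threshold are all assumptions made for illustration; the predictor model itself is not shown, and you would need to tune the similarity measure for your application.

```python
from difflib import SequenceMatcher


def should_restart(predicted_query: str, typed_so_far: str,
                   threshold: float = 0.8) -> bool:
    """Return True if the typed text has drifted away from the prediction.

    Only the part of the prediction that the user has had a chance to type
    (a prefix of equal length) is compared. The threshold is an arbitrary
    illustrative value.
    """
    if not typed_so_far:
        return False  # nothing typed yet, nothing to compare
    predicted_prefix = predicted_query[:len(typed_so_far)].lower()
    similarity = SequenceMatcher(None, predicted_prefix, typed_so_far.lower()).ratio()
    return similarity < threshold


# The user is still following the predicted query: keep the running generation.
print(should_restart("how do I reset my password", "how do I res"))
# The user typed something unrelated: restart the generation.
print(should_restart("how do I reset my password", "zzzzzzzz"))
```

Keep in mind that every restart resends the context, so an overly sensitive threshold may cost more than it saves.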
A simpler option is to implement a submit button, so that inference only starts once the user has finished entering their query.
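The submit-button idea could look roughly like the sketch below: keystrokes are only buffered, and the inference call happens once, on submit. The class name, method names, and the `fake_inference` stand-in are hypothetical; in a real application the callback would wrap your actual API or model call.

```python
from typing import Callable, List


class SubmitGatedInput:
    """Buffers user input and triggers inference only on explicit submit."""

    def __init__(self, run_inference: Callable[[str], str]) -> None:
        self._run_inference = run_inference
        self._buffer: List[str] = []

    def on_keystroke(self, text: str) -> None:
        # Typing only updates the buffer; no tokens are spent here.
        self._buffer.append(text)

    def on_submit(self) -> str:
        # Inference runs exactly once, on the finished query.
        query = "".join(self._buffer).strip()
        self._buffer.clear()
        if not query:
            return ""  # ignore empty submissions
        return self._run_inference(query)


calls: List[str] = []


def fake_inference(query: str) -> str:
    """Stand-in for a real model call; records each query it receives."""
    calls.append(query)
    return f"answer to: {query}"


box = SubmitGatedInput(fake_inference)
for ch in "hi there":
    box.on_keystroke(ch)
print(box.on_submit())
```

With this gating, the context is sent once per question rather than once per partial input.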
For deploying your AI application, you might also consider using the Tair Cloud Vector Database (https://www.alibabacloud.com/help/en/tair/latest/getting-started-overview), which allows fast deployment.
Please note that these are general suggestions and the best solution would depend on the specifics of your application and requirements.