#Model Caching for GGUF repos


native shuttle

Hi team, I have a suggestion for the Model Caching feature in Serverless.

  • Currently, it seems to cache the entire Hugging Face repository.
  • This is inefficient for GGUF LLM quantizations (e.g., https://huggingface.co/unsloth/GLM-4.5-Air-GGUF). These repositories often contain multiple full copies of the model (one per quantization), so caching the whole repository wastes significant download time and storage.

Suggestion: Please consider adding an option to specify a subset of files (or a regex) from a Hugging Face repository to cache. This would allow users to download only the specific GGUF quantization they intend to use.
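
For illustration, here is a rough sketch of what "cache only a subset of files" could look like on the client side today. It assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns` (glob-style patterns, not regex). The helper names `select_files` and `download_quant`, the pattern, and the file names in the usage note are hypothetical, and this is not how the Serverless Model Caching feature is implemented.

```python
from fnmatch import fnmatchcase


def select_files(filenames, patterns):
    """Return the filenames that match at least one glob-style pattern.

    Pure helper for previewing which files a pattern would keep.
    """
    return [
        name for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


def download_quant(repo_id, patterns, local_dir):
    """Download only the files matching `patterns` from a Hugging Face repo.

    Assumption: huggingface_hub is installed. It is imported lazily so the
    preview helper above works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns,  # glob-style, e.g. "*Q4_K_M*"
        local_dir=local_dir,
    )
```

Usage would be something like `download_quant("unsloth/GLM-4.5-Air-GGUF", ["*Q4_K_M*"], "/models")`, where the pattern and target directory are placeholders for whichever quantization and path you actually want.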
solemn juniperBOT
thin atlas

The best way to implement this would be to allow a URL to be saved to a specific path.


That way, users can inject individual files.
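
As a rough sketch of the "save a URL to a specific path" idea, using only the Python standard library: the function name `fetch_to_path` and the `.part` temporary suffix are made up for illustration, and nothing here reflects an existing platform feature.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_path(url, dest):
    """Stream the contents of `url` into the file at `dest`.

    Writes to a temporary ".part" file first, then renames it, so a
    partially downloaded file never appears under the final name.
    """
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = dest_path.with_name(dest_path.name + ".part")

    # Copy in chunks so large GGUF files are not held in memory.
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)

    tmp_path.replace(dest_path)
    return dest_path
```

For a single file in a Hugging Face repo, the URL would typically have the form `https://huggingface.co/<repo_id>/resolve/main/<filename>`, with the placeholders filled in for the file you want.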


For now, another way to do this with the KoboldCpp template (load-balanced Serverless) is to use network storage.