#Training Flux Schnell on serverless


lusty quarry
#

Hi there, I am using your pods to run ostris/ai-toolkit to train Flux on custom images. Now I want to use your serverless endpoints instead. Can you help me out? Do you have a template or guide on how to do it?

wintry zodiac
#

@lusty quarry Hii!

I have the dev serverless already! I'll update schnell soon

lusty quarry
wintry zodiac
lusty quarry
#

Ok man, thx

lusty quarry
wintry zodiac
#

{
  "input": {
    "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
    "trigger_word": "geraldine viswanathan",
    "gender": "woman",
    "data_url": "dataset_zip url"
  },
  "s3Config": {
    "accessId": "accessId",
    "accessSecret": "accessSecret",
    "bucketName": "flux-lora",
    "endpointUrl": "https://minio-api.cloud.com"
  }
}
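A request body like the one above could be submitted to a serverless endpoint roughly as follows. This is a minimal sketch, not the worker author's own client: the `ENDPOINT_ID` and `API_KEY` placeholders, the helper names `build_payload` and `build_request`, and the choice of the asynchronous `/run` route of RunPod's endpoint API are assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"


def build_payload(lora_file_name, trigger_word, gender, data_url, s3_config):
    """Assemble the request body in the shape shown in the thread."""
    return {
        "input": {
            "lora_file_name": lora_file_name,
            "trigger_word": trigger_word,
            "gender": gender,
            "data_url": data_url,
        },
        "s3Config": s3_config,
    }


def build_request(payload, endpoint_id=ENDPOINT_ID, api_key=API_KEY):
    """Prepare (but do not send) a POST to the endpoint's async /run route."""
    return urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


payload = build_payload(
    lora_file_name="laksheya-geraldine_viswanathan-FLUX",
    trigger_word="geraldine viswanathan",
    gender="woman",
    data_url="dataset_zip url",
    s3_config={
        "accessId": "accessId",
        "accessSecret": "accessSecret",
        "bucketName": "flux-lora",
        "endpointUrl": "https://minio-api.cloud.com",
    },
)
request = build_request(payload)

# To actually submit the job, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is built separately from sending it so the body can be inspected (and credentials checked) before anything goes over the network.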

#

@lusty quarry

lusty quarry
#

Thanks for sharing I will check it out

lusty quarry
# wintry zodiac

what does this image contain?

FROM navinhariharan/flux-lora:latest

How are you handling the long-running process of training a model?

wintry zodiac
#

Disable this for long-running processes

#

FROM navinhariharan/flux-lora:latest

These contain the flux models dev and schnell

lusty quarry
#

Thank you for the help 🫡

wintry zodiac
#

So the lora is trained and sent to your s3 bucket!

lusty quarry
wintry zodiac
lusty quarry
wintry zodiac
#

open source s3

lusty quarry
#

I will take a look

wintry zodiac
#

Sure! If you have issues let me know! I'll be happy to help!

lusty quarry
#

Do you have any tips to get better results?

#

Or to make it train faster?

wintry zodiac
lusty quarry
#

i was using ai-toolkit

#

what hardware are you using?

wintry zodiac
lusty quarry
wintry zodiac
#

You can deploy this to get started!

wintry zodiac
#

The lora size is small too without loss of quality!

wintry zodiac
lusty quarry
#

With ai-toolkit i am getting about 30-40 min for 1000 steps

wintry zodiac
lusty quarry
#

ok, that makes sense

#

are you doing some kind of image selection/preprocessing?

wintry zodiac
lusty quarry
#

You aren't excluding low-quality ones, resizing, etc.?

wintry zodiac
#

I mix a bit of everything!

lusty quarry
#

What have you put in this image, navinhariharan/flux-lora:latest? I want to customize it. Can you share the source?

wintry zodiac
#

These are auto-downloaded by ai-toolkit! Instead of exporting HF_TOKEN as an environment variable, I downloaded them and built a Docker image

#

That lives here

/huggingface/

lusty quarry
#

I want to store those models in a network volume so they can be shared between serverless instances

lusty quarry
lusty quarry
#

another thing:

def train_lora(job):
    if 's3Config' in job:
        s3_config = job["s3Config"]
        job_input = job["input"]
        job_input = download(job_input)
        if edityaml(job_input) == True:
            if job_input['gender'].lower() in ['woman','female','girl']:
                job = get_job('config/woman.yaml', None)
            elif job_input['gender'].lower() in ['man','male','boy']:
                job = get_job('config/man.yaml', None)
            job.run()

How are you able to run the job? Where does the get_job function come from?

lusty quarry
#

Yes but then you call job.run

wintry zodiac
#

runpod.serverless.start({"handler": train_lora})

This will call the function train_lora with the input json! that is...

job = {
  "input": {
    "lora_file_name": "laksheya-geraldine_viswanathan-FLUX",
    "trigger_word": "geraldine viswanathan",
    "gender": "woman",
    "data_url": "dataset_zip url"
  },
  "s3Config": {
    "accessId": "accessId",
    "accessSecret": "accessSecret",
    "bucketName": "flux-lora",
    "endpointUrl": "https://minio-api.cloud.com"
  }
}
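The flow described here (the RunPod SDK passes the request JSON to the registered handler as a dict) can be sketched as below. This is a simplified illustration under stated assumptions, not the actual worker code: `pick_config` and `main` are hypothetical names, and the download, YAML editing, and training steps from the real handler are replaced by a placeholder return value.

```python
def pick_config(gender):
    """Map the 'gender' input to one of the two config files named in the thread."""
    normalized = gender.lower()
    if normalized in ("woman", "female", "girl"):
        return "config/woman.yaml"
    if normalized in ("man", "male", "boy"):
        return "config/man.yaml"
    raise ValueError(f"unsupported gender: {gender!r}")


def train_lora(job):
    """Handler: called with the whole request JSON as a dict."""
    if "s3Config" not in job:
        return {"error": "s3Config is required"}
    job_input = job["input"]
    config_path = pick_config(job_input["gender"])
    # A real worker would download the dataset, edit the YAML, and run the
    # training job here; this sketch only reports which config it would use.
    return {
        "config": config_path,
        "lora_file_name": job_input["lora_file_name"],
    }


def main():
    # Imported lazily so the functions above can be exercised without the SDK.
    import runpod

    runpod.serverless.start({"handler": train_lora})
```

Calling `main()` from the worker's entry point registers `train_lora`; raising on an unknown `gender` avoids reaching the training step with no config selected.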

wintry zodiac
lusty quarry
#

And where is that function?

#

The train_lora ?

wintry zodiac
#

@lusty quarry Line 31

lusty quarry
#

Sorry man, it was a pretty stupid question. That's what I get for trying to do n things at a time, ahaha

wintry zodiac
lusty quarry
#

Have you managed to successfully use network volumes in serverless?

wintry zodiac
latent steeple
#

Is this due to the container size?

#

And may I know what inference time you are getting for one image on an A100 or any other GPU? For me it's taking 15 seconds.

#

@wintry zodiac

wintry zodiac
#

@latent steeple what is your input?

Please remove any credentials you have and send

#

Looks like an error while downloading dataset

latent steeple
#

I am using flux and sdxl models in this deployment,

Whenever a user sends a Flux LoRA request, I generate with the Flux LoRA

Same applies to sdxl

#

Input is

Lora blob url
Modeltype

#

What should be the container size

wintry zodiac
#

That's all fine!

How are you sending in the training dataset?

#

@latent steeple

latent steeple
#

This system doesn't need datasets. It just uses the models from Hugging Face: it imports the models from Hugging Face, downloads the LoRA, and uses that LoRA for inference

wintry zodiac
latent steeple
#

getting this error when I am using runpod-volume

#

# Use a more specific base image for efficiency
FROM runpod/base:0.6.2-cuda12.2.0

# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=0 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/runpod-volume/huggingface-cache \
    HUGGINGFACE_HUB_CACHE=/runpod-volume/huggingface-cache/hub \
    WORKSPACE=/runpod-volume

RUN ls -a /

# Create necessary directories
RUN mkdir -p ${WORKSPACE}/app ${HF_HOME}

# Copy requirements first to leverage Docker cache for dependencies
COPY requirements.txt ${WORKSPACE}/

# Install dependencies in a single RUN statement to reduce layers
RUN python3.11 -m pip install --no-cache-dir --upgrade pip && \
    python3.11 -m pip install --no-cache-dir -r ${WORKSPACE}/requirements.txt && \
    rm ${WORKSPACE}/requirements.txt

# Copy source code to /runpod-volume/app
COPY test_input.json ${WORKSPACE}/app/
COPY src ${WORKSPACE}/app/src

# Set the working directory
WORKDIR ${WORKSPACE}/app/src

# Use the built-in handler script from the source
CMD ["python3.11", "-u", "runpod_handler.py"]
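One thing worth checking with a Dockerfile like this, assuming the network volume is only attached at `/runpod-volume` when the worker starts: anything copied there during `docker build` lives in the image, not on the volume, and may be hidden once the volume is mounted over that path. The error itself is not shown in the thread, so this is only a possible cause. A minimal sketch of choosing the Hugging Face cache directory at runtime instead is below; `resolve_hf_home`, `configure_cache`, and the fallback path are hypothetical names chosen for illustration.

```python
import os


def resolve_hf_home(volume_root="/runpod-volume", fallback="/tmp/huggingface-cache"):
    """Pick a cache directory, preferring the network volume if it is mounted."""
    if os.path.isdir(volume_root):
        cache_dir = os.path.join(volume_root, "huggingface-cache")
    else:
        # No volume attached (e.g. local testing): use a local directory.
        cache_dir = fallback
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


def configure_cache(**kwargs):
    """Set HF_HOME for this process and return the chosen directory.

    Call this before importing huggingface_hub or diffusers, since those
    libraries read the variable when they are imported.
    """
    cache_dir = resolve_hf_home(**kwargs)
    os.environ["HF_HOME"] = cache_dir
    return cache_dir
```

With this approach the application code would stay in a normal image path such as `/app`, and only the model cache would point at the volume.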

tawny wyvern
#

@latent steeple @wintry zodiac

Did you guys ever get this working, I’m trying to do the same thing with ai-toolkit. Flux dev model.

Any code you can share? There are some things in your Docker image @wintry zodiac I'd love to be able to edit

tawny wyvern
#

thank you!! 😭😭

wintry zodiac
tawny wyvern
#

That’s okay ! I should be able to reverse engineer 🙂

#

Thank you so much!!

wintry zodiac
tawny wyvern
#

Deal sounds good!

wintry zodiac
#

@tawny wyvern Are you free now?

tawny wyvern
#

@wintry zodiac amazing okay thanks!!

I uploaded the contents of the Docker image to a private GitHub repo. Do you want me to share it with you privately?

wintry zodiac
wintry zodiac
wintry zodiac
tawny wyvern
naive drift
#

@tawny wyvern @wintry zodiac I built a Docker image using this repo https://github.com/newideas99/flux-training-docker and successfully trained Lora using Runpod serverless endpoints. However, when I run the trained Lora, I get this error: "Exception: Error while deserializing header: HeaderTooLarge." I am no expert, but the Lora safetensor file might be corrupted, and the reason behind the corruption is the Docker base image "navinhariharan/fluxd-model."

Any help is appreciated.
Best,
Jesse


wintry zodiac
naive drift
#

Thanks for your quick reply. I am using the lora.safetensors file (uploaded to my S3 storage by the runpod-serverless.py handler) on Replicate.

#

@wintry zodiac I have tried to train multiple LoRas, and I got the same errors.

#

I tried to run this LoRA in ComfyUI too, and it gave me the same error

wintry zodiac
#

@naive drift Your request header is too large

naive drift
#

@wintry zodiac what does it mean?

wintry zodiac
naive drift
#

sure

naive drift
naive drift
#

@wintry zodiac It would be a great help if you could provide the Dockerfile of this image as well: navinhariharan/fluxd-model

#

thanks

wintry zodiac
wintry zodiac
wintry zodiac
naive drift
#

Thank you so much navin, I appreciate it. I'll provide you the logs from desktop shortly, thanks again

last mountain
#

How did you download the lora? using what

naive drift
#

2025-06-02 00:55:17.380 | INFO | fp8.lora_loading:restore_base_weights:600 - Unloaded 304 layers
2025-06-02 00:55:17.382 | SUCCESS | fp8.lora_loading:unload_loras:571 - LoRAs unloaded in 0.0042s
free=26730077900800
Downloading weights
downloading weights from https://lora-urls.co/xzy.safetensors
Downloaded weights in 8.33s
2025-06-02 00:55:25.713 | INFO | fp8.lora_loading:convert_lora_weights:502 - Loading LoRA weights for /src/weights-cache/f14ea1f2c70aca45
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/worker.py", line 352, in _predict
    result = predict(**payload)
             ^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 566, in predict
    model.handle_loras(
  File "/src/bfl_predictor.py", line 118, in handle_loras
    load_lora(model, lora_path, lora_scale, self.store_clones)
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 543, in load_lora
    lora_weights = convert_lora_weights(lora_path, has_guidance)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/fp8/lora_loading.py", line 503, in convert_lora_weights
    lora_weights = load_file(lora_path, device="cuda")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/safetensors/torch.py", line 311, in load_file
    with safe_open(filename, framework="pt", device=device) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
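For context on the error above: a safetensors file starts with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by that header. `HeaderTooLarge` is raised when the declared length is implausibly large, which usually suggests the file does not begin with a safetensors header at all (for example an error page, an archive, or another checkpoint format saved under a `.safetensors` name). A minimal sketch for inspecting a file by hand follows; `read_safetensors_header` is a hypothetical helper, not part of the safetensors library.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return the parsed JSON header, or raise ValueError if the layout is off."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            raise ValueError("file is shorter than the 8-byte length prefix")
        # The first 8 bytes are the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > file_size - 8:
            raise ValueError(
                f"declared header length {header_len} exceeds file size "
                f"{file_size}; first bytes: {prefix!r}"
            )
        return json.loads(f.read(header_len))
```

If this raises, the printed first bytes often hint at what the file really is: `b'PK\x03\x04...'` for a zip archive (which is also how `torch.save` checkpoints start) or `b'<!DOCTYP'` for an HTML page.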

#

@wintry zodiac I have pasted logs from the replicate here.

#

@wintry zodiac I think the GitHub repo is related to ComfyUI, not with the "navinhariharan/fluxd-model", which I requested.

#

@last mountain The trained Lora was uploaded via script[worker] to my S3 bucket, and I am loading it via URL into Replicate Inference.

last mountain
#

Oh, it's on Replicate, not RunPod

#

Maybe the safetensors file isn't valid?

#

Either you're downloading the wrong file, the download is corrupting the file, or the file was already corrupted

naive drift
#

@ You mean the trained Lora is corrupted, right?

last mountain
#

Probably, or the download process wrecks the LoRA

#

you can probably check the hash

#

If the downloaded file and the one in your S3 bucket are the same, then it's not because of the download

#

Also try using the LoRA some other way. Who knows, it may be Replicate that cannot load the LoRA

#

it may work somewhere else

#

If it doesn't, then your LoRA is probably corrupted
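The hash check suggested above can be sketched with the standard library: compute a SHA-256 digest of the file as stored in S3 (fetched by any trusted route) and of the copy the inference side downloaded, then compare them. The function names here are illustrative only.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large LoRA files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Matching digests rule out the download step, which points back at how the file was produced or uploaded.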

naive drift
#

@last mountain thanks, I will check this out.

#

@last mountain I have verified and found that the downloading process doesn't make any difference to the file.

#

Hashes match, so my Docker image is the culprit

last mountain
#

Does it work somewhere else?

naive drift
#

No, it's not working anywhere I tried. Both Replicate and ComfyUI gave me the same error.

#

I used this repo and tweaked it a bit for my use case. I think the issue lies in the base image "navinhariharan/fluxd-model", since the layer added on top doesn't hold anything related to the training process itself.

https://github.com/newideas99/flux-training-docker

#

I also tried to build an image from scratch, but that didn't work. 😥

last mountain
#

Maybe it's the LoRA model

naive drift
last mountain
#

You're welcome, glad you found it!